34

The problem refers to building decision trees. According to Wikipedia, 'Gini coefficient' should not be confused with 'Gini impurity'. However, both measures can be used when building a decision tree: they can support our choices when splitting the set of items.

1) 'Gini impurity' - a standard decision-tree splitting metric (see the link above);

2) 'Gini coefficient' - each split can be assessed using the AUC criterion. For each splitting scenario we can build a ROC curve and compute the AUC metric. According to Wikipedia, AUC = (GiniCoeff + 1)/2;

The question is: are these two measures equivalent? On the one hand, I am told that the Gini coefficient should not be confused with Gini impurity. On the other hand, both measures can be used for the same thing: assessing the quality of a decision tree split.

Damien
  • I came to this question looking for a definition: https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity – Martin Thoma Oct 19 '17 at 11:30

6 Answers

34

No, despite their names they are not equivalent or even that similar.

  • Gini impurity is a measure of misclassification, which applies in a multiclass classifier context.
  • Gini coefficient applies to binary classification and requires a classifier that can in some way rank examples according to the likelihood of being in a positive class.

Both could be applied in some cases, but they are different measures for different things. Impurity is what is commonly used in decision trees.
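To make the distinction concrete, here is a minimal Python sketch, not taken from the answer itself. It assumes the usual definitions: Gini impurity as 1 minus the sum of squared class proportions, and the classifier Gini coefficient as 2 x AUC - 1, with AUC computed by pairwise comparison. The labels and scores are made-up illustration data.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2); defined for any number of classes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def gini_coefficient(y_true, scores):
    """Gini coefficient of a binary ranking, via Gini = 2 * AUC - 1."""
    return 2.0 * auc(y_true, scores) - 1.0


y_true = [0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.8, 0.9]  # hypothetical scores ranking every positive on top
bad_scores = [0.9, 0.8, 0.2, 0.1]   # hypothetical scores with the ranking reversed

print(gini_impurity(y_true))                  # depends only on the labels
print(gini_coefficient(y_true, good_scores))  # depends on how the scores rank the labels
print(gini_coefficient(y_true, bad_scores))
```

The impurity is identical for both score sets because it never looks at the scores, while the coefficient swings from one extreme to the other. That is one way to see that the two quantities measure different things.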

Sean Owen
  • "Gini coefficient applies to binary classification"? According to the links in the question, the Gini coefficient is a measure of diversity used in economics. Could you add some references to your answer? – Marsellus Wallace Dec 16 '20 at 00:01
  • It is used that way, but also as a metric for binary classifiers. Later on the wiki, for example, it mentions its relation to AUC: https://en.wikipedia.org/wiki/Gini_coefficient#Relation_to_other_statistical_measures – Sean Owen Dec 16 '20 at 18:48
6

I took example data with two people, A and B, with wealth of 1 unit and 3 units respectively. Gini impurity as per Wikipedia = 1 - [ (1/4)^2 + (3/4)^2 ] = 3/8

Gini coefficient as per Wikipedia would be the ratio of the area between the red and blue lines to the total area under the blue line in the following graph

[Image: graph with a red line and a blue line for the two-person example]

Area under red line is 1/2 + 1 + 3/2 = 3

Total area under blue line = 4

The area between the red and blue lines is 4 - 3 = 1, so Gini coefficient = 1/4

Clearly the two numbers are different. I will check more cases to see if they are proportional or there is an exact relationship and edit the answer.

Edit: I checked other combinations as well, and the ratio is not constant. Below is a list of a few combinations I tried.

[Image: list of the combinations tried]
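The check described here can be reproduced with a short Python sketch. It assumes the mean-absolute-difference form of the economic Gini coefficient; that normalization is an assumption and need not match an area computed by hand from a graph. It also treats wealth shares as the "class proportions" for the impurity, as the answer does. The wealth combinations are arbitrary examples, not the ones from the lost image.

```python
def gini_impurity_from_shares(wealth):
    """1 - sum(share_i^2), treating each person's share of total wealth as a proportion."""
    total = sum(wealth)
    return 1.0 - sum((w / total) ** 2 for w in wealth)


def gini_coefficient(wealth):
    """Mean absolute difference over all ordered pairs, divided by twice the mean."""
    n = len(wealth)
    mean = sum(wealth) / n
    mean_abs_diff = sum(abs(a - b) for a in wealth for b in wealth) / (n * n)
    return mean_abs_diff / (2.0 * mean)


# Arbitrary two-person combinations (illustrative only).
for wealth in [(1, 3), (1, 9), (2, 3), (1, 1)]:
    impurity = gini_impurity_from_shares(wealth)
    coefficient = gini_coefficient(wealth)
    # The ratio is undefined when the coefficient is zero (perfect equality).
    ratio = impurity / coefficient if coefficient > 0 else float("nan")
    print(wealth, round(impurity, 4), round(coefficient, 4), round(ratio, 4))
```

If the two measures were proportional, the last column would be the same for every row. Under these definitions it is not.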

Gaurav Singhal
1

I believe they represent the same thing essentially, as the so-called:

"Gini Coefficient", mainly used in Economics, measures the inequality of a numerical variable, such as income, which we can treat as a regression problem -- getting the "mean" of each group.

"Gini impurity" mainly used in Decision Tree learning, measures the impurity of a categorical variable, such as colour, sex, etc. which is a classification problem -- getting the "majority" of each group.

Sounds similar, right? "Inequality" and "impurity" are both measures of variation, which are intuitively the same concept. The difference is that "inequality" is for numerical variables and "impurity" is for categorical variables. And both of them can be named "Gini Index".


Light, R. J., & Margolin, B. H. (1971), An analysis of variance for categorical data, says that because the "mean" is undefined for categorical data, Gini extended the "Gini Index" from numerical to categorical data by using pairwise differences instead of deviations from the mean. TL;DR: this gives the variation for categorical responses:

$$\frac{1}{2n}\Big[\sum_{i\neq j}n_in_j\Big] = \frac{n}{2} - \frac{1}{2n}\sum^I_{i=1}n_i^2$$

where $n_i$ is the number of responses in the $i$th category, $i = 1, \cdots, I$, and $n = \sum_i n_i$. This is almost the same as today's "Gini Impurity", just multiplied by $\frac{n}{2}$:

$$1 - \sum^{I}_{i=1} {p_i}^{2}$$
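The identity above follows in two steps, assuming $n = \sum_i n_i$ and $p_i = n_i / n$ (notation filled in here, not quoted from the paper). First, the sum over pairs of distinct categories is the square of the total minus the diagonal terms:

$$\sum_{i\neq j} n_i n_j = \Big(\sum_{i=1}^{I} n_i\Big)^2 - \sum_{i=1}^{I} n_i^2 = n^2 - \sum_{i=1}^{I} n_i^2$$

Dividing by $2n$ and factoring out $\frac{n}{2}$ then gives

$$\frac{1}{2n}\sum_{i\neq j} n_i n_j = \frac{n}{2} - \frac{1}{2n}\sum_{i=1}^{I} n_i^2 = \frac{n}{2}\Big(1 - \sum_{i=1}^{I} p_i^2\Big)$$

so the categorical variation is $\frac{n}{2}$ times the Gini impurity.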


By the way, you said ROC can be used (method 2) to choose the split point when growing a decision tree. I don't follow that; could you elaborate?

PS: I agree with Pasmod Turing's answer that Wikipedia can be modified by anyone, and the "Gini Impurity" entry seems like an incomplete item in the wiki.

I also saw the dispute in the comments under his answer. I must say that machine learning originated from statistics, and statistics is the fundamental analysis tool for scientific research; thus many concepts are the same thing in statistics even though they have different names in different professional areas. The Gini index certainly shares the same name in decision trees and economics.

Jokerkeny
  • Remember, the context here is decision trees. "Gini index" as used in economics (though this was not the question) is most analogous to "Gini coefficient" as used in machine learning, because it depends on pairwise comparisons. AUC may be interpreted as the probability a positive instance is deemed more likely to be positive than a negative instance, and Gini coefficient = 2 x AUC - 1. But Gini impurity is something else entirely, akin to an entropy measure. I wouldn't put stock in an answer that offers no references but asks you distrust others. – Sean Owen Mar 01 '20 at 01:35
0

I think they both represent the same concept.

In classification trees, the Gini Index is used to compute the impurity of a data partition. So assume the data partition D consists of 4 classes, each with equal probability. Then the Gini Index (Gini Impurity) will be: $Gini(D) = 1 - (0.25^2 + 0.25^2 + 0.25^2 + 0.25^2) = 0.75$

In CART we perform binary splits. So the Gini index of a split is computed as the weighted sum of the Gini indices of the resulting partitions, and we select the split with the smallest value.

So the use of Gini Impurity (Gini Index) is not limited to binary situations.
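The split-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not CART itself: the candidate splits are hand-written lists rather than thresholds searched over features, and the class labels are invented for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of one partition."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted sum of the impurities of the two partitions of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


# Partition D with 4 equally likely classes, as in the text.
parent = ["a", "b", "c", "d"] * 2

# Two hypothetical binary splits of the same 8 items.
candidate_splits = {
    "separates {a,b} from {c,d}": (["a", "a", "b", "b"], ["c", "c", "d", "d"]),
    "mixes all classes": (["a", "b", "c", "d"], ["a", "b", "c", "d"]),
}

print(gini_impurity(parent))
for name, (left, right) in candidate_splits.items():
    print(name, weighted_gini(left, right))

# Choose the split with the smallest weighted Gini index.
best = min(candidate_splits, key=lambda name: weighted_gini(*candidate_splits[name]))
print("best:", best)
```

The split that mixes all classes leaves the weighted index equal to the parent's impurity, so it gains nothing. The split that isolates two classes on each side lowers it.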

Another term for Gini Impurity is Gini Coefficient which is used normally as a measure of income distribution.

Ethan
Pasmod Turing
  • 4
    Gini coefficient is not Gini impurity. See the links in the question – Sean Owen Sep 10 '14 at 19:15
  • 2
    Wikipedia is not always a reliable source of information :-) – Pasmod Turing Sep 11 '14 at 13:40
  • 3
    Sure. Go look it up somewhere else: http://mathworld.wolfram.com/GiniCoefficient.html What makes you think Gini coefficient = Gini impurity? – Sean Owen Sep 11 '14 at 14:03
  • Look it up: http://books.google.de/books?id=DQXhYAgXRpEC&pg=PA373&dq=gini+coefficient+classification&hl=de&sa=X&ei=Q7sRVOKDOcvAPLDugbAP&ved=0CCkQ6AEwAQ#v=onepage&q=gini%20coefficient%20classification&f=false – Pasmod Turing Sep 11 '14 at 15:10
  • By the way: I have never seen a publication that cites mathworld.wolfram.com! – Pasmod Turing Sep 11 '14 at 15:12
  • 1
    I'm sure you can find people that use coefficient and impurity interchangeably within the field of ML. Gini coefficient is not misclassification error, outside of ML. It has a meaning, whose interpretation when brought back into ML is something else. If the question is, are they the same thing, then I can't see how the answer is "yes". Search for "gini coefficient" and tell me those are all about misclassification error? – Sean Owen Sep 11 '14 at 16:01
  • 1
    I think we are talking about decision trees. So we are in the field of machine learning! Please read the question more carefully – Pasmod Turing Sep 19 '14 at 12:38
  • 1
    No need to guess. Did you click the links? Do you see that the gini coefficient in question is not the thing you keep talking about? – Sean Owen Sep 20 '14 at 13:21
  • It was not a guess ;-) I am pretty sure about it. – Pasmod Turing Sep 26 '14 at 08:54
  • 1
    this is not constructive at this point. I am happy to let anyone read the question and links, my answer, and your comments, and decide what the word means. – Sean Owen Sep 26 '14 at 08:57
0

Gini impurity is a special instance of Gini coefficient:

This is Gini coefficient's definition in Wikipedia:

In economics, the Gini coefficient (/ˈdʒiːni/ JEE-nee), also known as the Gini index or Gini ratio, is a measure of statistical dispersion intended to represent the income inequality or the wealth inequality within a nation or a social group.

In other words, it measures the inequality of the wealth of each person in a nation, with the constraint that the sum of their wealth is a constant.

Now replace the following words in that sentence:

person -> category
wealth -> probability
nation -> probability distribution
constant -> 1

The above sentence becomes: it measures the inequality of the probability of each category in a probability distribution, with the constraint that the sum of their probabilities is 1.

That's exactly the definition of Gini impurity!

0

A Gini index of 1 would represent wealth concentrated in a single person. However, the Gini impurity in this case would be 0. So they should move in opposite directions, right?
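A quick numerical check of this observation, as a hedged Python sketch: it assumes the mean-absolute-difference form of the Gini index and applies the impurity formula to wealth shares. With a finite population of n people the index tops out at (n - 1)/n rather than exactly 1, so four people are used only for illustration.

```python
def gini_index(wealth):
    """Economic Gini index: mean absolute difference divided by twice the mean."""
    n = len(wealth)
    mean = sum(wealth) / n
    mean_abs_diff = sum(abs(a - b) for a in wealth for b in wealth) / (n * n)
    return mean_abs_diff / (2.0 * mean)


def gini_impurity_from_shares(wealth):
    """Gini impurity 1 - sum(share_i^2), using each person's share of total wealth."""
    total = sum(wealth)
    return 1.0 - sum((w / total) ** 2 for w in wealth)


equal = [1, 1, 1, 1]         # everyone holds the same wealth
concentrated = [0, 0, 0, 1]  # one person holds everything

for name, wealth in [("equal", equal), ("concentrated", concentrated)]:
    print(name, gini_index(wealth), gini_impurity_from_shares(wealth))
```

Under these assumptions the index is lowest and the impurity highest for the equal distribution, and the reverse holds for the concentrated one, which matches the "opposite directions" remark.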
