Gini Index in Regression Decision Tree

Question

I want to implement my own version of the CART Decision Tree from scratch (to learn how it works), but I am having some trouble with the Gini Index, which is used to express the purity of a dataset.

More precisely, I don't understand how the Gini Index is supposed to work in the case of a regression tree.

The few descriptions I could find define it as:

gini_index = 1 - sum_for_each_class(probability_of_the_class²)

Where probability_of_the_class is just the number of elements from a class divided by the total number of elements.
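As a minimal sketch of that definition in Python (the function name gini_index and the treatment of an empty node are assumptions made for illustration, not taken from any library):

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        # Assumed convention: an empty node is treated as perfectly pure.
        return 0.0
    counts = Counter(labels)
    # p_k = (number of elements of class k) / (total number of elements)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())
```

A node containing a single class gives 0 (pure), and a node split evenly between two classes gives 0.5.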

But I can't use this definition in the case of regression where I have continuous variables.

Is there something I misunderstood here?

score 2 · Accepted Answer · answered Jul 18 '18 at 14:48


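In regression trees, the split criterion is the sum of squared errors (SSE) rather than the Gini index. Each candidate split (a feature/predictor and a value of it in your training set) divides the data into two groups, and the SSE of each group is measured around that group's mean target value. The first split is the candidate that yields the lowest total SSE, and the same procedure is then repeated for each further split.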
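A rough sketch of that search for a single numeric feature, in Python. The names sse and best_split are hypothetical, and using midpoints between consecutive distinct values as candidate thresholds is an assumption of this sketch, not something stated above:

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold on one numeric feature x that minimises total SSE of y.

    Returns (threshold, total_sse) for the split 'x <= threshold'.
    Returns (None, inf) if x has fewer than two distinct values.
    """
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values (assumed).
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        score = sse(left) + sse(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_split(x, y)
print(threshold, score)
```

For a full tree, the same search would be run over every feature, the best (feature, threshold) pair kept, and the procedure repeated recursively on each of the two resulting groups.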


Srikrishna


So if I have a dataset that contains both continuous and categorical data, should I use a different split criterion depending on the type of the variable being tested? – Nakeuh Jul 18 '18 at 15:03

The criterion depends on the target variable, not on the predictors – Srikrishna Jul 18 '18 at 15:09
