
I want to implement my own version of the CART decision tree from scratch (to learn how it works), but I am having trouble with the Gini index, which is used to express the purity of a dataset.

More precisely, I don't understand how the Gini index is supposed to work in the case of a regression tree.

The few descriptions I could find define it as:

gini_index = 1 - sum_for_each_class(probability_of_the_class²)

where probability_of_the_class is simply the number of elements belonging to a class divided by the total number of elements.
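
To make the definition concrete, here is a minimal sketch of it in Python. The function name `gini_index` and the choice to return 0.0 for an empty list are my own assumptions for illustration, not taken from any library.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a list of class labels: 1 - sum(p_class ** 2)."""
    n = len(labels)
    if n == 0:
        # Assumption: treat an empty node as perfectly pure.
        return 0.0
    counts = Counter(labels)  # number of elements per class
    return 1.0 - sum((count / n) ** 2 for count in counts.values())
```

For example, `gini_index(["a", "a", "b", "b"])` gives 0.5 (two equally likely classes), and a list with a single class gives 0.0.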

But I can't use this definition in the case of regression where I have continuous variables.

Is there something I have misunderstood here?

Nakeuh

1 Answer


In regression trees, the sum of squared errors (SSE) is the split criterion, not the Gini index. The first split is the feature/predictor and split value in your training set that yield the lowest SSE, where the SSE of each resulting node is measured around the mean target value of that node. Further splits are then chosen the same way within each node.
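
As a rough illustration of the SSE criterion, the sketch below searches for the best threshold on a single numeric feature. The helper names `sse` and `best_split`, and the use of midpoints between sorted feature values as candidate thresholds, are assumptions made for this example; a full CART implementation would repeat this over every feature and recurse into the child nodes.

```python
def sse(values):
    """Sum of squared errors of the values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the lowest-SSE split of one feature.

    x: feature values, y: continuous targets (same length).
    Candidate thresholds are midpoints between consecutive distinct x values
    (an assumption of this sketch).
    """
    best_threshold, best_total = None, float("inf")
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)  # SSE summed over both child nodes
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total


x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
threshold, total = best_split(x, y)
```

With this toy data the chosen threshold falls between 3 and 10, separating the low targets from the high ones.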

Srikrishna
  • So if my dataset contains both continuous and categorical data, should I use a different split criterion depending on the type of the variable being tested? – Nakeuh Jul 18 '18 at 15:03
  • The criterion depends on the target variable, not the predictors. – Srikrishna Jul 18 '18 at 15:09