I'm trying to fully understand the decision process of a decision tree classification model built with sklearn. The two main aspects I'm looking at are a graphviz representation of the tree and the list of feature importances. What I don't understand is how feature importance is determined in the context of the tree. For example, here is my list of feature importances:
Feature ranking:

1. FeatureA (0.300237)
2. FeatureB (0.166800)
3. FeatureC (0.092472)
4. FeatureD (0.075009)
5. FeatureE (0.068310)
6. FeatureF (0.067118)
7. FeatureG (0.066510)
8. FeatureH (0.043502)
9. FeatureI (0.040281)
10. FeatureJ (0.039006)
11. FeatureK (0.032618)
12. FeatureL (0.008136)
13. FeatureM (0.000000)
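For context, here is roughly how I'm producing both views. This is a trimmed-down sketch: the `make_classification` data and the `FeatureA`–`FeatureM` names are placeholders standing in for my real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Placeholder data; my real X/y come from elsewhere.
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
feature_names = [f"Feature{c}" for c in "ABCDEFGHIJKLM"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The ranking above comes from feature_importances_, sorted descending.
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]
print("Feature ranking:")
for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {feature_names[idx]} ({importances[idx]:.6f})")

# The tree picture comes from export_graphviz (rendered with dot).
export_graphviz(clf, out_file="tree.dot", feature_names=feature_names)
```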
However, the top of the tree (in the graphviz output, not shown here) looks nothing like that ranking.
In fact, some of the features ranked "most important" don't appear until much further down the tree, and the root node splits on FeatureJ, which is one of the lowest-ranked features. My naive assumption was that the most important features would be used near the top of the tree, since that's where they affect the most samples. If that's incorrect, then what is it that makes a feature "important"?
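For reference, here's how I've been walking the fitted tree to see which node splits on which feature (reusing `clf` and `feature_names` from the sketch above):

```python
# Walk the fitted tree's internal arrays to see, for each split node,
# which feature it uses, how many samples reach it, and its impurity.
tree = clf.tree_
for node in range(tree.node_count):
    if tree.children_left[node] != -1:  # -1 marks a leaf in sklearn's Tree
        print(
            f"node {node}: splits on {feature_names[tree.feature[node]]}, "
            f"samples={tree.n_node_samples[node]}, "
            f"impurity={tree.impurity[node]:.4f}"
        )
```

Even with this dump, I can't see how the per-node numbers turn into a single importance score per feature.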