I'm trying to fully understand the decision process of a decision tree classification model built with sklearn. The two main aspects I'm looking at are a graphviz representation of the tree and the list of feature importances. What I don't understand is how feature importance is determined in the context of the tree. For example, here is my list of feature importances:
Feature ranking:

1. FeatureA (0.300237)
2. FeatureB (0.166800)
3. FeatureC (0.092472)
4. FeatureD (0.075009)
5. FeatureE (0.068310)
6. FeatureF (0.067118)
7. FeatureG (0.066510)
8. FeatureH (0.043502)
9. FeatureI (0.040281)
10. FeatureJ (0.039006)
11. FeatureK (0.032618)
12. FeatureL (0.008136)
13. FeatureM (0.000000)
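For context, here is roughly how I'm producing both views. This is a trimmed-down sketch: the `make_classification` data and the `FeatureA`–`FeatureM` names are placeholders standing in for my real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Placeholder data; my real X/y come from elsewhere.
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
feature_names = [f"Feature{c}" for c in "ABCDEFGHIJKLM"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The ranking above comes from feature_importances_, sorted descending.
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]
print("Feature ranking:")
for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {feature_names[idx]} ({importances[idx]:.6f})")

# The tree picture comes from export_graphviz (rendered with dot).
export_graphviz(clf, out_file="tree.dot", feature_names=feature_names)
```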
However, the top of the tree (in the graphviz output, not shown here) looks nothing like that ranking.
In fact, some of the features ranked "most important" don't appear until much further down the tree, and the root node splits on FeatureJ, which is one of the lowest-ranked features. My naive assumption was that the most important features would be used near the top of the tree, since that's where they affect the most samples. If that's incorrect, then what is it that makes a feature "important"?
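For reference, here's how I've been walking the fitted tree to see which node splits on which feature (reusing `clf` and `feature_names` from the sketch above):

```python
# Walk the fitted tree's internal arrays to see, for each split node,
# which feature it uses, how many samples reach it, and its impurity.
tree = clf.tree_
for node in range(tree.node_count):
    if tree.children_left[node] != -1:  # -1 marks a leaf in sklearn's Tree
        print(
            f"node {node}: splits on {feature_names[tree.feature[node]]}, "
            f"samples={tree.n_node_samples[node]}, "
            f"impurity={tree.impurity[node]:.4f}"
        )
```

Even with this dump, I can't see how the per-node numbers turn into a single importance score per feature.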