
I am reading the articles linked below, which advocate numerical (integer) encoding rather than one-hot encoding for better interpretability of the feature-importance output of ensemble models. This goes against everything I have learnt: won't Python treat nominal features (like city or car make/model) as ordinal if I encode them as integers? (A small sketch of the two encodings follows the links.)

https://krbnite.github.io/The-Quest-for-Blackbox-Interpretability-Take-1/

https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
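
For concreteness, here is a minimal sketch of the two encodings under discussion, using scikit-learn. The "city" column and its values are invented for illustration, and the sparse_output argument assumes scikit-learn 1.2 or newer (older versions use sparse= instead):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    # Toy nominal feature; the values are made up for illustration.
    X = pd.DataFrame({"city": ["Paris", "Tokyo", "Lima", "Tokyo"]})

    # Numerical (integer) encoding: a single column of arbitrary integer codes.
    # Categories are sorted alphabetically, so Lima=0, Paris=1, Tokyo=2.
    X_ordinal = OrdinalEncoder().fit_transform(X)   # [[1.], [2.], [0.], [2.]]

    # One-hot encoding: one binary column per category.
    X_onehot = OneHotEncoder(sparse_output=False).fit_transform(X)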

  • I only had a quick look, but I have serious doubts about the evidence provided in the second link. The experiment is way too simplistic: apparently they don't even consider different possible distributions of the categorical variable, only its cardinality and the distribution of the target. – Erwan Jan 05 '22 at 01:01
  • You only specified ensemble models, but that seems irrelevant. Are you asking about tree models (and esp. ensembles of those) here? – Ben Reiniger Jan 05 '22 at 21:40
  • I've seen the second article before and agree with Erwan, although I also think it provides some useful information. The first article is better, but it still swings a little too far into the anti-one-hot camp; the reality is probably quite nuanced. As for their argument about Gini importance specifically, I think they've missed an opportunity to aggregate the importances of the dummy variables into an importance of the original feature (since the importance here is a sum of impurity decreases anyway); does that produce non-intuitive feature importances? A sketch of that aggregation follows these comments. – Ben Reiniger Jan 05 '22 at 22:54
  • See also https://datascience.stackexchange.com/q/77880/55122 and the linked questions from there. – Ben Reiniger Jan 05 '22 at 22:54
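
A hedged sketch of the aggregation idea raised in the comments above: since Gini importance is a sum of impurity decreases, the importances of a one-hot feature's dummy columns can simply be summed back into a single importance for the original categorical. The data, column names, and the "city_" prefix convention are invented for illustration:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Invented toy data: one nominal feature, one numeric feature, a binary target.
    df = pd.DataFrame({
        "city": ["Paris", "Tokyo", "Lima", "Tokyo", "Paris", "Lima"] * 20,
        "mileage": range(120),
        "target": [0, 1, 0, 1, 1, 0] * 20,
    })

    # One-hot encode the nominal feature; dummies are named "city_<category>".
    X = pd.get_dummies(df[["city", "mileage"]], columns=["city"])
    model = RandomForestClassifier(random_state=0).fit(X, df["target"])

    # Per-column Gini importances from the fitted forest.
    imp = pd.Series(model.feature_importances_, index=X.columns)

    # Sum the dummy columns back into one importance for "city";
    # non-dummy columns keep their own importance.
    grouped = imp.groupby(
        lambda col: "city" if col.startswith("city_") else col
    ).sum()
    print(grouped)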

0 Answers