
I am having a look at this material and I have found the following statement:

For this class of models [Gradient Boosting Machine algorithms] [...] it is both safe and significantly more computationally efficient [to] use an arbitrary integer encoding [also known as Numeric Encoding] for the categorical variable even if the ordering is arbitrary [instead of One-Hot encoding].

Do you know of any references that support this statement? I get that Numeric Encoding is more computationally efficient than One-Hot Encoding, but I would like to know more about their supposed equivalence for encoding unordered categorical variables in Gradient Boosting Methods.

Thanks!

carlo_sguera

1 Answer


This is actually a feature of tree-based models in general, not just gradient boosting trees.

Not exactly a reference, but this Medium article explains why ordinal encoding is often more efficient.

On the topic of safety, I think the author should have said that ordinal encoding is safer for tree-based methods than it is for linear methods, but still not perfectly safe. Decision-tree methods can find spurious rules in an arbitrary ordinal encoding, but they don't make the strong assumptions about numeric semantics that linear methods do.
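To make that contrast concrete, here is a minimal sketch; the category means and the two integer codings are invented for illustration, not taken from any reference. A tree recovers the per-category means under either arbitrary coding, while the linear model's fit depends on which coding it happens to receive:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    categories = np.array(["spam", "ham", "eggs"])
    means = {"spam": 0.0, "ham": 5.0, "eggs": 1.0}  # no monotone order

    cats = rng.choice(categories, size=300)
    y = np.array([means[c] for c in cats]) + rng.normal(0, 0.1, size=300)

    # two arbitrary integer codings of the same categorical variable
    for coding in ({"spam": 1, "ham": 2, "eggs": 3},
                   {"spam": 3, "ham": 1, "eggs": 2}):
        X = np.array([coding[c] for c in cats]).reshape(-1, 1)
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
        lin = LinearRegression().fit(X, y)
        print(coding)
        print("  tree R^2:  ", round(tree.score(X, y), 3))  # ~1.0 for both codings
        print("  linear R^2:", round(lin.score(X, y), 3))   # swings with the coding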

. . . I would like to know more about their supposed equivalence for encoding unordered categorical variables . . .

Any rule derived with one-hot encoding can also be represented with ordinal encoding; it just might take more splits.

To illustrate, suppose you have a categorical variable foo with possible values spam, ham, and eggs. A one-hot encoding would create three dummy variables: is_spam, is_ham, and is_eggs. Let's say an arbitrary ordinal encoding assigns spam = 1, ham = 2, and eggs = 3.

Suppose the OHE decision tree splits on is_eggs = 1. This can be represented in the ordinal decision tree by the split foo > 2. Suppose the OHE tree splits on is_ham = 1. The ordinal tree will require two splits: foo > 1, then foo < 3.
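Here is a quick sketch that checks this equivalence directly (the column names are hypothetical, carried over from the example above); each assert confirms that the one-hot split and the corresponding ordinal split(s) select exactly the same rows:

    import numpy as np

    foo = np.array(["spam", "ham", "eggs", "ham", "spam", "eggs"])

    # one-hot encoding: one dummy variable per category
    is_ham = (foo == "ham").astype(int)
    is_eggs = (foo == "eggs").astype(int)

    # arbitrary ordinal encoding: spam = 1, ham = 2, eggs = 3
    codes = {"spam": 1, "ham": 2, "eggs": 3}
    foo_ord = np.array([codes[c] for c in foo])

    # one OHE split == one ordinal split
    assert np.array_equal(is_eggs == 1, foo_ord > 2)

    # this OHE split needs two ordinal splits (stacked at successive tree levels)
    assert np.array_equal(is_ham == 1, (foo_ord > 1) & (foo_ord < 3))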

zachdj
  • Thanks for your answer! I will digest this new information and I will come back ;-) – carlo_sguera Jul 22 '20 at 07:35
  • Starting from the link @zachdj posted, after a couple of steps I found this article: https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/. It is more focused on one-hot encoding and random forests, but for the moment I am quite satisfied, and I draw the following conclusion: one-hot encoding may behave so poorly in tree-based methods that label/numeric encoding may outperform it. I leave the question open just in case we receive more insights. Thanks again! – carlo_sguera Jul 22 '20 at 16:14
  • For those who view this question but don't want to read the articles, here's the upshot: One-hot encoding introduces a lot of sparse binary variables; this hurts runtime performance because the decision tree has more variables to consider when splitting. It can also hurt accuracy because you're less likely to get a high-purity split early in the tree with OHE vs. ordinal encoding. (A rough sketch of the runtime point follows below.) – zachdj Jul 22 '20 at 16:58
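To illustrate the runtime half of that comment, here is a rough, assumption-laden sketch: the data are synthetic, the sizes are arbitrary, and the exact timings will vary by machine and scikit-learn version (OneHotEncoder's sparse_output argument assumes scikit-learn >= 1.2). The point is only that OHE multiplies the number of candidate features the trees must scan at every split:

    import time
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    rng = np.random.default_rng(0)
    # 5 categorical columns with 50 levels each, stored as strings
    X_raw = rng.integers(0, 50, size=(2000, 5)).astype(str)
    y = rng.integers(0, 2, size=2000)

    for name, enc in [("ordinal", OrdinalEncoder()),
                      ("one-hot", OneHotEncoder(sparse_output=False))]:
        X = enc.fit_transform(X_raw)  # ordinal: 5 features; one-hot: ~250
        start = time.perf_counter()
        GradientBoostingClassifier(n_estimators=20).fit(X, y)
        elapsed = time.perf_counter() - start
        print(f"{name}: {X.shape[1]} features, {elapsed:.2f}s to fit")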