I have two questions about one-hot feature encoding:
(1) Is it considered best practice to drop the first (or at least one) column when one-hot encoding, as you would when creating dummy variables for linear regression modelling in classical statistics? ML practitioners seem to do this both ways; does any definitive guidance exist?
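To make the two conventions concrete, here is a minimal sketch using pandas (the `color` column and its values are made up for illustration):

```python
import pandas as pd

# Toy categorical column with three levels
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(df["color"])

# "Dummy variable" style: drop the first category, which becomes the
# implicit baseline (as in classical regression modelling)
dropped = pd.get_dummies(df["color"], drop_first=True)

print(list(full.columns))     # ['blue', 'green', 'red']
print(list(dropped.columns))  # ['green', 'red'] -- 'blue' is the baseline
```

The dropped column is recoverable from the others (all zeros means the baseline category), which is exactly the linear dependence that motivates dropping it for unregularized linear models.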
(2) What is the best way to handle one-hot encoding of a categorical variable with thousands of levels (around 6,000)? This cardinality is particularly high given that the dataset is only about 10 features wide before one-hot encoding. Note that the observations are fairly evenly distributed across the categories.
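For context on the memory concern, here is a sketch of what I have tried: scikit-learn's `OneHotEncoder` returns a scipy sparse matrix by default, so storage grows with the number of non-zero entries rather than rows × categories. (The synthetic column below is a stand-in; the real data has ~6,000 categories.)

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality column (500 categories here to keep the
# sketch fast; the real case is ~6000)
rng = np.random.default_rng(0)
col = rng.integers(0, 500, size=10_000).astype(str).reshape(-1, 1)

# The default output is a scipy sparse matrix, so only the single 1 per
# row is actually stored; handle_unknown="ignore" zero-fills unseen
# categories at transform time
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(col)

print(X.shape)     # (10000, number_of_unique_categories)
print(X.getnnz())  # 10000 -- exactly one non-zero per row
```

Sparse output keeps the encoding tractable, but I am unsure whether a 6,000-column feature space is sensible downstream, versus alternatives like feature hashing or target encoding.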