I have two questions about one-hot feature encoding:
(1) Is it considered best practice to drop the first (or at least one) column when one-hot encoding, as you would when creating dummy variables for linear regression modelling in classical statistics? ML practitioners seem to do this both ways; does any definitive guidance exist?
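To make the two conventions concrete, here is a minimal sketch using pandas (the `color` column and its values are made up for illustration):

```python
import pandas as pd

# Toy categorical column with three levels
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(df["color"])

# "Dummy variable" style: drop the first category, which becomes the
# implicit baseline (as in classical regression modelling)
dropped = pd.get_dummies(df["color"], drop_first=True)

print(list(full.columns))     # ['blue', 'green', 'red']
print(list(dropped.columns))  # ['green', 'red'] -- 'blue' is the baseline
```

The dropped column is recoverable from the others (all zeros means the baseline category), which is exactly the linear dependence that motivates dropping it for unregularized linear models.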
(2) What is the best way to handle one-hot encoding of a categorical variable with thousands of levels (around 6,000)? This cardinality is particularly high given that the dataset is only about 10 features wide before one-hot encoding. Note that the observations are fairly evenly distributed across the categories.
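For context on the memory concern, here is a sketch of what I have tried: scikit-learn's `OneHotEncoder` returns a scipy sparse matrix by default, so storage grows with the number of non-zero entries rather than rows × categories. (The synthetic column below is a stand-in; the real data has ~6,000 categories.)

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality column (500 categories here to keep the
# sketch fast; the real case is ~6000)
rng = np.random.default_rng(0)
col = rng.integers(0, 500, size=10_000).astype(str).reshape(-1, 1)

# The default output is a scipy sparse matrix, so only the single 1 per
# row is actually stored; handle_unknown="ignore" zero-fills unseen
# categories at transform time
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(col)

print(X.shape)     # (10000, number_of_unique_categories)
print(X.getnnz())  # 10000 -- exactly one non-zero per row
```

Sparse output keeps the encoding tractable, but I am unsure whether a 6,000-column feature space is sensible downstream, versus alternatives like feature hashing or target encoding.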