Is XGBoost better with numeric predictors?

Question

I have a categorical feature that I one-hot encoded and used in my XGBoost model, but it consistently underperforms as a predictor compared to the other predictors.

Then I created a new variable that contains the same kind of information that the categorical feature has...

Imagine I'm predicting the price of a house. The categorical feature is the town the house is in, and the numerical feature is one I generated by ranking each town's relative expensiveness, based on some prior knowledge of the towns.

TownA - 100
TownB - 40
TownC - 65
TownD - 15

Now all of a sudden the new numerical variable that was directly derived from the categorical one is outperforming. Is this because XGBoost just works better with numerical variables and maybe some predictive capability is lost when I one-hot encode the variable?

Could it be that by inputting these numbers you are adding a lot of information which is helping greatly in the regression? How are the other classifiers doing when going from town category to town expensiveness? — Learning is a mess, Mar 21 '18 at 13:35
What you have done is a very basic form of what's called "impact encoding" - a technique of converting categorical features to numeric. — bradS, Apr 06 '18 at 13:31
What do you mean exactly by "consistently underperforms as a predictor compared to the other predictors"? — aivanov, May 16 '18 at 15:29
@aivanov it does not have as much gain as the other predictors — conv3d, May 16 '18 at 20:30
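
For readers unfamiliar with the "impact encoding" mentioned in the comments above, here is a minimal sketch in plain Python. It replaces each category with the mean of the target for that category. The function name impact_encode, the smoothing parameter, and the sample prices are all hypothetical and not taken from the question; in the question the scores came from prior knowledge rather than from the target itself.

```python
from collections import defaultdict


def impact_encode(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of its targets.

    smoothing > 0 pulls rare categories toward the global mean.
    This helper is illustrative only, not part of any library.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in sums
    }


# Hypothetical training data: town of each house and its price.
towns = ["TownA", "TownA", "TownB", "TownB", "TownC", "TownD"]
prices = [520.0, 480.0, 210.0, 190.0, 320.0, 90.0]

encoding = impact_encode(towns, prices)
encoded_feature = [encoding[t] for t in towns]  # one numeric column
```

If the encoding is learned from the target like this, it should be fitted on the training folds only; otherwise the target leaks into the feature and the cross-validated error looks better than it really is.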

score 5 · Answer 1 · answered Mar 16 '18 at 21:25

By default, feature importance in xgboost (importance type "weight", as used by plot_importance and get_score) is given by how many times a feature is used as a split feature across all trees in the ensemble.

When one-hot encoded, each newly created dummy variable can only take the values 0 and 1, so it can be split on at most once along any path from the root to a leaf. However, when the categories are combined into one numeric feature by giving each category a different value, that single feature can be split on many times at different levels of each tree, which raises its importance score.
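
The split-count argument can be illustrated with a toy example. The sketch below is not XGBoost: it grows a single greedy regression tree in plain Python and tallies how often each column is chosen as the split feature, once for the numeric town score and once for the one-hot dummies. The helper names and the per-town prices are hypothetical.

```python
from collections import Counter


def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys):
    """Return (gain, column, threshold) for the best split, or None."""
    best = None
    for col in range(len(rows[0])):
        values = sorted({row[col] for row in rows})
        # Candidate thresholds lie halfway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for row, y in zip(rows, ys) if row[col] < thr]
            right = [y for row, y in zip(rows, ys) if row[col] >= thr]
            gain = sse(ys) - sse(left) - sse(right)
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, col, thr)
    return best


def count_splits(rows, ys, counts):
    """Grow one greedy regression tree, tallying splits per column."""
    split = best_split(rows, ys)
    if split is None:  # node is pure or cannot be split further
        return counts
    _, col, thr = split
    counts[col] += 1
    for keep_left in (True, False):
        idx = [i for i, row in enumerate(rows) if (row[col] < thr) == keep_left]
        count_splits([rows[i] for i in idx], [ys[i] for i in idx], counts)
    return counts


town_score = {"TownA": 100, "TownB": 40, "TownC": 65, "TownD": 15}
# Hypothetical prices, one fixed value per town to keep the toy deterministic.
town_price = {"TownA": 500.0, "TownB": 200.0, "TownC": 320.0, "TownD": 90.0}

names = sorted(town_score)
towns = names * 3
ys = [town_price[t] for t in towns]

numeric_rows = [[town_score[t]] for t in towns]
onehot_rows = [[int(t == n) for n in names] for t in towns]

numeric_counts = count_splits(numeric_rows, ys, Counter())
onehot_counts = count_splits(onehot_rows, ys, Counter())
```

Both trees need three splits to separate the four towns. With the numeric encoding all three are credited to the single column, whereas with one-hot encoding they are spread over the dummies and no dummy is used more than once. A count-based importance therefore favors the numeric feature even when the two trees fit the data equally well.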

I see. It also reduced my error metric though when I cross validated. — conv3d, Mar 16 '18 at 23:01
