9

I need to quantify the importance of the features in my model. However, when I use XGBoost to do this, I get completely different results depending on whether I use the variable importance plot or the feature importances.

For example, when I compare model.feature_importances_ with xgb.plot_importance(model), I get values that do not align. Presumably the importance plot is built from the feature importances, but the values in the feature_importances_ numpy array do not directly correspond to the feature indexes returned by the plot_importance function.
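Roughly, the comparison I am making looks like this (toy data here as a stand-in; my actual model and data are omitted):

    import xgboost as xgb
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    # Stand-in data and model; the real model is not shown in this question.
    X, y = make_classification(n_samples=300, n_features=20, random_state=0)
    model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

    print(model.feature_importances_)   # one normalized value per input feature
    xgb.plot_importance(model)          # bar chart labeled f0, f1, ... ranked by F score
    plt.show()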

Here is what the plot looks like:

[bar chart produced by xgb.plot_importance(model)]

But model.feature_importances_ gives entirely different values:

array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.00568182,  0.        ,  0.        ,  0.        ,
        0.13636364,  0.        ,  0.        ,  0.        ,  0.01136364,
        0.        ,  0.        ,  0.        ,  0.        ,  0.07386363,
        0.03409091,  0.        ,  0.00568182,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.00568182,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.00568182,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.01704546,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.05681818,  0.15909091,  0.0625    ,  0.        ,
        0.        ,  0.        ,  0.10227273,  0.        ,  0.07386363,
        0.01704546,  0.05113636,  0.00568182,  0.        ,  0.        ,
        0.02272727,  0.        ,  0.01136364,  0.        ,  0.        ,
        0.11363637,  0.        ,  0.01704546,  0.01136364,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ], dtype=float32)

If I just try to grab Feature 81 (model.feature_importances_[81]), I get 0.051136363. However, model.feature_importances_.argmax() returns 72.

So the values do not correspond to each other, and I am unsure what to make of this.

Does anyone know why these values are not concordant?

NLR
  • Welcome to the site! XGBoost produces multiple measures of feature "importance" (three, in fact). Check that the same type of feature importance is being output in both cases. – bradS Jul 10 '18 at 07:43
  • Good idea @bradS. I'll take a closer look. Any idea how to specify the type for model.feature_importances_? I know how to specify it with xgb.plot_importance(model), but it is not clear whether you can change it for the .feature_importances_ attribute. – NLR Jul 10 '18 at 19:48
  • This suggests using model.booster().get_score(importance_type='weight')... I'd wager changing the importance_type will solve your issue. – bradS Jul 11 '18 at 08:17

1 Answer

15

In xgboost 0.7.post3:

  1. XGBRegressor.feature_importances_ returns weights that sum up to one.

  2. XGBRegressor.get_booster().get_score(importance_type='weight') returns the number of times each feature occurs in a split. If you divide these counts by their sum, you get Item 1, except that features with zero importance are excluded here (see the sketch after this list).

  3. xgboost.plot_importance(XGBRegressor.get_booster()) plots the values of Item 2: the number of occurrences in splits.

  4. XGBRegressor.get_booster().get_fscore() is the same as XGBRegressor.get_booster().get_score(importance_type='weight').
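A minimal sketch of these relationships, assuming xgboost 0.7.post3 (where feature_importances_ is still weight-based) and a toy regressor standing in for any fitted XGBRegressor:

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import make_regression

    # Toy regressor as a stand-in for any fitted XGBRegressor.
    X, y = make_regression(n_samples=200, n_features=10, random_state=0)
    model = xgb.XGBRegressor(n_estimators=50).fit(X, y)

    # Items 2 and 4: split counts per feature; zero-importance features are absent.
    weights = model.get_booster().get_score(importance_type='weight')
    assert weights == model.get_booster().get_fscore()

    # Item 1 from Item 2: dividing each count by the total reproduces
    # feature_importances_ for every feature that appears in a split.
    total = sum(weights.values())
    for name, count in weights.items():
        idx = int(name[1:])  # keys look like 'f3' when trained on a numpy array
        assert np.isclose(count / total, model.feature_importances_[idx])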

Method get_score returns other importance scores as well. Check the argument importance_type.

In xgboost 0.81, XGBRegressor.feature_importances_ now returns gains by default, i.e., the equivalent of get_score(importance_type='gain'). See importance_type in XGBRegressor.
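For instance (a sketch assuming xgboost 0.81+), passing importance_type='weight' to the estimator restores the split-count behaviour of the attribute:

    import numpy as np
    import xgboost as xgb

    # In 0.81+, feature_importances_ follows the estimator's importance_type
    # parameter; the default 'gain' replaced the old weight-based values.
    X, y = np.random.rand(100, 5), np.random.rand(100)
    model = xgb.XGBRegressor(importance_type='weight').fit(X, y)
    print(model.feature_importances_)  # normalized split counts again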

So, for importance scores, it is better to stick to the function get_score with an explicit importance_type parameter.
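A minimal sketch of that explicit call, on toy data:

    import numpy as np
    import xgboost as xgb

    X, y = np.random.rand(100, 5), np.random.rand(100)
    booster = xgb.XGBRegressor(n_estimators=20).fit(X, y).get_booster()

    # Request each importance type explicitly so the result does not depend
    # on defaults that changed between xgboost versions.
    for imp_type in ('weight', 'gain', 'cover'):
        print(imp_type, booster.get_score(importance_type=imp_type))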

Also, check this question for the interpretation of the importance_type parameter: "weight", "gain", and "cover".

Anton Tarasenko