I've read http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf and https://medium.com/@gabrieltseng/interpreting-complex-models-with-shap-values-1c187db6ec83, which is essentially a summary of the first link.
From the first paper, I didn't really understand how SHAP values work or how they help us determine the importance of features. In the second article, the author takes a very simple decision tree and calculates the Shapley value of one feature for a specific training example. But it never says which value determines that feature's overall importance in the end (the mean of its Shapley values over all training examples? I don't know), or why this works at all.
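For reference, this is the classic Shapley value formula both sources build on, where $F$ is the set of all features and $f_S$ is the model evaluated with only the features in the subset $S$ present:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!}\left[f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S(x_S)\right]$$

I can follow the arithmetic of this formula for a single example; it's the step from here to a global feature importance that I can't find spelled out.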
There is also a confusing difference between the two sources: the paper defines SHAP values as "Shapley values of a conditional expectation function of the original model", while the Medium article just uses plain Shapley values.
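If I read the paper correctly, that definition means the Shapley formula above is applied with the value function taken to be a conditional expectation of the model,

$$f_x(S) = E\big[f(x) \mid x_S\big],$$

rather than by actually retraining or re-evaluating the model on every feature subset. The Medium article never mentions this conditional expectation, which is part of what confuses me.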
I have read several academic papers and website articles, but none of them answered my question; most websites only deal with applying the shap framework anyway. If you can explain this or point me to a useful resource, I would appreciate it.
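To make my confusion concrete, here is my attempt at coding the raw Shapley formula in Python (just a toy sketch; `f` is a placeholder for however the prediction on a feature subset is supposed to be evaluated, which is exactly the part I don't understand):

```python
import itertools
import math

def shapley_value(f, x, i, n_features):
    """Brute-force Shapley value of feature i for input x.

    f(x, S) must return the model's prediction when only the features
    in subset S are "known" -- the paper says this should be the
    conditional expectation E[f(x) | x_S], but I don't see how the
    Medium article's decision-tree walkthrough computes that.
    """
    others = [j for j in range(n_features) if j != i]
    phi = 0.0
    # Sum the weighted marginal contribution of feature i over
    # every subset S of the remaining features.
    for size in range(len(others) + 1):
        for subset in itertools.combinations(others, size):
            S = set(subset)
            weight = (math.factorial(len(S))
                      * math.factorial(n_features - len(S) - 1)
                      / math.factorial(n_features))
            phi += weight * (f(x, S | {i}) - f(x, S))
    return phi
```

Even if this is right for one training example, I still don't see which aggregate of these $\phi_i$ values is supposed to be "the" importance of a feature.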