
Consider a drug prediction problem using a decision tree. I trained the decision tree model and would like to make predictions on new data.

For example:

patient  Attr1  Attr2  Attr3  ...  Label
002      90.0   8.0    98.0   ...  ?     ===> predict drug A

How can I calculate the confidence or probability of the prediction result of drug A?

Ethan
GoingMyWay

1 Answer


What data mining package do you use?

In sklearn, the DecisionTreeClassifier can give you probabilities, but you have to use parameters like max_depth to truncate the tree. The probability it returns is $P=n_A/(n_A+n_B)$, that is, the number of observations of class A that were "captured" by that leaf, divided by the total number of observations captured by that leaf (during training). But again, you must prune or truncate your decision tree, because otherwise the tree grows until $n=1$ in each leaf, and so $P=1$ everywhere.
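A minimal sketch of the above, using synthetic data in place of the drug table (the dataset and the choice of max_depth=3 are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the patient table: 3 attributes, binary label
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Truncate the tree so each leaf keeps several training samples;
# without max_depth the tree grows until every leaf is pure (P = 1)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba returns P = n_A / (n_A + n_B) for the leaf the sample lands in
proba = clf.predict_proba(X[:1])
print(proba)  # class proportions in that leaf, summing to 1
```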

That being said, I think you want to use something like a random forest. In a random forest, multiple decision trees are trained on different resamples of your data. In the end, probabilities can be estimated from the proportion of decision trees that vote for each class. I think this is a much more robust approach to estimating probabilities than using an individual decision tree.
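A sketch of the ensemble approach (same synthetic data assumption; note that sklearn's implementation averages per-tree class probabilities rather than counting hard votes, which amounts to the same idea for fully grown trees):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Each of the 100 trees is trained on a different bootstrap resample
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Averaged per-tree probabilities; for fully grown trees this is close to
# the fraction of trees voting for each class
proba = forest.predict_proba(X[:1])
print(proba)
```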

But random forests are not interpretable, so if interpretability is a requirement, use the decision tree as I described. You can use grid search over hyperparameters such as maximum depth, maximizing the ROC AUC score, to find the decision tree that gives the most reliable probabilities.
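The grid search could be sketched like this (the parameter grid and cross-validation settings are assumptions, not a recommendation from the answer):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Score by ROC AUC, which rewards well-ranked probabilities
# rather than just hard-label accuracy
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 3, 4, 5, None]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```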

Ricardo Cruz
  • A link to the random forest implementation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html — or see another implementation in the blog post – borgr Jan 21 '21 at 15:06
  • Very nice answer. Your blogpost link is broken though.. – Rafs Jun 08 '22 at 09:05
  • @RTD, thank you, my blog is no longer up. I have removed the link. – Ricardo Cruz Jun 08 '22 at 13:14
  • A pity. I was really interested to see why a decision tree isn't recommended to estimate probabilities. I thought if we put the accuracy of the model in mind, and look at the probabilities, we can have a good representation of what the underlying probabilities are. We can say: with an accuracy of 70%, the positive class probability at this leaf is 0.8. Perhaps we can scale the probabilities by the accuracy - eccentric I know, just a rough idea. Not sure if you can elaborate on that, but thank you in any case. – Rafs Jun 09 '22 at 09:21