
Consider a drug prediction problem using a decision tree. I trained the decision tree model and would like to make predictions on new data.

For example:

patient  Attr1  Attr2  Attr3  ...  Label
002      90.0   8.0    98.0   ...  ?     ===> predict drug A

How can I calculate the confidence or probability of the prediction result of drug A?

Ethan
GoingMyWay

1 Answer


What data mining package do you use?

In sklearn, the DecisionTreeClassifier can give you probabilities, but you have to use parameters like max_depth to truncate the tree. The probability it returns is $P=n_A/(n_A+n_B)$, that is, the number of observations of class A that were "captured" by that leaf, divided by the total number of observations captured by that leaf (during training). But again, you must prune or truncate your decision tree, because otherwise the tree grows until $n=1$ in each leaf, and so $P=1$ everywhere.
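A minimal sketch of the above, using synthetic data in place of the drug table (the dataset and the choice of max_depth=3 are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the patient table: 3 attributes, binary label
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Truncate the tree so each leaf keeps several training samples;
# without max_depth the tree grows until every leaf is pure (P = 1)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba returns P = n_A / (n_A + n_B) for the leaf the sample lands in
proba = clf.predict_proba(X[:1])
print(proba)  # class proportions in that leaf, summing to 1
```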

That being said, I think you want to use something like a random forest. In a random forest, multiple decision trees are trained on different resamples of your data. In the end, probabilities can be estimated from the proportion of decision trees that vote for each class. I think this is a much more robust approach to estimating probabilities than using an individual decision tree.
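A sketch of the ensemble approach (same synthetic data assumption; note that sklearn's implementation averages per-tree class probabilities rather than counting hard votes, which amounts to the same idea for fully grown trees):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Each of the 100 trees is trained on a different bootstrap resample
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Averaged per-tree probabilities; for fully grown trees this is close to
# the fraction of trees voting for each class
proba = forest.predict_proba(X[:1])
print(proba)
```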

But random forests are not interpretable, so if interpretability is a requirement, use the decision tree as I described. You can use grid search over hyperparameters such as maximum depth, maximizing the ROC AUC score, to find the decision tree that gives the most reliable probabilities.
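The grid search could be sketched like this (the parameter grid and cross-validation settings are assumptions, not a recommendation from the answer):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Score by ROC AUC, which rewards well-ranked probabilities
# rather than just hard-label accuracy
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 3, 4, 5, None]},
                    scoring="roc_auc", cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```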

Ricardo Cruz
  • A link to the random forest implementation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html — or see another implementation in the blog post – borgr Jan 21 '21 at 15:06
  • Very nice answer. Your blogpost link is broken though.. – Rafs Jun 08 '22 at 09:05
  • @RTD, thank you, my blog is no longer up. I have removed the link. – Ricardo Cruz Jun 08 '22 at 13:14
  • A pity. I was really interested to see why a decision tree isn't recommended to estimate probabilities. I thought if we put the accuracy of the model in mind, and look at the probabilities, we can have a good representation of what the underlying probabilities are. We can say: with an accuracy of 70%, the positive class probability at this leaf is 0.8. Perhaps we can scale the probabilities by the accuracy - eccentric I know, just a rough idea. Not sure if you can elaborate on that, but thank you in any case. – Rafs Jun 09 '22 at 09:21