4

I'm training a classifier and I want to collect incorrect outputs for a human to double-check.

The output of the classifier is a vector of probabilities for the corresponding classes, for example [0.9, 0.05, 0.05].

This means the probability of the current object being class A is 0.9, while the probability of it being class B is only 0.05, and likewise 0.05 for C.

In this situation, I think the result has high confidence, as A's probability dominates B's and C's.

In another case, [0.4,0.45,0.15], the confidence should be low, as A and B are close.

What's the best formula to use to calculate this confidence?

Bill Yan
  • 171
  • 1
  • 2
    There's no best formula, this is heuristic based. It depends on what accuracy you're looking for, but if you want a place to start, consider anything > 0.85 in the correct class a confident prediction, anything between 0.3 and 0.85 low confidence, and anything beneath 0.3 wrong – Recessive Mar 03 '20 at 05:46
  • 1
    If you have a well-calibrated method that does indeed output probabilities, your problem is already solved. In the first instance, the classifier says there's a 90% probability that the object belongs to class A, but in the second instance, it's only 45% sure that it belongs to class B. What more do you want? – Nuclear Hoagie Mar 03 '20 at 19:39
  • After your human curators have marked a sample of the output, you can use https://scikit-learn.org/stable/modules/calibration.html – chrishmorris Mar 29 '21 at 16:00

4 Answers

1

I believe that there is no "best formula" here, as there are many calibration metrics out there, depending on what you want to measure. This paper introduces three metrics for different purposes:

  • Expected Calibration Error (ECE): provides a single scalar summary of calibration.
  • Maximum Calibration Error (MCE): useful when you want to minimize the worst-case deviation between confidence and accuracy.
  • Negative log likelihood (NLL): this is the same as the cross-entropy loss.

There is also a related paper about more metrics.

Just like accuracy, F1, and ROC-AUC, the choice of calibration metric should depend on the use case.
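
To make ECE concrete, here is a minimal sketch of how it is typically computed: bin the predictions by their top-class probability, then take the weighted average of |accuracy - confidence| over the bins. The bin count and the exact binning scheme below are my own assumptions, not taken from the paper.

import torch

def expected_calibration_error(probs: torch.Tensor, labels: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
    # probs: (n_samples, n_classes) predicted probabilities; labels: (n_samples,) true class indices
    confidences, predictions = probs.max(dim=1)
    accuracies = (predictions == labels).float()

    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # weight each bin by the fraction of samples that fall into it
            ece += in_bin.float().mean() * (accuracies[in_bin].mean() - confidences[in_bin].mean()).abs()
    return ece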

Minh-Long Luu
  • 1,140
  • 3
  • 20
0

The obvious answer for binary (2-class) classification is 0.5. Beyond that, the earlier comment is correct.

One thing I have seen done is to run your model on the test set and save the predicted probabilities. Then create a threshold variable, call it thresh, and increment it from 0 to 1 in a loop. On each iteration, compare thresh with the highest predicted probability, call it P. If P > thresh, declare that class as the selected prediction, then compare it with the true class. Keep track of the errors for each value of thresh, and at the end select the value of thresh with the fewest errors.

There are also more sophisticated methods, for example "top-2 accuracy", where thresh is selected based on whether the true class falls within either the prediction with the highest probability or the one with the second-highest probability. You can construct a weighted error function and select the value of thresh with the lowest net error over the test set. For example, an error function might be as follows: if neither P(highest) nor P(second highest) equals the true class, error = 1; if P(second highest) equals the true class, error = 0.5; if P(highest) equals the true class, error = 0. I have never tried this myself, so I am not sure how well it works. When I get some time I will try it on a model with 100 classes and see how well it does.

I know that in the ImageNet competition they evaluate not just the top-1 accuracy but also the "top-3" and "top-5" accuracy; that competition has 1000 classes. I had never thought of this before, but I assume you could train your model specifically to optimize, say, top-2 accuracy by constructing a loss function used during training that forces the network to minimize this loss.
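
A rough sketch of the sweep described above, assuming the test-set probabilities and true labels are already collected as tensors. The answer leaves open how to score samples that fall below the threshold, so counting a rejected-but-correct prediction as an error is my own assumption, as is the step size:

import torch

def sweep_threshold(probs: torch.Tensor, labels: torch.Tensor, steps: int = 101):
    # probs: (n_samples, n_classes) test-set probabilities; labels: (n_samples,) true class indices
    top_p, top_class = probs.max(dim=1)
    correct = top_class == labels

    best_thresh, best_errors = 0.0, float("inf")
    for thresh in torch.linspace(0.0, 1.0, steps):
        accepted = top_p > thresh
        # errors = accepted-but-wrong predictions + rejected-but-correct ones (assumption)
        errors = ((accepted & ~correct) | (~accepted & correct)).sum().item()
        if errors < best_errors:
            best_thresh, best_errors = thresh.item(), errors
    return best_thresh, best_errors

def weighted_top2_error(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # per-sample error: 0 if the top prediction is correct, 0.5 if only the second is, 1 otherwise
    top2 = probs.topk(2, dim=1).indices
    errors = torch.where(top2[:, 0] == labels, torch.tensor(0.0),
                         torch.where(top2[:, 1] == labels, torch.tensor(0.5), torch.tensor(1.0)))
    return errors.sum()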

Gerry P
  • 714
  • 4
  • 11
0

I assume you want a model that uses the Softmax as the output layer.

Basically, the Softmax will produce a set of probabilities that all sum up to 1. So if you have three classes in your data, the Softmax will produce these confidence values by default, even though that is not exactly its main purpose.

The Softmax is commonly used on multiclass data.
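
As a minimal sketch of what that looks like in practice: the logits below are made up so that they roughly reproduce the two probability vectors from the question, and using the top probability or the top-2 margin as a confidence score is my own suggestion rather than part of this answer.

import torch
import torch.nn.functional as F

# made-up logits from a final layer; softmax turns them into probabilities that sum to 1
logits = torch.tensor([[4.0, 1.1, 1.1], [0.2, 0.32, -0.8]])
probs = F.softmax(logits, dim=1)        # roughly [[0.90, 0.05, 0.05], [0.40, 0.45, 0.15]]

# two simple confidence readings from the softmax output
top_prob = probs.max(dim=1).values      # probability of the predicted class
top2 = probs.topk(2, dim=1).values
margin = top2[:, 0] - top2[:, 1]        # gap between the two most likely classes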

Marcus
  • 226
  • 1
  • 7
0

Maybe it's possible to use entropy to evaluate confidence?

import torch

def entropy(cls_prob: torch.Tensor) -> torch.Tensor:
    """Calculate entropy. The larger the number, the lower the confidence level.

    Args:
        cls_prob (torch.Tensor): the output scores of the model,
            expected input shape is (batch size, number of classes)

    Returns:
        torch.Tensor: entropy of each sample
    """
    # clamp avoids log(0) = -inf when a class probability is exactly zero
    cls_prob = cls_prob.clamp_min(1e-12)
    score = -torch.sum(torch.log(cls_prob) * cls_prob, dim=1)
    return score

prob_score = torch.tensor([[0.9,0.05,0.05], [0.4,0.45,0.15]])
confidence = entropy(prob_score)

and the result is tensor([0.3944, 1.0104]), where the second sample has higher entropy and therefore lower confidence.
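
If the goal is to route low-confidence predictions to a human reviewer, as in the question, you could flag the samples whose entropy exceeds some cutoff; the 0.7 below is an arbitrary value that would need tuning on your own data:

review_mask = confidence > 0.7              # high entropy = low confidence
samples_to_review = prob_score[review_mask]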