
I am trying to evaluate the accuracy in a multiclass classification setting, and I'm wondering why the sklearn implementation of the accuracy score deviates from the commonly agreed-on accuracy score: $\frac{TP+TN}{TP+TN+FP+FN}$

In sklearn, sklearn.metrics.accuracy_score is defined as follows (https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score):

$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$

This seems to completely neglect the true negatives of the classification.
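
For reference, a quick check of that definition (the labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, only to illustrate sklearn's definition.
y_true = np.array([1, 2, 3, 2, 1, 3])
y_pred = np.array([1, 2, 1, 2, 3, 3])

# accuracy_score just averages the indicator 1(y_pred_i == y_true_i).
print(accuracy_score(y_true, y_pred))   # 4/6 ≈ 0.667
print(np.mean(y_pred == y_true))        # same value
```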

Example:

| Actual \ Predicted | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 5 | 2 | 0 |
| 2 | 8 | 6 | 2 |
| 3 | 3 | 4 | 12 |

And here are the TP, TN, FP, and FN per class:

| Class | TP | TN | FP | FN |
|---|---|---|---|---|
| 1 | 5 | 24 | 11 | 2 |
| 2 | 6 | 20 | 6 | 10 |
| 3 | 12 | 21 | 2 | 7 |
| Sum | 23 | 65 | 19 | 19 |

In the "standard" average score I would calculate: $\frac{23+65}{23+65+19+19}=0,698$

With the sklearn implementation, however, it would be: $\frac{1}{42}\cdot 23 = 0.548$
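
For reference, here is a small numpy sketch that reproduces both numbers from the confusion matrix above, deriving the per-class counts with the usual one-vs-rest bookkeeping:

```python
import numpy as np

# Confusion matrix from the example: rows = actual class, columns = predicted class.
C = np.array([[5, 2, 0],
              [8, 6, 2],
              [3, 4, 12]])

n_samples = C.sum()              # 42
tp = np.diag(C)                  # [5, 6, 12]
fp = C.sum(axis=0) - tp          # predicted as the class, actually another: [11, 6, 2]
fn = C.sum(axis=1) - tp          # belongs to the class, predicted as another: [2, 10, 7]
tn = n_samples - tp - fp - fn    # everything else: [24, 20, 21]

# sklearn-style accuracy: fraction of samples on the diagonal.
accuracy = tp.sum() / n_samples                                # 23/42 ≈ 0.548

# The "standard" formula applied to the summed one-vs-rest counts.
ovr_score = (tp.sum() + tn.sum()) / (tp + tn + fp + fn).sum()  # 88/126 ≈ 0.698

print(accuracy, ovr_score)
```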

Why is this different? And is the other metric mentioned anywhere in the literature? I couldn't find anything so far.

Chris

1 Answer


Your "commonly agreed on" and "standard" accuracy is meant for binary classification, in which case it agrees with the other formula from sklearn. In that case, "positive/negative" refer to the two classes, so this is also a little different from your version.

Your version of it is a sort of average (the "mediant") of the accuracies of the implicit one-vs-rest classifiers. As such, your score is meaningful, but it will generally be larger than the usual multiclass accuracy: a misclassified sample still counts as a true negative in the $n-2$ one-vs-rest problems for the classes it neither belongs to nor was assigned to, which gives the relationship $\text{mediant} = 1 - \frac{2\,(1-\text{accuracy})}{n}$. For a balanced problem, a constant classifier gets a mediant-of-OVR-accuracy score of $\frac{(n-1)^2+1}{n^2}$ but an accuracy score of just $1/n$. (Back in the binary case, your "Sum(TN)" would include both diagonal entries, so the numerator is twice the number of correct predictions and the denominator is twice the number of samples, which is why it agrees with the ordinary accuracy.)
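
As a sanity check of that relationship, here is a small sketch on a made-up balanced 3-class problem with a constant classifier; sklearn's multilabel_confusion_matrix supplies the one-vs-rest counts:

```python
import numpy as np
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix

# Made-up balanced 3-class problem; the classifier always predicts class 0.
n_classes, per_class = 3, 10
y_true = np.repeat(np.arange(n_classes), per_class)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)         # 1/3

# One 2x2 matrix [[TN, FP], [FN, TP]] per one-vs-rest problem.
mcm = multilabel_confusion_matrix(y_true, y_pred)
tn, tp = mcm[:, 0, 0], mcm[:, 1, 1]
mediant = (tp.sum() + tn.sum()) / mcm.sum()  # 5/9 = ((3-1)**2 + 1) / 3**2

# mediant = 1 - 2 * (1 - accuracy) / n_classes
print(acc, mediant, 1 - 2 * (1 - acc) / n_classes)
```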

As such, your metric is similar to macro-averaged scores (though I've only seen that done for precision/recall/F-beta, never for accuracy); see "Micro Average vs Macro average Performance in a Multiclass classification setting".
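
For comparison, a minimal sketch of sklearn's macro vs. micro averaging (the labels are made up; note that for single-label multiclass problems the micro average collapses to the ordinary accuracy):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels, only to illustrate the two averaging modes.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

# Macro: compute precision per class, then take the unweighted mean.
print(precision_score(y_true, y_pred, average="macro"))   # ≈ 0.639
# Micro: pool the per-class TP/FP counts first; equals accuracy here.
print(precision_score(y_true, y_pred, average="micro"))   # ≈ 0.667
print(accuracy_score(y_true, y_pred))                     # ≈ 0.667
```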

Finally, as a matter of opinion: since accuracy measures the probability of getting the prediction right, it is easier to interpret; your metric additionally gives credit for not misclassifying a sample into each of the classes it was not put in, hence the inflation. Of course, this perhaps also makes a multiclass model's accuracy sound terrible by comparison (the sklearn docs say "this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted").

Ben Reiniger