I'm working on a multilabel classification problem: there are $N$ classes, and each example can belong to any number of them, from $0$ to $N$. Below you can see the precision and recall computed using the various averaging options of `sklearn.metrics.precision_recall_fscore_support`. The number in square brackets is the total area under the curve.
Intuitively, I understand the difference between `micro` and `macro`: with imbalanced classes, `macro` lets the rarer classes contribute equally with the more common classes. This is one reason `macro` can look better than `micro`: rare classes that happen to have good performance skew the results by contributing more heavily than they normally would.
However, what I don't understand is why `weighted` can do so much better than everything else. If anything, I would expect `weighted` to land somewhere between `micro` and `macro`, since rarer classes are once again weighted less (other threads have even suggested using `weighted` as a "stand-in" for `micro`).
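For what it's worth, here's a minimal pure-Python sketch (made-up data, not my actual results) of one situation where `weighted` precision beats both `micro` and `macro`:

```python
# Made-up multilabel indicator data showing how `weighted` precision can
# exceed BOTH `micro` and `macro`: a rare class attracts many false
# positives, which hurts micro (pooled counts) and macro (that class's
# precision is 0), while `weighted` weights each class by its *true*
# support, so the well-predicted common class dominates.

# rows = examples, columns = classes (class 0 common, class 1 rare)
y_true = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 1], [1, 1], [1, 1], [1, 1], [0, 0]]

n_classes = len(y_true[0])
tp = [0] * n_classes
fp = [0] * n_classes
support = [0] * n_classes
for t, p in zip(y_true, y_pred):
    for c in range(n_classes):
        tp[c] += t[c] * p[c]            # true positive for class c
        fp[c] += (1 - t[c]) * p[c]      # false positive for class c
        support[c] += t[c]              # true count for class c

per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
             for c in range(n_classes)]

micro = sum(tp) / (sum(tp) + sum(fp))         # pooled: 4 / 8 = 0.5
macro = sum(per_class) / n_classes            # (1.0 + 0.0) / 2 = 0.5
weighted = sum(s * pc for s, pc in zip(support, per_class)) / sum(support)
print(micro, macro, weighted)                 # weighted = 4/5 = 0.8
```

(For recall, I believe `weighted` and `micro` coincide in the multilabel indicator case, since both reduce to total true positives over total support, so presumably any divergence like this comes from the precision side.)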
Finally, there's `samples`, which computes the statistics per example (instead of per class) and then takes the unweighted average across all examples. This is effectively the same as transposing the results, so that classes become examples and examples become classes, and then taking the macro average. I am not sure when to use this at all.
Thanks in advance for any advice.
In short: in multilabel classification, if `weighted` averaging of precision and recall gives much "better" results than `micro` and `macro`, what is this indicative of?