I'm working on a multilabel classification problem: there are $N$ classes, and each example can belong to any number of them, from $0$ to $N$. Below you can see the precision and recall computed using the various averaging options of `sklearn.metrics.precision_recall_fscore_support`. The number in square brackets is the total area under the curve.
Intuitively, I understand the difference between `micro` and `macro`: with imbalanced classes, `macro` lets the rarer classes contribute equally with the more common classes. This is one reason `macro` can look better than `micro`: rare classes that happen to have good performance skew the results by contributing more heavily than they normally would.
However, what I don't understand is why `weighted` can do so much better than everything else. If anything, I would expect `weighted` to land somewhere between `micro` and `macro`, since rarer classes are once again weighted less (other threads have even suggested using `weighted` as a "stand-in" for `micro`).
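For what it's worth, here's a minimal pure-Python sketch (made-up data, not my actual results) of one situation where `weighted` precision beats both `micro` and `macro`:

```python
# Made-up multilabel indicator data showing how `weighted` precision can
# exceed BOTH `micro` and `macro`: a rare class attracts many false
# positives, which hurts micro (pooled counts) and macro (that class's
# precision is 0), while `weighted` weights each class by its *true*
# support, so the well-predicted common class dominates.

# rows = examples, columns = classes (class 0 common, class 1 rare)
y_true = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 1], [1, 1], [1, 1], [1, 1], [0, 0]]

n_classes = len(y_true[0])
tp = [0] * n_classes
fp = [0] * n_classes
support = [0] * n_classes
for t, p in zip(y_true, y_pred):
    for c in range(n_classes):
        tp[c] += t[c] * p[c]            # true positive for class c
        fp[c] += (1 - t[c]) * p[c]      # false positive for class c
        support[c] += t[c]              # true count for class c

per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
             for c in range(n_classes)]

micro = sum(tp) / (sum(tp) + sum(fp))         # pooled: 4 / 8 = 0.5
macro = sum(per_class) / n_classes            # (1.0 + 0.0) / 2 = 0.5
weighted = sum(s * pc for s, pc in zip(support, per_class)) / sum(support)
print(micro, macro, weighted)                 # weighted = 4/5 = 0.8
```

(For recall, I believe `weighted` and `micro` coincide in the multilabel indicator case, since both reduce to total true positives over total support, so presumably any divergence like this comes from the precision side.)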
Finally, there's `samples`, which computes the statistics per example (instead of per class) and then takes the unweighted average across all examples. This is effectively the same as transposing the results, so that classes become examples and examples become classes, and then taking the macro average. I am not sure when to use this at all.
Thanks in advance for any advice.
In short: in multilabel classification, if `weighted` averaging of precision and recall gives much "better" results than `micro` and `macro`, what is this indicative of?