The article "Macro F1 and Macro F1" demonstrates two different definitions of the macro F1 score that are used in the literature. The first is computed as follows:
F1 scores are computed for each class and then averaged via arithmetic mean
The second is computed as follows:
The harmonic mean is computed over the arithmetic means of precision and recall
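To make the difference concrete, here is a minimal sketch of both definitions in plain Python. The function names (`macro_f1_mean_of_f1`, `macro_f1_of_means`) and the toy labels are my own and are only meant as an illustration, not taken from the article.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return per-class precision and recall lists (one entry per label)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumption: undefined ratios (zero denominator) are treated as 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return precisions, recalls


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0


def macro_f1_mean_of_f1(y_true, y_pred):
    """First definition: arithmetic mean of the per-class F1 scores."""
    precisions, recalls = per_class_precision_recall(y_true, y_pred)
    f1_scores = [harmonic_mean(p, r) for p, r in zip(precisions, recalls)]
    return sum(f1_scores) / len(f1_scores)


def macro_f1_of_means(y_true, y_pred):
    """Second definition: harmonic mean of macro precision and macro recall."""
    precisions, recalls = per_class_precision_recall(y_true, y_pred)
    mean_precision = sum(precisions) / len(precisions)
    mean_recall = sum(recalls) / len(recalls)
    return harmonic_mean(mean_precision, mean_recall)


# Toy data, chosen so that precision and recall differ across classes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 1, 2, 2]

print(f"averaged F1:    {macro_f1_mean_of_f1(y_true, y_pred):.4f}")  # 0.8000
print(f"F1 of averages: {macro_f1_of_means(y_true, y_pred):.4f}")    # 0.8296
```

On this toy data the two definitions give different values, so the choice matters in practice.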
I was wondering which definition is actually implemented in scikit-learn. I cannot tell from the docs which one is used; the description of average='macro' only says:
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
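One way to check this empirically is to compare `f1_score(..., average="macro")` against both definitions on data where they disagree. The sketch below uses only `f1_score`, `precision_score` and `recall_score` from `sklearn.metrics`; the toy labels are made up for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy data on which the two definitions give different values.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 1, 2, 2]

# What scikit-learn reports as macro F1.
sklearn_macro = f1_score(y_true, y_pred, average="macro")

# First definition: arithmetic mean of the per-class F1 scores.
per_class_f1 = f1_score(y_true, y_pred, average=None)
averaged_f1 = per_class_f1.mean()

# Second definition: harmonic mean of macro precision and macro recall.
macro_p = precision_score(y_true, y_pred, average="macro")
macro_r = recall_score(y_true, y_pred, average="macro")
f1_of_averages = 2 * macro_p * macro_r / (macro_p + macro_r)

print("sklearn macro F1:", sklearn_macro)
print("averaged F1:     ", averaged_f1)
print("F1 of averages:  ", f1_of_averages)
```

In this check the scikit-learn value should coincide with the averaged per-class F1 (the first definition) and differ from the F1 of averaged precision and recall, which is consistent with the "calculate metrics for each label" wording quoted above.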