I understand your questions as follows:
Is it valid to have an all-zeros output for a certain input?
Yes, it's possible in some cases. For example:
In the tutorial "jigsaw-toxic-comment-classification-challenge", the data comes from Wikipedia comments. Toxic behavior is generally rare in people's comments; it is very unusual for someone to post "bad" comments on an informative source like Wikipedia, so many examples end up with all labels set to zero.
In single-label classification, such as predicting whether a person has a rare disease, the dataset would contain labels that are mostly zero (no disease), i.e. heavily skewed towards zero.
This happens in datasets where the positive class to be predicted is very rare.
I take your second question to be whether "accuracy" is a good way of evaluating a model for this type of problem.
You are right: in such a case, even a trivial program/model that outputs zero for every input would achieve an accuracy of more than 90%, so accuracy is not a good metric to evaluate the model on here.
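As a minimal sketch of this (the labels below are made up, with 95% negatives, similar to the imbalance described above):

```python
# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts zero
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- looks great, yet the model has learned nothing
```

The baseline scores 95% accuracy while never detecting a single positive, which is exactly why accuracy is misleading on skewed data.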
You should look into the metrics f1_score, recall, and precision, which are better suited to this type of problem.
Basically, we are interested in "out of those predicted positive, how many are really positive" (precision) and "out of those that should have been predicted positive, how many are actually predicted positive" (recall).
If my definitions seem confusing, please go through the link below:
f1_score/recall/precision
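To make the definitions concrete, here is a small pure-Python sketch of these metrics (the helper function and the label lists are my own illustration, not from any library):

```python
def precision_recall_f1(y_true, y_pred):
    # Count true positives, false positives, and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of the predicted positives, how many are really positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the actual positives, how many were predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0] * 95 + [1] * 5

# The all-zeros baseline scores 0 on every metric, despite 95% accuracy
print(precision_recall_f1(y_true, [0] * 100))  # (0.0, 0.0, 0.0)

# A hypothetical model that catches 3 of the 5 positives, with 1 false positive
y_pred = [0] * 94 + [1] * 4 + [0] * 2
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r)  # 0.75 0.6
```

Note how these metrics immediately expose the all-zeros baseline, which accuracy rewards. In practice you would use `sklearn.metrics.precision_score`, `recall_score`, and `f1_score` rather than writing this by hand.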
Hope this helps.