Which is better for accuracy, or are they the same?
Of course, with categorical_crossentropy you one-hot encode the targets, and with sparse_categorical_crossentropy you encode them as plain integers.
Additionally, when is one better than the other?
2 Answers
Use sparse categorical crossentropy when your classes are mutually exclusive (i.e. when each sample belongs to exactly one class) and categorical crossentropy when one sample can have multiple classes or the labels are soft probabilities (like [0.5, 0.3, 0.2]).
The formula for categorical crossentropy ($S$ - samples, $C$ - classes, $s \in c$ - sample $s$ belongs to class $c$) is:
$$ -\frac{1}{|S|} \sum_{s \in S} \sum_{c \in C} \mathbb{1}_{s \in c} \log p(s \in c) $$
For the case when classes are mutually exclusive, you don't need to sum over them: for each sample, the only non-zero term is $-\log p(s \in c)$ for the true class $c$.
This saves time and memory. Consider the case of 10,000 mutually exclusive classes: just one log instead of a sum of 10,000 terms per sample, and just one integer per target instead of 10,000 floats.
The formula is the same in both cases, so there should be no impact on accuracy.
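As a quick check, here is a minimal sketch (assuming TensorFlow's Keras; the probabilities are made up for illustration) showing that the two losses agree when the integer and one-hot targets encode the same labels:

```python
import numpy as np
import tensorflow as tf

# Predicted probabilities for 3 samples over 4 mutually exclusive classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]], dtype=np.float32)

int_labels = np.array([0, 1, 3])                 # integer targets
onehot_labels = tf.one_hot(int_labels, depth=4)  # equivalent one-hot targets

cce = tf.keras.losses.CategoricalCrossentropy()
scce = tf.keras.losses.SparseCategoricalCrossentropy()

# Both print the same value (about 0.469 for these numbers).
print(cce(onehot_labels, probs).numpy())
print(scce(int_labels, probs).numpy())
```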

The answer, in a nutshell
If your targets are one-hot encoded, use categorical_crossentropy.
Examples of one-hot encodings:
[1,0,0]
[0,1,0]
[0,0,1]
But if your targets are integers, use sparse_categorical_crossentropy.
Examples of integer encodings (for the sake of completeness):
1
2
3
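To make the difference concrete, here is a minimal sketch (assuming tf.keras; the layer sizes and dummy data are arbitrary) where the model stays the same and only the target format changes with the loss:

```python
import numpy as np
import tensorflow as tf

num_classes = 3
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

x = np.random.rand(32, 8).astype("float32")          # dummy features
y_int = np.random.randint(0, num_classes, size=32)   # integer targets, e.g. [1, 2, 0, ...]

# Integer targets -> sparse_categorical_crossentropy.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y_int, epochs=1, verbose=0)

# One-hot targets -> categorical_crossentropy.
y_onehot = tf.keras.utils.to_categorical(y_int, num_classes)  # e.g. [0, 1, 0]
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x, y_onehot, epochs=1, verbose=0)
```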
Do I need a single output node for sparse_categorical_crossentropy? And what does the from_logits argument mean? – Leevo Dec 15 '19 at 17:01
@Leevo from_logits=True tells the loss function that an activation function (e.g. softmax) was not applied on the last layer, in which case the number of output nodes still needs to equal the number of classes. This is equivalent to using a softmax and from_logits=False. However, if you end up using sparse_categorical_crossentropy, make sure your target values are 1D, e.g. [1, 1, 0, 1, ...] (and not [[1], [1], [0], [1], ...]). On the other hand, if you use categorical_crossentropy and your target values are 1D, you need to apply keras.utils.to_categorical(targets) on them first to convert them to 2D. – Alaa M. Jul 03 '21 at 13:04
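To illustrate the two points in the comment above, here is a minimal sketch (assuming tf.keras; the logits are made-up values):

```python
import numpy as np
import tensorflow as tf

logits = np.array([[2.0, 0.5, -1.0]], dtype=np.float32)  # raw scores, no softmax applied
target = np.array([0])                                   # 1D integer target

# from_logits=True: the loss applies the softmax internally...
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(target, logits)

# ...which is equivalent to applying softmax yourself and using from_logits=False.
loss_b = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)(
    target, tf.nn.softmax(logits))

# For categorical_crossentropy, first convert 1D integer targets to 2D one-hot.
onehot = tf.keras.utils.to_categorical(target, num_classes=3)  # [[1., 0., 0.]]
loss_c = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(onehot, logits)

print(loss_a.numpy(), loss_b.numpy(), loss_c.numpy())  # all (numerically) equal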
I experienced vastly different results between the two losses and finally saw here that the last dimension needs to be removed. Thanks! – N4ppeL Dec 06 '22 at 18:05
I have a multi-output model with three outputs o1,o2,o3, and they have 167, 11, and 7 classes respectively. I've read in your answer that it'll make no difference, but is there any difference if I use sparse_ or not? Can I go for categorical for the last two and sparse for the first one, since there are 167 classes in the first output? – Deshwal Jan 08 '20 at 04:58
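For what it's worth, here is a sketch of such a multi-output setup (assuming tf.keras; the shared layers are illustrative): the two losses can be mixed freely across outputs, as long as each output's target format matches its loss:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(8,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)
o1 = tf.keras.layers.Dense(167, activation="softmax", name="o1")(h)
o2 = tf.keras.layers.Dense(11, activation="softmax", name="o2")(h)
o3 = tf.keras.layers.Dense(7, activation="softmax", name="o3")(h)
model = tf.keras.Model(inputs, [o1, o2, o3])

model.compile(
    optimizer="adam",
    loss={
        "o1": "sparse_categorical_crossentropy",  # integer targets (cheap with 167 classes)
        "o2": "categorical_crossentropy",         # one-hot targets
        "o3": "categorical_crossentropy",         # one-hot targets
    },
)
```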