Let's say I'm trying to classify some data with logistic regression.
Before the weighted sum of the inputs is passed to the logistic function (which maps it into the range $[0,1]$), the weights must be optimized for a desirable outcome. To find optimal weights for classification, we need an error function that can be minimized; this can be cross entropy.
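To make my setup concrete, here is a minimal sketch of that forward pass (the feature values, weights, and bias below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single sample with three features
x = np.array([0.5, -1.2, 3.0])   # feature vector (made up)
w = np.array([0.8,  0.1, -0.4])  # weights to be optimized (made up)
b = 0.2                          # bias term (made up)

p = sigmoid(np.dot(w, x) + b)    # predicted probability of the positive class
print(p)
```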
From my knowledge, cross entropy quantifies the difference between two probability distributions over the same set of events, measured as a difference in bits (or nats).
For some reason, cross entropy is equivalent to the negative log likelihood. The cross entropy between two probability distributions $p$ and $q$ is defined as:
$$H(p, q)=-\sum_{x}p(x)\log q(x)$$
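For instance, a quick numerical sketch of this definition, using two made-up discrete distributions over three events and the natural log (so the result is in nats):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution (made up)
q = np.array([0.5, 0.3, 0.2])  # model distribution (made up)

H_pq = -np.sum(p * np.log(q))  # cross entropy H(p, q) in nats
print(H_pq)
```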
From my knowledge again, if we expect a binary outcome from our function, it would be optimal to compute the cross entropy loss over Bernoulli random variables.
By definition, the probability mass function $g$ of the Bernoulli distribution over a possible outcome $x$ is:
$$g(x \mid p)=p^{x}(1-p)^{1-x} \quad \textrm{for} \ x\in \{0, 1\}$$
This means the probability is $1-p$ if $x=0$ and $p$ if $x=1$.
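As a sanity check on that definition (with a made-up value of $p$):

```python
def bernoulli_pmf(x, p):
    # g(x | p) = p^x * (1 - p)^(1 - x) for x in {0, 1}
    return (p ** x) * ((1 - p) ** (1 - x))

p = 0.8                      # made-up success probability
print(bernoulli_pmf(1, p))   # equals p
print(bernoulli_pmf(0, p))   # equals 1 - p (up to floating point)
```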
The Bernoulli distribution models a binary outcome, and therefore cross entropy computed over Bernoulli random variables is called binary cross entropy:
$$\mathcal{L}(\theta)= -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]$$
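To illustrate what I mean by that formula, here is a small sketch with made-up labels $y_i$ and predicted probabilities $p_i$, computing the loss once directly and once as the average negative log of the Bernoulli pmf:

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # made-up binary labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # made-up predicted probabilities

# Binary cross entropy as written above
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Average negative log of the Bernoulli pmf p^y * (1 - p)^(1 - y)
nll = -np.mean(np.log(p**y * (1 - p)**(1 - y)))

print(bce, nll)  # both give the same value
```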
Is this true? Why is the negative log likelihood associated with cross entropy? Why does the Bernoulli random variable work so well here?
In short, how does binary cross entropy work?