Let's say I'm trying to classify some data with logistic regression.
Before the weighted sum of the inputs is passed to the logistic function (which maps it into the range $[0,1]$), the weights must be optimized for a desirable outcome. To find optimal weights for classification, we need an error function that can be minimized; this can be cross entropy.
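To make my setup concrete, here is a minimal sketch of that forward pass (the feature values, weights, and bias below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single sample with three features
x = np.array([0.5, -1.2, 3.0])   # feature vector (made up)
w = np.array([0.8,  0.1, -0.4])  # weights to be optimized (made up)
b = 0.2                          # bias term (made up)

p = sigmoid(np.dot(w, x) + b)    # predicted probability of the positive class
print(p)
```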
From my knowledge, cross entropy quantifies the difference between two probability distributions over the same set of events, measured as a difference in bits (or nats).
For some reason, cross entropy is equivalent to the negative log likelihood. The cross entropy between two probability distributions $p$ and $q$ is defined as:
$$H(p, q)=-\sum_{x}p(x)\log q(x)$$
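For instance, a quick numerical sketch of this definition, using two made-up discrete distributions over three events and the natural log (so the result is in nats):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution (made up)
q = np.array([0.5, 0.3, 0.2])  # model distribution (made up)

H_pq = -np.sum(p * np.log(q))  # cross entropy H(p, q) in nats
print(H_pq)
```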
From my knowledge again, if we expect a binary outcome from our function, it would be optimal to compute the cross entropy loss over Bernoulli random variables.
By definition, the probability mass function $g$ of the Bernoulli distribution over a possible outcome $x$ is:
$$g(x \mid p)=p^{x}(1-p)^{1-x} \quad \textrm{for} \ x\in \{0, 1\}$$
This means the probability is $1-p$ if $x=0$ and $p$ if $x=1$.
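As a sanity check on that definition (with a made-up value of $p$):

```python
def bernoulli_pmf(x, p):
    # g(x | p) = p^x * (1 - p)^(1 - x) for x in {0, 1}
    return (p ** x) * ((1 - p) ** (1 - x))

p = 0.8                      # made-up success probability
print(bernoulli_pmf(1, p))   # equals p
print(bernoulli_pmf(0, p))   # equals 1 - p (up to floating point)
```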
The Bernoulli distribution models a binary outcome, and therefore cross entropy computed over Bernoulli random variables is called binary cross entropy:
$$\mathcal{L}(\theta)= -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]$$
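To illustrate what I mean by that formula, here is a small sketch with made-up labels $y_i$ and predicted probabilities $p_i$, computing the loss once directly and once as the average negative log of the Bernoulli pmf:

```python
import numpy as np

y = np.array([1, 0, 1, 1])           # made-up binary labels
p = np.array([0.9, 0.2, 0.7, 0.6])   # made-up predicted probabilities

# Binary cross entropy as written above
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Average negative log of the Bernoulli pmf p^y * (1 - p)^(1 - y)
nll = -np.mean(np.log(p**y * (1 - p)**(1 - y)))

print(bce, nll)  # both give the same value
```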
Is this true? Why is the negative log likelihood associated with cross entropy? Why does the Bernoulli random variable work so well here?
In short, how does binary cross entropy work?