
In statistical mechanics, entropy is defined by the following relation:

$$S=-k_B\sum_{i=1}^N p_i\ln p_i,$$

where $p_i$ is the probability of occupying the $i$th state, and $N$ is the number of accessible states. I easily understand what probability is: for a frequentist it's just the long-run frequency of getting that result. But I have a hard time trying to intuitively understand what $-p_i \ln p_i$ means. In the case where $p_i=p_j\; \forall i\ne j$, this reduces to $S=k_B\ln N$, i.e. $k_B$ times the logarithm of the number of accessible states.
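
Explicitly, with $p_i = 1/N$ for every state,

$$S=-k_B\sum_{i=1}^N \frac{1}{N}\ln\frac{1}{N}=k_B\ln N.$$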

But in the general case of unequal probabilities, what does $-p_i\ln p_i$ really represent? Is it some sort of "(logarithm of an) average number of accessible states"? Or maybe it's more useful to try to understand what $p_i^{p_i}$ is (but that seems even harder)?

Ruslan
  • a bit naive (but the best I can offer): if you take $M$ independent copies of your system, with $M$ very big, you will have $N^M$ possible states, but only some of them will have a reasonably non-zero probability, namely those where the frequency of each $i$ is close to $p_i$. Then $S$ is (the limit as $M\to\infty$ of) $1/M$ times the log of the number of these states. – user8268 Feb 12 '16 at 21:43
  • Related: http://math.stackexchange.com/questions/331103/intuitive-explanation-of-entropy and http://math.stackexchange.com/questions/663351/what-does-the-logpx-mean-in-the-calculation-of-entropy – Winther Feb 12 '16 at 21:49
  • Do you know what "expected value" means? – Ben Grossmann Feb 12 '16 at 22:06
  • @Omnomnomnom as I understand, it's the limit of averages as number of trials goes to infinity. – Ruslan Feb 12 '16 at 22:08
  • For events $i$ with probabilities $1=\sum_ip_i$, $\eta_i:=-p_i\ln p_i$ is the information gained by event $i$ happening. Sorry, I don't have references at hand. – Gyro Gearloose Feb 12 '16 at 22:30
  • @Ruslan $-\ln p_i$ is the information gained if the $i$th event occurs, and $\sum p_i(-\ln p_i)$ is the expected amount of information gained – Ben Grossmann Feb 12 '16 at 23:03

2 Answers


Let's say you wanted to compress the results of a sequence of independent trials into a sequence of bits.

Then the "ideal" encoding of the result of the trials would have $-\log_2 p_i$ bits for event $i$. This is in the limit, as the number of trials approaches infinity.

Now, what is the expected number of bits per trial? Since event $i$ takes $-\log_2 p_i$ bits and occurs with probability $p_i$, the expected value is $-\sum p_i\log_2 p_i$. That is, if you want to encode the results of $N$ trials, you are going to require, on average, $-N\sum p_i\log_2 p_i$ bits with your absolute best encoding.

You can see this most clearly when the $p_i$ are all of the form $\frac{1}{2^{k_i}}$.

For example, if $p_1=1/2$, $p_2=1/4$, $p_3=1/4$, then an "ideal" encoding uses $0$ for event $1$, $10$ for event $2$, and $11$ for event $3$. The expected number of bits per trial is $\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2+\frac{1}{4}\cdot 2 = -\sum p_i\log_2 p_i=\frac{3}{2}$. This means that with $N$ trials of this sort, the expected number of bits needed to store the results is $\frac{3}{2}N$.
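
As a quick numerical check, here is a minimal Python sketch; the list of code lengths `[1, 2, 2]` simply records the codewords $0$, $10$, $11$ above. It shows the expected code length coinciding with the entropy in bits:

```python
import math

# Probabilities of the three events and the lengths (in bits) of the
# prefix code described above: 0, 10, 11.
p = [0.5, 0.25, 0.25]
code_lengths = [1, 2, 2]

# Expected number of bits per trial under this encoding.
expected_bits = sum(pi * li for pi, li in zip(p, code_lengths))

# Entropy in bits: -sum p_i log2 p_i.
entropy_bits = -sum(pi * math.log2(pi) for pi in p)

print(expected_bits, entropy_bits)  # 1.5 1.5 -- equal because each p_i is a power of 1/2
```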

So entropy is also part of what mathematicians call "information theory." That is, the entropy of a system tells you how much (expected) information is needed to describe the results.

Now, if your probabilities are not so nice, then you'd have to encode smarter. For example, if $p_1=p_2=p_3=\frac{1}{3}$, then you wouldn't get "ideal" storage by encoding the values one at a time. But, say, if you took five bits at a time, you could store three results, since $5$ bits give $32$ values, enough for any of the $27$ possible outcomes of three trials. In $8$ bits, you can store the results of $5$ trials. In $m$ bits, you can store $\log_3(2^m)$ results. So to store $n$ results, you need $m$ bits with $\log_3(2^m)\geq n$, which is $$m\geq \frac{n}{\log_3 2} = n\log_2 3 = -n\sum p_i\log_2 p_i.$$
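
For concreteness, here is a minimal Python sketch of this counting argument (it just searches for the smallest $m$ with $2^m\ge 3^n$); the bits per trial approach $\log_2 3\approx 1.585$:

```python
import math

# Smallest m such that 2^m >= 3^n, i.e. m bits suffice to label
# all 3^n possible outcomes of n trials with three equally likely results.
def bits_needed(n_trials: int) -> int:
    m = 0
    while 2 ** m < 3 ** n_trials:
        m += 1
    return m

for n in (1, 3, 5, 100):
    m = bits_needed(n)
    print(n, m, m / n)          # bits per trial

print(math.log2(3))             # ~1.585 = -sum p_i log2 p_i for p_i = 1/3
```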

So $-p_i\log p_i$ is not really the significant thing. The significant thing is storing the result $i$ in $-\log p_i$ bits. In general, if you stored event $i$ as (an average of) $b_i$ bits, then the "expected" number of bits in a single trial would be:

$$\sum p_ib_i$$

It's just that the ideal storage, which minimizes the expected number of bits for a huge number of trials, is $b_i=-\log p_i$.
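
To see that choice numerically, here is a small Python sketch comparing the expected bits per trial for the ideal lengths $b_i=-\log_2 p_i$ against some mismatched length assignments (the alternatives are arbitrary, chosen only for illustration):

```python
import math

p = [0.5, 0.25, 0.25]   # true probabilities of the three events

def expected_bits(p, lengths):
    """Expected bits per trial when event i is stored in lengths[i] bits."""
    return sum(pi * bi for pi, bi in zip(p, lengths))

ideal = [-math.log2(pi) for pi in p]        # b_i = -log2 p_i  ->  [1, 2, 2]
print(expected_bits(p, ideal))              # 1.5   (the entropy)
print(expected_bits(p, [2, 1, 2]))          # 1.75  (worse)
print(expected_bits(p, [2, 2, 2]))          # 2.0   (worse)
```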

Thomas Andrews

You can think of $S=S(p)$ as the complexity of the system, and of $-p_i\log p_i$ as the "potential complexity" of a particular state, although it is unusual to consider one term in isolation from the others.

A more conceptual answer is that everything is devised so that the entropy of independent configurations adds: this is due to the fact that $f(x)=-x\log x$ is the only nonnegative function, up to a scalar multiple, satisfying $$ S(p*q)=S(p)+S(q),\quad \text{taking}\quad S(p)=\sum_{i=1}^Nf(p_i), $$ for all $p$ and $q$, where $p*q$ is the probability vector with the $N^2$ components $p_iq_j$.
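
To verify the additivity directly, expand $\log(p_iq_j)=\log p_i+\log q_j$ and use $\sum_i p_i=\sum_j q_j=1$:

$$ S(p*q)=\sum_{i,j}-p_iq_j\log(p_iq_j)=\sum_j q_j\sum_i\bigl(-p_i\log p_i\bigr)+\sum_i p_i\sum_j\bigl(-q_j\log q_j\bigr)=S(p)+S(q). $$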

John B