I'm trying to understand entropy and KL divergence. While it makes sense in a simple case such as a coin flip, I struggle to wrap my head around more complicated cases where the information content is not a whole number. I am trying to imagine it in the form of a binary tree, where $$ \log_2\left(\frac{1}{p(x)}\right) $$ is the depth of the leaf in the binary tree. This would give us the number of moves we have to take to reach the leaf from the root. However, if we have something like $$ p(x_1) = \frac{7}{8}, \quad p(x_2) = \frac{1}{8}, $$ I struggle to visualize the meaning beyond the purely functional one. How can I interpret the statement that we have $$ \log_2\left(\frac{8}{7}\right) \approx 0.193 $$ "bits" of information? Is there a way to visualize this, preferably in the style of binary tree codes?
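For concreteness, here is the small Python sketch I use to get these numbers (nothing beyond the definitions above, just so it is clear what I am trying to picture):

```python
import math

# My example distribution: p(x1) = 7/8, p(x2) = 1/8
p = [7/8, 1/8]

# Self-information of each outcome, in bits: log2(1/p(x))
info = [math.log2(1 / px) for px in p]      # [0.193..., 3.0]

# Entropy: the probability-weighted average of the self-information
H = sum(px * i for px, i in zip(p, info))   # about 0.544 bits

print(info, H)
```

The 3 bits for $x_2$ I can picture as a depth-3 leaf; it is the 0.193 bits for $x_1$ that I cannot place in a tree.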
- Hi, I am still trying to find a good intuitive explanation as well... so far the best explanation along those lines is the one given in this paper: "Transmission of Information" (1928), R. V. L. Hartley... hope it helps. – Joako Jun 12 '23 at 22:35
- @Joako this helped me a lot: https://www.youtube.com/watch?v=v68zYyaEmEA You must "accept" the bit as a different unit of probability. A binary tree's number of layers scales logarithmically. To convert a binary tree with $a$ leaves into a tree with $b$ leaves, we need to double the leaves $\log_2(b) - \log_2(a)$ times, which can be interpreted as: the full depth difference of the trees plus a remainder. The remainder doubles a certain fraction of the leaves, while the full depth difference doubles them all. So to convert a binary tree with 7 leaves into one with 8, we would need $7 \cdot 2^{0.193} \approx 8$. – nullexception Jun 13 '23 at 04:17
- You might enjoy the stackexchange discussion here or the introductory discussions in Elements of Information Theory, where the authors describe the entropy in a similar way to you: "the minimum expected number of binary questions required to determine the value of $X$". They prove that characterization in Chapter 5, but I haven't actually read that part. :D – Josh Keneda Jun 13 '23 at 06:48
1 Answer
Perhaps focusing on the definition of entropy as an expected value may help you.
Remember that a continuous random variable (RV) $X$ with a distribution $p(x)$ has an expected value given by $$ \langle X\rangle = \int x\, p(x)\, dx. $$
By analogy, and referring to the definition of entropy (here I'm using the continuous case), one has that $$ H = \langle -\log(p(x)) \rangle = -\int{p(x)\log(p(x))dx} $$
Now, what is the meaning of this unusual RV, $-\log(p(x))$, whose expected value we are taking? First, note that the minus sign and the log allow us to express $H$ as $$ H = \left\langle \log\left(\frac{1}{p(x)}\right)\right\rangle $$
Look at the expression above and think about the magnitude of $p$ in two extreme cases:
- A very common event $x$, which roughly leads to $p(x) \approx 1$, and then $\log(1/p(x)) \approx 0$; and
- A very rare event $x$, which roughly leads to $p(x) \approx 0$, making $\log(1/p(x))$ grow significantly.
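To put rough numbers on these two extremes (just an illustrative sketch in Python, with arbitrarily chosen probabilities):

```python
import math

for p in (0.99, 0.5, 0.01):
    print(f"p = {p:4}:  -log2(p) = {-math.log2(p):.3f} bits")
# p = 0.99:  -log2(p) = 0.014 bits   (common event, almost no information)
# p =  0.5:  -log2(p) = 1.000 bits
# p = 0.01:  -log2(p) = 6.644 bits   (rare event, a lot of information)
```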
Well, think about the amount of information these two extreme cases carry: which one of them is more informative? The common event, which is conceptually ordinary; or the rare event, which by its very definition tells us that something unusual is going to happen?
Conceptually speaking, then, $-\log(p(x))$ may be seen as the amount of information carried by the event $x$. Therefore, $H$ would correspond to the average amount of information carried by the system, since a sum over all events is being performed (the integral playing the role of that sum).
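If it helps to see the "expected value" reading in action, here is a minimal numerical sketch (assuming numpy and scipy are available; the standard Gaussian is used purely as a convenient example density): averaging the information content $-\log p(x)$ over samples drawn from $p$ approaches the (differential) entropy.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Draw samples from p(x); here p is a standard Gaussian, purely as an example
x = norm.rvs(size=100_000, random_state=rng)

# The random variable whose average we take: the information content -log p(x)
info = -norm.logpdf(x)

# Its sample mean approximates H = E[-log p(X)] = -integral of p(x) log p(x) dx
print(info.mean())     # roughly 1.42 nats
print(norm.entropy())  # closed form: 0.5 * log(2*pi*e) = 1.4189... nats
```

The sample average settles on the closed-form value, which is exactly the "average amount of information carried by the system" described above (in nats here, since the natural log is used).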
- Maybe you should differentiate better between entropy (defined for a discrete random variable) and differential entropy (defined for a continuous random variable). The question seems to relate to the (discrete) entropy. – PC1 Jun 13 '23 at 04:42
- @PC1, except for a well-known issue with the reference value, which exists only in the continuous case, it is immaterial to distinguish between discrete and continuous ("differential") entropy, since the meaning of the sum is the same for both: taking the expected value. – bytesAndDishes1532 Jun 13 '23 at 07:37