Boltzmann "soft max" distribution

Question

Formula is here: $$ p(i)=\frac{e^\frac{f(i)}{T}}{\displaystyle \sum_j e^\frac{f(j)}{T}} $$

Prove:

1) Each $p(i)$ is a number between $0$ and $1$, no matter what the fitness is (positive or negative). This scheme does not require that fitness has to be positive.

2) The sum of all the $p(i)$'s is $1$, i.e. this is a probability distribution.

3) No matter what $T$ is:

If two items have the same fitness, they have the same probability of being picked.
If all fitnesses are the same, we pick an item uniformly at random.

4) No matter what the fitnesses are:

As $T\to\infty$ we tend to pick an item uniformly at random.
As $T\to0$ we tend to pick only the best item. That is, its probability is $1$, the probability of all others is $0$. If there are $m$ joint best items, we pick them with probability $1/m$, and all others with probability $0$.

For (2), ... if the denominator is summing up $n$ or so terms, and then you add $n$ of the $p(i)$, well... — J. M. ain't a mathematician, Aug 14 '11 at 17:40
For (1): Why would the denominator be always greater than or equal to the numerator? — J. M. ain't a mathematician, Aug 14 '11 at 17:41
...you really could have started with defining the $p$, $f$, and $T$ in the formula you gave, though... — J. M. ain't a mathematician, Aug 14 '11 at 17:42

André Nicolas · Answer 1 · 2011-08-15T14:19:19.430

Below is a probably excessively detailed answer. The first few items are plain algebra. The $4$th question needs some knowledge about the behaviour of the exponential function. If you need more detail there, please leave a comment.

To make typing easier, let $$a_i=e^{\frac{f(i)}{T}}.$$ (Actually, it is not just to make typing easier. Simpler looking expressions are easier to grasp.) Let $n$ be the number of terms in your sum. Note that $$p(i)=\frac{a_i}{a_1+a_2+\cdots+a_n}.$$

(1): For any $x$, $e^x \gt 0$, so all of the $a_i$ are positive. A ratio of positive numbers is positive, so $p(i)\gt 0$, whatever the sign of $f(i)$. If $n=1$ (one term only!), then $p(1)=a_1/a_1=1$. If $n>1$, then since the $a_i$ are positive, $a_i \lt \sum a_i$, so $p(i)=a_i/(a_1+\cdots +a_n)\lt 1$.

(2): The sum $p(1)+p(2)+\cdots +p(n)$ is equal to $$\frac{a_1}{a_1+a_2+\cdots +a_n}+\frac{a_2}{a_1+a_2+\cdots +a_n}+\cdots+\frac{a_n}{a_1+a_2+\cdots +a_n}.$$ Thus $$p(1)+p(2)+\cdots +p(n)=\frac{a_1+a_2+\cdots+a_n}{a_1+a_2+\cdots +a_n}=1.$$ More compactly, $$\sum p(i)=\sum \frac{a_i}{\sum a_i}=\frac{\sum a_i}{\sum a_i}=1.$$

(3): If $f(i)=f(j)$ (same fitness) for some particular $i$ and $j$ then $a_i=a_j$, so $p(i)=p(j)$. The fact that $p(i)=p(j)$ means that the probability that $i$ is picked is the same as the probability that $j$ is picked.

If $f(i)=f(j)$ for all $i$, $j$, that means that $p(i)=p(j)$ for all pairs $(i,j)$, meaning that everyone has the same probability of being picked. Since the $n$ probabilities add up to $1$ by (2), each of them is $1/n$.
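A quick numerical sanity check of (1) to (3) may help; it is not part of the proof. The sketch below is only an illustration: the function name `boltzmann`, the fitness values, and the temperatures are all made up for the example.

```python
import math

def boltzmann(fitness, T):
    """Return p(i) = e^{f(i)/T} / sum_j e^{f(j)/T} for each fitness value."""
    a = [math.exp(f / T) for f in fitness]   # the a_i defined above
    total = sum(a)                           # a_1 + a_2 + ... + a_n
    return [a_i / total for a_i in a]

# Hypothetical fitness values; note that negative fitness is allowed.
fitness = [-3.0, 0.5, 0.5, 2.0]
p = boltzmann(fitness, T=1.5)

print(all(0 < x < 1 for x in p))     # property (1): each p(i) is between 0 and 1
print(abs(sum(p) - 1.0) < 1e-12)     # property (2): sum is 1, up to rounding
print(p[1] == p[2])                  # property (3): equal fitness, equal probability
print(boltzmann([4.0] * 5, T=0.7))   # all fitnesses equal: each p(i) is about 1/5
```

The comparison for (2) uses a small tolerance because floating-point sums are only approximately exact.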

(4): This is the only part that requires that we look in more detail at the definition.

Limit as $T\to \infty$: As $T\to \infty$, $f(i)/T \to 0$, so $e^{f(i)/T}\to 1$. (Recall that $e^0=1$.) So each $a_i$ approaches $1$. This means that each $p(i)$ approaches $1/(1+1+\cdots +1)$. So each $p(i)$ approaches $1/n$. Thus, when $T$ is very large, all items have nearly equal probability of being picked.
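To see this limit numerically, here is a minimal Python sketch. The function name `boltzmann`, the fitness values, and the chosen temperatures are assumptions made for illustration. It records how far the probabilities are from $1/n$ as $T$ grows.

```python
import math

def boltzmann(fitness, T):
    """Return p(i) = e^{f(i)/T} / sum_j e^{f(j)/T} for each fitness value."""
    a = [math.exp(f / T) for f in fitness]
    total = sum(a)
    return [x / total for x in a]

fitness = [-3.0, 0.5, 2.0, 7.0]   # hypothetical fitness values
n = len(fitness)

gaps = []
for T in (1.0, 10.0, 1e3, 1e6):
    p = boltzmann(fitness, T)
    # Largest distance between any p(i) and the uniform value 1/n.
    gap = max(abs(x - 1.0 / n) for x in p)
    gaps.append(gap)
    print(T, gap)
```

For these values the printed gap shrinks as $T$ increases, which matches the claim that every $p(i)$ approaches $1/n$.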

Limit as $T\to 0$: First a technical remark. Look at $e^{1/T}$ when $T$ is close to $0$. If $T$ is positive, then $e^{1/T}$ is huge. If $T$ is negative, then $1/T$ is very large negative, so $e^{1/T}$ is close to $0$. So it matters very much on which side of $0$ we are! I will assume that the problem really means to ask about what happens when $T$ approaches $0$ from the right, that is, through positive values.

Suppose that there is a single best item. For definiteness, assume that item $1$ is the unique best item. Then $f(1) \gt f(i)$ for all $i\gt 1$. The probability of picking item $1$ is $p(1)$, where $$p(1)=\frac{e^{\frac{f(1)}{T}}} {e^{\frac{f(1)}{T}}+e^{\frac{f(2)}{T}}+\cdots+e^{\frac{f(n)}{T}}}.$$ Divide "top" and "bottom" by $e^{\frac{f(1)}{T}}$, or equivalently multiply top and bottom by $e^{-\frac{f(1)}{T}}$. We obtain $$p(1)=\frac{1} {1+ e^{\frac{f(2)-f(1)}{T}}+ e^{\frac{f(3)-f(1)}{T}}+\cdots+e^{\frac{f(n)-f(1)}{T}}}.$$

Finally, let $T$ approach $0$ from the right. Look now for example at the term $e^{\frac{f(2)-f(1)}{T}}$ in the denominator, or any of the other terms in the denominator except for the initial $1$. Recall that since item $1$ is the unique most fit one, we have $f(2) \lt f(1)$, and therefore $f(2)-f(1)<0$. So when $T$ is very close to $0$ but positive, $e^{\frac{f(2)-f(1)}{T}}$ is very close to $0$, since we are looking at an $e^y$ where $y$ is large negative.

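So as $T$ approaches $0$ from the right, every term in the denominator except the initial $1$ approaches $0$, and therefore $p(1)$ approaches $1$. Since the probabilities add up to $1$ and none of them is negative, $p(i)$ approaches $0$ for every other item.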

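Almost exactly the same idea will work if there are $m$ "fittest" items, say items $1$ to $m$. The difference is that when we divide top and bottom, there will be $m$ $1$'s in the denominator, not just the single $1$ that we had in the unique best item case. The remaining terms still approach $0$, so each of the $m$ fittest items has probability approaching $1/m$, and every other item has probability approaching $0$.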
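The trick of dividing top and bottom by the largest exponential is also handy numerically, because for small positive $T$ the raw value $e^{f(i)/T}$ can overflow. The following Python sketch uses it to look at small $T$; the function name `boltzmann_shifted` and the fitness lists are made up for illustration.

```python
import math

def boltzmann_shifted(fitness, T):
    """Boltzmann probabilities, computed after dividing top and bottom
    by e^{max(f)/T}. Assumes T > 0."""
    best = max(fitness)
    # Every exponent is <= 0, so math.exp cannot overflow for small positive T.
    a = [math.exp((f - best) / T) for f in fitness]
    total = sum(a)   # at least 1, contributed by the best item(s)
    return [x / total for x in a]

unique_best = [1.0, 4.0, 2.5]      # hypothetical: one best item
tied_best = [4.0, 1.0, 4.0, 2.5]   # hypothetical: m = 2 joint best items

for T in (1.0, 0.1, 0.001):
    print(T, boltzmann_shifted(unique_best, T), boltzmann_shifted(tied_best, T))
```

At $T=0.001$ the non-best terms underflow to zero in floating point, so the unique best item gets probability $1$ and the two tied items get $1/2$ each, in line with the limit described above.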
