
This may be a trivial question. I tried to find the proof for the smoothness of the log-sum-exp function $f: \mathbb{R}^N \to \mathbb{R}$, defined as $$f(x) = \log \sum_i \mathrm{exp}(x_i).$$

I have seen multiple papers state that the log-sum-exp function is $1$-smooth with respect to the $\ell_2$ norm, but I had no luck proving it myself or finding a reference online. Can someone point me to a reference, or just give me a proof?

A function is $L$-smooth with respect to the $\ell_2$ norm if $$\Vert \nabla f(x) - \nabla f(y)\Vert_2 \leq L \Vert x - y \Vert_2. $$

Thanks!

1 Answer


To start, we can calculate the first and second partial derivatives of $f$: \begin{align} \frac{\partial}{\partial x_j} f(x) &= \frac{\exp(x_j)}{\sum_i \exp(x_i)} \\ \frac{\partial^2}{\partial x_j^2} f(x) &= \frac{\exp(x_j)\left( \sum_{i} \exp(x_i)\right ) - \exp(x_j)^2}{\left ( \sum_{i} \exp(x_i) \right )^2} \\ &= \frac{\exp(x_j)\left( \sum_{i\neq j} \exp(x_i)\right)}{\left(\sum_i\exp(x_i)\right)^2} \\ \frac{\partial^2}{\partial x_j \partial x_k} f(x) &= -\frac{\exp(x_j)\exp(x_k)}{\left (\sum_{i} \exp(x_i) \right )^2} & (j \neq k) \end{align} These all follow from standard rules (chain rule, quotient rule, derivatives of $\log$ and $\exp$).
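As a quick numerical sanity check (the function names here are my own, not from any library), the partials above can be packaged as the softmax $p = \nabla f(x)$ and the matrix $\mathrm{diag}(p) - pp^\top$, and compared against finite differences:

```python
import numpy as np

def logsumexp(x):
    m = x.max()  # shift by the max for numerical stability
    return m + np.log(np.sum(np.exp(x - m)))

def lse_grad(x):
    # The gradient is the softmax of x, matching the first partials above.
    e = np.exp(x - x.max())
    return e / e.sum()

def lse_hessian(x):
    # diag(p) - p p^T reproduces both the diagonal and off-diagonal partials.
    p = lse_grad(x)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
x = rng.normal(size=5)
h = 1e-5
I = np.eye(5)

# Central finite differences of f and of its gradient.
fd_grad = np.array([(logsumexp(x + h * I[j]) - logsumexp(x - h * I[j])) / (2 * h)
                    for j in range(5)])
fd_hess = np.array([(lse_grad(x + h * I[j]) - lse_grad(x - h * I[j])) / (2 * h)
                    for j in range(5)])

assert np.allclose(fd_grad, lse_grad(x), atol=1e-8)
assert np.allclose(fd_hess, lse_hessian(x), atol=1e-8)
```

Writing the Hessian as $\mathrm{diag}(p) - pp^\top$ also makes its symmetry obvious, which is used below.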

Next we are going to apply the Gershgorin Circle Theorem to the Hessian of $f$. To do so we need to define the Gershgorin discs. For each row $j$, the disc $D_j$ is defined by its center $$c_j \triangleq Hf(x)_{jj} = \frac{\partial^2}{\partial x_j^2} f(x)$$ and its radius $$ r_j = \sum_{k\neq j} |Hf(x)_{jk}| = \sum_{k \neq j} \left | \frac{\partial^2}{\partial x_j \partial x_k} f(x)\right |. $$ The Gershgorin Circle Theorem asserts that the eigenvalues of $Hf(x)$ are contained in the union of the discs $D_j$. We also know, by the Spectral Theorem and the fact that the Hessian is symmetric, that the eigenvalues of $Hf(x)$ are real. What we want to show is that $c_j + r_j \leq 1$ and $c_j - r_j \geq -1$ for every $j$, since this will imply that all the eigenvalues of $Hf(x)$ are contained in $[-1,1]$, for any $x$.
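To see the theorem in action before applying it to $Hf$, here is a small symmetric example of my own: every eigenvalue lands in at least one interval $[c_j - r_j,\, c_j + r_j]$.

```python
import numpy as np

# A small symmetric matrix chosen purely for illustration.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 1.0]])

centers = np.diag(A)                          # c_j = A_jj
radii = np.abs(A).sum(axis=1) - np.abs(centers)  # r_j = sum of off-diagonal |A_jk|
eigenvalues = np.linalg.eigvalsh(A)           # real, since A is symmetric

# Gershgorin: each eigenvalue lies in the union of the discs.
covered = [any(abs(lam - c) <= r for c, r in zip(centers, radii))
           for lam in eigenvalues]
assert all(covered)
```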

To start, note that from the calculations above we have \begin{align} c_j - r_j &= \frac{\exp(x_j)\left( \sum_{i\neq j} \exp(x_i)\right)}{\left(\sum_i\exp(x_i)\right)^2} - \sum_{k\neq j} \frac{\exp(x_j)\exp(x_k)}{\left (\sum_{i} \exp(x_i) \right )^2} = 0 \geq -1 \\ c_j + r_j &= \frac{\exp(x_j)\left( \sum_{i\neq j} \exp(x_i)\right)}{\left(\sum_i\exp(x_i)\right)^2} + \sum_{k\neq j} \frac{\exp(x_j)\exp(x_k)}{\left (\sum_{i} \exp(x_i) \right )^2} \\ &= \frac{2\exp(x_j)\left( \sum_{i\neq j} \exp(x_i)\right)}{\left(\sum_i\exp(x_i)\right)^2} \leq \frac{1}{2} \end{align} To see the last inequality, let $\alpha = \exp(x_j)$ and $\beta = \sum_{k \neq j} \exp(x_k)$. Then it is equivalent to show $$ \frac{2\alpha\beta}{(\alpha + \beta)^2} \leq \frac{1}{2} \iff 4\alpha \beta \leq (\alpha + \beta)^2 \iff 0 \leq \alpha^2 + 2\alpha\beta + \beta^2 - 4\alpha \beta = (\alpha - \beta)^2, $$ and the last inequality clearly holds. (Also note that $\alpha,\beta > 0$, so there is no division by zero.) Overall this shows that the real parts of the discs $D_j$ are contained in the interval $[0,1/2]$, which is actually stronger than what you asked for.
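The conclusion of the disc argument, that every eigenvalue of $Hf(x)$ lies in $[0, 1/2]$ at every point, can be checked numerically at random points (a sanity check of my own, with a small tolerance for floating-point roundoff):

```python
import numpy as np

def lse_hessian(x):
    # Hessian of log-sum-exp: diag(p) - p p^T with p = softmax(x).
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    p = e / e.sum()
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(1)
worst_min, worst_max = np.inf, -np.inf
for _ in range(500):
    x = rng.normal(scale=3.0, size=int(rng.integers(2, 10)))
    eigs = np.linalg.eigvalsh(lse_hessian(x))
    worst_min = min(worst_min, eigs.min())
    worst_max = max(worst_max, eigs.max())

# All eigenvalues observed lie in [0, 1/2], up to roundoff.
assert worst_min >= -1e-12
assert worst_max <= 0.5 + 1e-12
```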

Now we need one more analysis fact: for any differentiable function $F:\mathbb{R}^d \rightarrow \mathbb{R}^d$, if the operator norm of the Jacobian of $F$ satisfies $$\sup_x \Vert JF(x)\Vert_{\text{op}} \leq L,$$ then $F$ is $L$-Lipschitz. That is to say, under this condition $\Vert F(x) - F(y)\Vert_2 \leq L\Vert x-y\Vert_2$ for every $x,y$. (This follows from the mean value inequality applied along the segment from $x$ to $y$.)

Note that $\nabla f:\mathbb{R}^d \rightarrow \mathbb{R}^d$, and by definition the Jacobian $J(\nabla f) = Hf$, the Hessian of $f$. The argument above shows that $\Vert Hf(x)\Vert_{\text{op}} \leq 1/2$, since $Hf(x)$ is symmetric and all of its eigenvalues lie in $[0,1/2]$. In turn, $\nabla f$ is $(1/2)$-Lipschitz, so we arrive at the result $$\Vert\nabla f(x) - \nabla f(y)\Vert_2 \leq \frac{1}{2}\Vert x - y \Vert_2,$$ which is actually stronger than what you asked for.
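The final Lipschitz bound can also be checked empirically (again my own sanity check, not part of the proof): at random pairs of points the ratio $\Vert\nabla f(x) - \nabla f(y)\Vert_2 / \Vert x - y\Vert_2$ should never exceed $1/2$.

```python
import numpy as np

def lse_grad(x):
    # Gradient of log-sum-exp is the softmax of x.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
ratios = []
for _ in range(1000):
    x, y = rng.normal(scale=4.0, size=(2, 6))
    ratios.append(np.linalg.norm(lse_grad(x) - lse_grad(y))
                  / np.linalg.norm(x - y))

# The smoothness constant 1/2 bounds every observed ratio.
assert max(ratios) <= 0.5 + 1e-12
```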

Above we only needed the eigenvalues to lie in $[-1,1]$ to ensure that $\nabla f$ is $1$-Lipschitz; it just happens that we can do a little better than that.

  • Thanks Matt. I still need to take a closer look at the Gershgorin Circle Theorem. But thank you again for such a detailed answer! – Lyapunov1729 Oct 20 '22 at 14:46