9

Define the symmetric softmax of a vector $x\in \mathbb{R}^n$ to be $$L(x)=\log\sum_i(e^{x_i}+e^{-x_i}).$$

Equation (6) in this paper states that for all $x$ and $y$, $$\Vert\nabla L(x)-\nabla L(y)\Vert_1 \leqslant \Vert x-y\Vert_{\infty}.$$

(Apparently, this property is called 1-smoothness in optimisation)

I'm having a hard time proving this. I also tried to look for a proof but couldn't find one. I'd appreciate someone pointing me to a reference containing a proof. Thanks.
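For what it's worth, a quick numerical sanity check (a minimal Python/NumPy sketch; the gradient formula is worked out by hand from the definition above, and the helper name `grad_L` is just mine) is consistent with the claimed inequality, so it really is just the proof that I'm missing:

```python
import numpy as np

def grad_L(x):
    # Gradient of L(x) = log sum_i (e^{x_i} + e^{-x_i});
    # component i is (e^{x_i} - e^{-x_i}) / sum_j (e^{x_j} + e^{-x_j}).
    denom = np.sum(np.exp(x) + np.exp(-x))
    return (np.exp(x) - np.exp(-x)) / denom

rng = np.random.default_rng(0)
worst = -np.inf
for _ in range(10_000):
    n = rng.integers(1, 6)
    x, y = rng.normal(scale=3.0, size=n), rng.normal(scale=3.0, size=n)
    lhs = np.abs(grad_L(x) - grad_L(y)).sum()   # ||grad L(x) - grad L(y)||_1
    rhs = np.abs(x - y).max()                   # ||x - y||_inf
    worst = max(worst, lhs - rhs)
print(worst)  # stays <= 0 in my runs
```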

Fraïssé
  • 11,275
user24121
  • 327

4 Answers

4

Edit: This question has been bugging my mind these days, to the extent that it forced me to start a bounty! Now if someone has any insights about this, please give it a try. I think I might have been on the right track, but now I'm lost.

$$L(x)=\log\sum_{i=1}^n(e^{x_i}+e^{-x_i})=\log 2+\log\sum_{i=1}^n\cosh x_i$$

Hence

$$\frac{\partial L}{\partial x_k}=\frac{\sinh x_k}{\sum_{i=1}^n\cosh x_i}$$

First we prove $$\Vert \nabla L(x)\Vert_1\le\Vert x\Vert_\infty.$$ Knowing that $\Vert v\Vert_1=\sum|v_i|$ and $\Vert v\Vert_\infty=\sup|v_i|$, and also that $$|\sinh z|\le |z|\cosh z,\; \forall z\in\mathbb R,$$ we can write: $$\begin{align} \Vert \nabla L(x)\Vert_1=\sum_{k=1}^n\left|\frac{\partial L}{\partial x_k}\right|&= \frac{\sum_{i=1}^n|\sinh x_i|}{\sum_{i=1}^n\cosh x_i}\\ &\le\frac{\sum_{i=1}^n|x_i|\cosh x_i}{\sum_{i=1}^n\cosh x_i} \le\frac{\Vert x\Vert_\infty\sum_{i=1}^n\cosh x_i}{\sum_{i=1}^n\cosh x_i}\\&=\Vert x\Vert_\infty\end{align}$$

Now write $y=x+\delta$; it remains to show that $\Vert\nabla L(x+\delta)-\nabla L(x)\Vert_1\le\Vert\delta\Vert_\infty$.
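(As a quick aside, here is a minimal Python/NumPy sketch, just a spot check with helper names of my own choosing, comparing the partial-derivative formula above against central finite differences and checking the single-point bound at one point:)

```python
import numpy as np

def L(x):
    # L(x) = log sum_i (e^{x_i} + e^{-x_i})
    return np.log(np.sum(np.exp(x) + np.exp(-x)))

def grad_L(x):
    # dL/dx_k = sinh(x_k) / sum_i cosh(x_i)
    return np.sinh(x) / np.cosh(x).sum()

x = np.array([0.7, -1.3, 2.1])
eps = 1e-6
fd = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps) for e in np.eye(len(x))])
print(np.max(np.abs(fd - grad_L(x))))            # tiny, so the formula checks out
print(np.abs(grad_L(x)).sum(), np.abs(x).max())  # ||grad L(x)||_1 vs ||x||_inf
```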

But... the statement seems a bit hard to prove, and I doubt that changing $y$ to $x+\delta$ will get us anywhere. By the way, regarding this question, I came up with something like this:

Let $p=\nabla L(x)$ and $q=\nabla L(y)$, and define: $$M=\sum_{j=1}^n q_j\log p_j$$ Then we have $$q_i-p_i=\frac{\partial M}{\partial x_i},$$ and we will need to show that $\Vert \nabla M\Vert_1\le\Vert y-x\Vert_\infty$. It seems that $\Vert p\Vert_1\le\Vert x\Vert_\infty$ and $\Vert q\Vert_1\le\Vert y\Vert_\infty$ are valuable pieces of information, but I wasn't able to go any further.

polfosol
  • 9,245
2

Here's an actual proof. It is inspired by one I found on page 116 of "First-Order Methods in Convex Optimization", although the book builds on a lot of theory that isn't really necessary when we are only interested in $L(x)$, so I cut out all of that and reduced it to the essential steps.

As the question notes, we are interested in proving 1-smoothness of the symmetric softmax. To that end, we first define in general terms what we mean by $L$-smoothness:

Definition ($L$-Smoothness, somewhat informal): Let $f: \mathbb{R}^n \to \mathbb{R}$ be some function and consider some norm $\lVert\cdot\rVert$ on $\mathbb{R}^n$. $f$ is $L$-smooth if it is differentiable and $\lVert \nabla f(x) - \nabla f(y)\rVert_* \le L \lVert x - y \rVert$, where $\lVert\cdot\rVert_*$ is the dual norm defined for the dual space ${\mathbb{R}^n}^*$ of $\mathbb{R}^n$ through $\lVert v \rVert_* = \max_x \{ \langle v, x \rangle \mid \lVert x \rVert = 1 \}$ for $v \in {\mathbb{R}^n}^*$.

Now in our case, $\lVert\cdot\rVert$ is the supremum norm, and its dual norm is the 1-norm, which shows that 1-smoothness is indeed the property we are interested in. We make use of the following lemma, proved at the end:

Lemma: If $f$ is convex and twice differentiable, then $f$ is $L$-smooth if $\langle d, \nabla^2 f(x) \cdot d \rangle \le L \lVert d \rVert^2$ for all $x, d \in \mathbb{R}^n$.

(As an aside: The proof hence proceeds exactly as in this paper, Proposition 4, but with different norms)

To simplify the calculation, we will prove a stronger result, namely 1-smoothness of the unsymmetric softmax $S(x) = \log\left( \sum_i e^{x_i} \right)$, because, as user24121 observes, we can always plug in $x = [x_1, \ldots, x_n, -x_1, \ldots, -x_n]$ to obtain the symmetric softmax. After some calculation, we see that the Hessian of $S$ is given by $$ \nabla^2 S(x) = \mathrm{diag}(\sigma(x)) - \sigma(x)\sigma(x)^T $$ where $$ \sigma(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $$ And now the computation is straightforward: \begin{align} d^T \nabla^2 S(x) d &= d^T \mathrm{diag}(\sigma(x))d - (\sigma(x)^Td)^2 \\ &\le d^T \mathrm{diag}(\sigma(x))d \\ &\le \lVert d \rVert_\infty^2 \lVert \sigma(x) \rVert_1 \le \lVert d \rVert_\infty^2 \end{align}
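(Since the whole argument rests on the displayed bound on the quadratic form, here is a minimal Python/NumPy sketch, with helper names of my own, spot-checking both the Hessian formula and the bound $d^T \nabla^2 S(x)\, d \le \lVert d \rVert_\infty^2$ on random inputs:)

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())        # shift for numerical stability
    return z / z.sum()

def hessian_S(x):
    # Hessian of S(x) = log sum_i e^{x_i}:  diag(sigma) - sigma sigma^T
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

rng = np.random.default_rng(1)
worst = -np.inf
for _ in range(10_000):
    x = rng.normal(scale=2.0, size=4)
    d = rng.normal(scale=2.0, size=4)
    quad = d @ hessian_S(x) @ d
    worst = max(worst, quad - np.max(np.abs(d)) ** 2)
print(worst)  # <= 0 up to rounding, consistent with d^T H d <= ||d||_inf^2
```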


The only thing left to do now is to prove the lemma:

Proof of the Lemma: By Taylor's Theorem, for any $x, d \in \mathbb{R}^n$, there exists $\xi \in \mathbb{R}^n$ such that $$ f(x + d) = f(x) + \nabla f(x)^T d + \frac12 d^T \nabla^2 f(\xi) d \overset{\text{assumption}}\le f(x) + \nabla f(x)^T d + \frac{L}{2} \lVert d \rVert^2 $$ We continue by proving that any $f$ satisfying the inequality is $L$-smooth. To do so, we must overcome the hurdle that we only have expressions of the form $\langle \nabla f(x + d), d \rangle$, but the definition of the dual norm requires the second argument to be independent of the first. To get around this, we take a detour via the Bregman distance:

Let $D_f(x, d) = f(x + d) - f(x) - \nabla f(x)^T d \le \frac{L}{2}\lVert d \rVert^2$ (this is the Bregman distance, although we do not require $f$ to be strictly convex). Note that because $f$ is convex, $f(x + d) \ge f(x) + \nabla f(x)^Td$ and hence $D_f(x, d) \ge 0$ for all $d$. At the same time, for any $\delta \in \mathbb{R}^n$, $$ D_f(x, d + \delta) = f(x + d + \delta) - f(x) - \nabla f(x)^T (d + \delta) $$ Bounding $f((x + d) + \delta)$ through the bound at the start of this proof yields \begin{align} D_f(x, d + \delta) &\le f(x + d) - f(x) - \nabla f(x)^T (d + \delta) + \nabla f(x + d)^T \delta + \frac{L}{2}\lVert \delta \rVert^2 \\ &= D_f(x, d) + \langle \nabla f(x + d) - \nabla f(x), \delta \rangle + \frac{L}{2} \lVert \delta \rVert^2 \end{align}

Now with the very specific choice of $\delta = -\frac{\lVert \nabla f(x + d) - \nabla f(x) \rVert_*}{L} v$ for any $v \in \mathbb{R}^n$ with $\lVert v \rVert = 1$ and $\langle \nabla f(x + d) - \nabla f(x), v \rangle = \lVert \nabla f(x + d) - \nabla f(x) \rVert_*$, we can compute \begin{align} 0 &\le D_f(x, d + \delta) \le D_f(x, d) - \frac{\lVert \nabla f(x + d) - \nabla f(x) \rVert_*}{L} \langle \nabla f(x + d) - \nabla f(x), v \rangle + \frac{1}{2L} \lVert \nabla f(x + d) - \nabla f(x) \rVert_*^2 \\ &= D_f(x, d) - \frac{1}{2L} \lVert \nabla f(x + d) - \nabla f(x) \rVert_*^2 \end{align} Hence $$ \frac{1}{2L} \lVert \nabla f(x + d) - \nabla f(x) \rVert_*^2 \le D_f(x, d) \le \frac{L}{2} \lVert d \rVert^2 $$ Multiplying by $2L$ and taking the square root completes the proof.

henrikl
  • 108
1

So I've been messing around with this problem and I've made some progress, although I'm not at a full proof just yet. Maybe someone can pick up from here.

First, we notice that working with $L(x)=\log\sum_i e^{x_i}$ is without loss of generality, since we can just plug in $[x^T, -x^T]^T$ to recover the original symmetric softmax.

For any $x,d \in \mathbb{R}^n$, \begin{align*} \Vert\nabla L(x+d)-\nabla L(x)\Vert_1 &= \sum_i \left| \frac{e^{x_i+d_i}}{\sum_j e^{x_j+d_j}}-\frac{e^{x_i}}{\sum_j e^{x_j}}\right|\\ &= \sum_i \left| \frac{\sum_je^{x_i+x_j+d_i}-\sum_je^{x_i+x_j+d_j}}{\sum_j \sum_k e^{x_j+x_k+d_j}}\right|\\ &\le \frac{\sum_i \sum_j e^{x_i+x_j}|e^{d_i}-e^{d_j}|}{\sum_i \sum_je^{x_i+x_j}e^{d_i}}\stackrel{?}{\leqslant}\Vert d\Vert_{\infty}=:D, \end{align*} where the third step applies the triangle inequality to the numerator.

By the positivity of $e^{x_i+x_j}$, it suffices to show that $\sum_i \sum_j |e^{d_i}-e^{d_j}|-De^{d_i}\leqslant 0$ (i.e., $\sum_i \sum_j |e^{d_i}-e^{d_j}|-nD\sum_i e^{d_i}\leqslant 0$). At this point, we can assume without loss of generality that $d_1 \geqslant d_2 \geqslant \cdots \geqslant d_n$, then collect terms and simplify the left-hand side of the required inequality (specifically the double-sum) to: \begin{align*} \sum_i \sum_j |e^{d_i}-e^{d_j}|-nD\sum_i e^{d_i} &= \sum_k 2(n-2k+1)e^{d_k}-nD\sum_k e^{d_k}\\ &=\sum_k ((2-D)n-4k+2)e^{d_k}. \end{align*} If all coefficients of $e^{d_k}$ are non-positive, we are done. Otherwise, we define $k_0$ to be the largest index for which the corresponding coefficient is positive; i.e., $$k_0 = \left\lfloor \frac{(2-D)n+2}{4}\right\rfloor.$$ Then we can split the terms into ones with positive and nonpositive coefficients and say: \begin{align*} \sum_{k=1}^n ((2-D)n-4k+2)e^{d_k} \leqslant \left(\sum_{k=1}^{k_0} ((2-D)n-4k+2)\right)e^{d_1}+\left(\sum_{k=k_0+1}^{n} ((2-D)n-4k+2)\right)e^{d_n}. \end{align*}
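(As an aside, the collect-and-simplify step is easy to get wrong, so here is a small Python/NumPy sketch of mine verifying the identity $\sum_i \sum_j |e^{d_i}-e^{d_j}| = \sum_k 2(n-2k+1)e^{d_k}$ for sorted $d$ on random inputs:)

```python
import numpy as np

rng = np.random.default_rng(4)
for _ in range(1000):
    n = int(rng.integers(2, 7))
    d = -np.sort(-rng.normal(size=n))    # d_1 >= d_2 >= ... >= d_n
    lhs = sum(abs(np.exp(d[i]) - np.exp(d[j])) for i in range(n) for j in range(n))
    # k runs 1..n below, matching the 1-based indexing of the formula
    rhs = sum(2 * (n - 2 * k + 1) * np.exp(d[k - 1]) for k in range(1, n + 1))
    assert np.isclose(lhs, rhs)
print("identity verified on random samples")
```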

So the problem boils down to a condition on just two variables. After some massaging of the right-hand side, the condition to be shown is: $$2k_0(n-k_0-0.5Dn)(e^{d_1}-e^{d_n})-Dn^2e^{d_n}\stackrel{?}{\leqslant}0.$$

This seems to check out empirically. We can also note that whenever $D\geqslant 2$, all coefficients are nonpositive, so we can restrict our search to $-2\leqslant d_n \leqslant d_1 \leqslant 2$.

That's as far as I've gotten. If someone has any ideas on how to prove this, it would be much appreciated.

EDIT: So here's what the left-hand side of that condition looks like as a function of $(d_1,d_n)$. It's maximised at $(d_1,d_n)=(0,0)$ with value $0$, and as you head towards $(-2,-2)$, it drops and then starts tending to zero again. This is for $n=2$, but it's pretty much the same for any $n$.
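(Here is a minimal Python/NumPy sketch of that empirical check for $n=2$, sweeping the two-variable expression over a grid; the function name and the clipping of $k_0$ to a valid range are my own additions, not part of the derivation above:)

```python
import numpy as np

def condition_lhs(d1, dn, n):
    # 2*k0*(n - k0 - 0.5*D*n)*(e^{d1} - e^{dn}) - D*n^2*e^{dn},  with D = ||d||_inf
    D = max(abs(d1), abs(dn))                # since d1 >= ... >= dn
    k0 = int(np.floor(((2 - D) * n + 2) / 4))
    k0 = min(max(k0, 0), n)                  # safeguard: keep k0 in {0, ..., n}
    return 2 * k0 * (n - k0 - 0.5 * D * n) * (np.exp(d1) - np.exp(dn)) - D * n**2 * np.exp(dn)

n = 2
grid = np.linspace(-2, 2, 401)
vals = [condition_lhs(d1, dn, n) for d1 in grid for dn in grid if dn <= d1]
print(max(vals))   # 0.0 in this sweep, attained at d1 = dn = 0
```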

user24121
  • 327
0

Here's another unfinished approach: we try to prove that for all $x, y \in \mathbb{R}^n$, $$ \lVert \nabla L(x) - \nabla L(y) \rVert_1 \le \frac1n \lVert x - y \rVert_1 \le \lVert x - y \rVert_\infty $$ where the last inequality is the standard relationship between $p$-norms, $\lVert z \rVert_1 \le n \lVert z \rVert_\infty$. That is, we reduce the question to showing that the gradient of $L$ is $\frac1n$-Lipschitz with respect to the 1-norm. To do this, we can compute the operator norm of the Hessian of $L$, because $$ \lVert \nabla L(x) - \lVert \nabla L(y) \rVert_1 \le \frac1n \lVert x - y \rVert_1 \text{ for all } x, y \iff \sup_{x}\lVert \nabla^2 L(x) \rVert_1 \le \frac1n $$ A proof sketch of this fact is at the end of this post. Note that the L1 operator norm is just the maximum L1 norm of the rows (equivalently, columns, because $\nabla^2 L$ is symmetric) of $\nabla^2 L$. After some calculation, we see that the 1-norm of the $i$-th row of the Hessian of $L$ is given by $$ \lVert (\nabla^2 L(x))_{i,:} \rVert_1 = \frac{e^{x_i} + e^{-x_i}}{S} - \sigma_i^2 + |\sigma_i| \sum_{j \ne i} |\sigma_j| $$ where $$ S = \sum_j e^{x_j} + e^{-x_j} \qquad\qquad \sigma_j = \frac{e^{x_j} - e^{-x_j}}{S} \quad (j = 1, \ldots, n) $$

So we can consider all $x_j$ except $x_i$ arbitrary but fixed, and maximise this expression with respect to $x_i$. If it ends up being at most $\frac1n$ then we are done. Note that $$ \lim_{x_i \to \pm \infty} \lVert (\nabla^2 L(x))_{i,:} \rVert_1 = 0 \qquad \lVert (\nabla^2 L(\mathbf{0}))_{i,:} \rVert_1 = \frac1n $$ which lends some (small amount of) credence to the idea that this approach might work, but unfortunately maximising $\lVert (\nabla^2 L(x))_{i,:} \rVert_1$ is very painful. Empirically it appears to hold at least for $n = 2$, but (again empirically) for any $x_j$ there is always an $x_i$ that makes it hold with equality, so there is no room for imprecision and hence probably no shortcut to maximising $\lVert (\nabla^2 L(x))_{i,:} \rVert_1$.
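(Here is a minimal Python/NumPy sketch of mine of that empirical check, restricted to $n = 2$ since that is the only case claimed above; it computes the maximum row 1-norm of the Hessian over random inputs:)

```python
import numpy as np

def hessian_L(x):
    # Hessian of L(x) = log S with S = sum_j (e^{x_j} + e^{-x_j}):
    #   H = diag((e^{x_i} + e^{-x_i}) / S) - sigma sigma^T,
    # where sigma_i = (e^{x_i} - e^{-x_i}) / S is the i-th gradient entry.
    S = np.sum(np.exp(x) + np.exp(-x))
    sigma = (np.exp(x) - np.exp(-x)) / S
    return np.diag((np.exp(x) + np.exp(-x)) / S) - np.outer(sigma, sigma)

rng = np.random.default_rng(2)
n = 2                      # the empirical claim in this answer is only made for n = 2
worst = 0.0
for _ in range(10_000):
    x = rng.normal(scale=3.0, size=n)
    # L1 operator norm = maximum row 1-norm (the Hessian is symmetric)
    worst = max(worst, np.abs(hessian_L(x)).sum(axis=1).max())
print(worst, 1 / n)        # comes out just below 1/2 in my runs; the bound looks tight
```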


We want to prove that for any continuously differentiable $f: \mathbb{R}^{n} \to \mathbb{R}^{n}$ and any norm $\lVert\cdot\rVert$, $\lVert f(x) - f(y) \rVert \le C\lVert x - y\rVert \iff \sup_{x} \lVert Df(x) \cdot v \rVert \le C \lVert v \rVert$ for all $v \in \mathbb{R}^{n}$, where $Df(x)$ is the Jacobian of $f$ taken at point $x$. The fundamental theorem of calculus provides $$ f(y) - f(x) = \int_{\theta = 0}^{1} Df(x_{\theta}) \cdot (y - x) d\theta \qquad x_{\theta} := x + \theta(y - x) $$ For the first direction, observe that $\lVert f(x) - f(y) \rVert \le C\lVert x - y\rVert$ implies that the directional derivative in every direction has norm at most $C$ (to see this, write $y = x + \delta$, divide by $\lVert \delta \rVert$, and let $\delta \to \mathbf{0}$), and hence clearly $\lVert Df(x) \rVert \le C$. For the second direction, the triangle inequality (applied using Riemann sums and taking the limit) gives $$ \lVert f(y) - f(x) \rVert = \left\Vert \int_{\theta = 0}^{1} Df(x_{\theta}) \cdot (y - x) d\theta \right\Vert \le \int_{\theta = 0}^{1} \lVert Df(x_{\theta}) \cdot (y - x) \rVert d\theta \le \int_{\theta = 0}^{1} C \lVert y - x \rVert d\theta = C \lVert x - y \rVert $$

henrikl
  • 108