
I need to prove that the following function is $O(\gamma)$-smooth: $$ f_{\gamma}(w) = \frac{\ln\left(\sum_{i} e^{\gamma(a_{i}^{T}w - b_{i})}\right)}{\gamma}, $$

where $a_{i} \in \mathbb{R}^{d}$, $b_{i} \in \mathbb{R}$, and $\|a_{i}\| \leq 1$, $|b_{i}| \leq 1$.

Denote $h(w,i) = e^{\gamma(a_{i}^{T}w - b_{i})}$.

The gradient is $$ \nabla f_{\gamma}(w) = \frac{\sum_i h(w,i)\, a_{i}}{\sum_i h(w,i)}. $$

Showing directly from the first derivative that $f_{\gamma}$ is $O(\gamma)$-smooth seems hard (it is unclear to me how to extract the vector from the exponential), so I tried computing the Hessian. I obtained the partial derivatives $$ A_{jk} = \frac{\partial^2 f_{\gamma}}{\partial w_{j}\, \partial w_{k}} = \frac{\gamma \sum_{i,d} h(w,i)\, h(w,d)\, a_{ik}\left(a_{ij} - a_{dj}\right)}{\left(\sum_{i} h(w,i)\right)^{2}}. $$
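
For reference, the factor of $\gamma$ in the numerator comes from differentiating $h$, since $$ \frac{\partial h(w,i)}{\partial w_{k}} = \gamma\, h(w,i)\, a_{ik}, $$ and $A_{jk}$ then follows from the quotient rule applied to the $j$-th coordinate of the gradient.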

Now I need to show that $-\beta I \preceq A \preceq \beta I$ where $\beta$ is $O(\gamma)$, but the Hessian is complicated and the bound is too abstract for me to know how to tackle this.

Bar
  • What is the definition of an $\mathcal{O}(\gamma)$-smooth function? Does it suffice to prove that the first derivative at $\gamma=0$ is a smooth function of $w$? – DinosaurEgg Feb 22 '22 at 22:48
  • You can use the chain rule for Hessians: $\nabla^2 f(x) = A^T \nabla^2 h(Ax) A$ when $f = h(Ax)$. – VHarisop Feb 22 '22 at 22:50
  • @DinosaurEgg Please see here for the definition of $\beta$-smoothness: https://www.math.univ-toulouse.fr/~agarivie/sites/default/files/8_optimization.pdf

    By $O(\gamma)$-smooth I mean that the smoothness constant $\beta$ is $\gamma$ up to constants

    – Bar Feb 23 '22 at 11:32
  • @VHarisop Could you please elaborate? – Bar Feb 24 '22 at 18:46
  • 1
    @Bar certainly. I will add an answer in a few hours, if nobody else beats me to it. – VHarisop Feb 24 '22 at 18:57

1 Answer


Your function, $f(w) := \frac{1}{\gamma} \log \sum_{i=1}^n \exp(\gamma (a_i^T w - b_i))$, can be written as the composition of two functions, and thus you can use the chain rule for Hessians to simplify your problem.

In particular, let $$ A = \gamma \begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix}, \quad b = \gamma \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}. $$

Then, it's easy to see that $f = h(Aw - b)$, where $h$ is the function defined as

$$ h: \mathbb{R}^n \to \mathbb{R} \quad \text{with} \quad h(z) = \frac{1}{\gamma} \log \left( \sum_{i=1}^n e^{z_i} \right). $$
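
Indeed, the $i$-th component of $Aw - b$ is $\gamma(a_i^T w - b_i)$, so $$ h(Aw - b) = \frac{1}{\gamma} \log \sum_{i=1}^n e^{\gamma(a_i^T w - b_i)} = f(w). $$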

We now apply the chain rule (for a reference see, e.g., Appendix A.4.3 in Boyd & Vandenberghe, 2004), which reads

$$ \nabla^2 f(w) = A^T \nabla^2 h(Aw - b) A, $$

where $\nabla^2 h(Aw - b)$ is the Hessian of $h$ evaluated at $Aw - b$.

The Hessian of $h$ is known. In particular, denote $v := \mathbf{exp}(z)$, the exponential of the vector $z$ (taken elementwise). Then the Hessian of $h$ evaluated at $z$ is:

$$ \nabla^2 h(z) = \frac{1}{\gamma} \left( \mathbf{diag}\left(\left\{ \frac{v_i}{\sum_{j=1}^n v_j}\right\}_{i=1}^n \right) - \left(\frac{v}{\sum_{j=1}^n v_j}\right) \left(\frac{v}{\sum_{j=1}^n v_j}\right)^T \right). $$
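
If you want to verify this, differentiate $h$ componentwise (recall $v = \mathbf{exp}(z)$, so $\partial v_i / \partial z_i = v_i$):

$$ \frac{\partial h(z)}{\partial z_i} = \frac{1}{\gamma}\, \frac{v_i}{\sum_{j=1}^n v_j}, \qquad \frac{\partial^2 h(z)}{\partial z_i\, \partial z_k} = \frac{1}{\gamma} \left( \delta_{ik}\, \frac{v_i}{\sum_{j=1}^n v_j} - \frac{v_i\, v_k}{\big(\sum_{j=1}^n v_j\big)^2} \right), $$

which is exactly the matrix above.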

Note that this Hessian is the difference of a diagonal matrix and a rank-one matrix. In addition, all elements involved are at most $1$. I'll leave it to you to bound $\| \nabla^2 h(z) \|_{2}$, but once you have this bound you can simply use

$$ \|\nabla^2 f(w) \|_2 \leq \| A \|^2_2 \|\nabla^2 h(Aw - b)\|_2, $$

where $\| X \|_2$ indicates the spectral norm of a matrix $X$.

Can you pick it up from here?
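
If you want a quick numerical sanity check of the chain-rule formula and of the resulting bound, here is a minimal sketch (it assumes `numpy`; the variable names are ad hoc and purely illustrative). It compares the closed-form Hessian $\gamma A_0^T(\mathbf{diag}(p) - pp^T)A_0$, where $A_0$ is the matrix with rows $a_i^T$ (so $A = \gamma A_0$) and $p$ is the vector of softmax weights, against a finite-difference Hessian, and prints the spectral norm next to $\gamma$.

```python
# Illustrative sanity check (assumes numpy): compares the chain-rule Hessian of
#   f(w) = (1/gamma) * log sum_i exp(gamma * (a_i^T w - b_i))
# with a finite-difference Hessian, and prints ||hessian||_2 next to gamma.
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 5, 3, 10.0

A0 = rng.normal(size=(n, d))
A0 /= np.maximum(1.0, np.linalg.norm(A0, axis=1, keepdims=True))  # enforce ||a_i|| <= 1
b = rng.uniform(-1.0, 1.0, size=n)                                # enforce |b_i| <= 1

def f(w):
    z = gamma * (A0 @ w - b)
    m = z.max()                                   # stabilized log-sum-exp
    return (m + np.log(np.exp(z - m).sum())) / gamma

def hess_closed_form(w):
    # gamma * A0^T (diag(p) - p p^T) A0, with p the softmax of gamma*(A0 w - b)
    z = gamma * (A0 @ w - b)
    p = np.exp(z - z.max())
    p /= p.sum()
    return gamma * A0.T @ (np.diag(p) - np.outer(p, p)) @ A0

def hess_finite_diff(w, eps=1e-4):
    # central second-order differences of f
    H = np.zeros((d, d))
    E = np.eye(d)
    for j in range(d):
        for k in range(d):
            H[j, k] = (f(w + eps*E[j] + eps*E[k]) - f(w + eps*E[j] - eps*E[k])
                       - f(w - eps*E[j] + eps*E[k]) + f(w - eps*E[j] - eps*E[k])) / (4 * eps**2)
    return H

w = rng.normal(size=d)
H = hess_closed_form(w)
print("max |closed form - finite diff| :", np.abs(H - hess_finite_diff(w)).max())
print("||hessian||_2 =", np.linalg.norm(H, 2), "   gamma =", gamma)
```

For these data the printed spectral norm should come out no larger than $\gamma$ (up to finite-difference error), which is exactly the kind of bound you are after.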

VHarisop
  • Thanks! I got it up to bounding $\|\nabla^2 h(z)\|_{2}$; it should be a constant times $\frac{1}{\gamma}$.

    The eigenvalues of the diagonal matrix are $\frac{v_{i}}{\sum_j v_j}$, and I think the single nonzero eigenvalue of the rank-1 matrix is $\frac{\sum_{i=1}^n v_{i}^2}{(\sum_{j=1}^n v_{j})^2}$, i.e., its trace. So I can bound $\|\nabla^2 h(z)\|_{2}$ by $\frac{1}{\gamma}$ (see the verification below).

    – Bar Feb 25 '22 at 10:05
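
Indeed, writing $p_i := \frac{v_i}{\sum_{j=1}^n v_j}$ (so that $p_i \ge 0$ and $\sum_{i=1}^n p_i = 1$), for any $x \in \mathbb{R}^n$ one has

$$ 0 \le x^T\big(\mathbf{diag}(p) - p p^T\big)x = \sum_{i=1}^n p_i x_i^2 - \Big(\sum_{i=1}^n p_i x_i\Big)^2 \le \sum_{i=1}^n p_i x_i^2 \le \|x\|_2^2, $$

where the first inequality is Jensen's inequality (or Cauchy–Schwarz). Hence $0 \preceq \nabla^2 h(z) \preceq \frac{1}{\gamma} I$, and in particular $\|\nabla^2 h(z)\|_2 \le \frac{1}{\gamma}$, as claimed in the comment above.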