A dual norm optimization problem

Question

I'm reading this machine learning optimization paper https://arxiv.org/pdf/2010.01412.pdf. At the last formula of page 3, they derived an optimization problem like this:

${\bf{\epsilon^*(w)}} = \underset{||\bf{\epsilon}||_p \leq\rho}{\operatorname{argmax}} \bf{\epsilon^{T}\nabla_{w}L_s(w)}$ (1)

They said this is a classical dual norm problem and the solution is:

$\bf{\hat\epsilon(w) = \rho sign(\nabla_{w}L_s(w))}|\nabla_{w}L_s(w)|^{q-1}/(||\nabla_{w}L_s(w)||_q^q)^{\frac{1}{p}}$ (2)

where $\frac{1}{p}+\frac{1}{q} = 1$ and $|\cdot|^{q-1}$ denotes elementwise absolute value and power.

Can anyone please show me how to solve optimization problem (1) to arrive at formula (2)? I would really appreciate it.

score 2 · Accepted Answer · answered Aug 30 '22 at 04:59

First off, to reduce unnecessary clutter, let $$ x=\nabla_{\mathbf{w}}L_s(\mathbf{w}) $$ and $$ \hat{\epsilon}=\hat{\epsilon}(\mathbf{w})\ . $$ Then equation $(2)$ becomes $$ \hat{\epsilon}=\frac{\rho\,\mathbf{sign}(x)|x|^{q-1}}{\|x\|_q^\frac{q}{p}}\ . $$

Immediately before their equation $(2)$, the authors of your cited paper note that "$\ |\cdot|^{q-1}\ $ denotes elementwise absolute value and power". Although they don't say so explicitly, the same elementwise interpretation must be applied to the function $\ \mathbf{sign}(\cdot)\ $. The equation can therefore be written componentwise as $$ \hat{\epsilon}_i=\frac{\rho\,\mathbf{sign}(x_i)|x_i|^{q-1}}{\|x\|_q^\frac{q}{p}}\ . $$

With $\ \hat{\epsilon}\ $ thus defined, a little algebraic manipulation, making liberal use of the identity $\ p+q=pq\ $, gives $$ \|\hat{\epsilon}\|_p=\rho $$ and $$ \hat{\epsilon}^\mathbf{T}x=\rho\|x\|_q\ . $$

But Hölder's inequality tells us that $$ \epsilon^\mathbf{T}x\le\|\epsilon\|_p\|x\|_q\le \rho\|x\|_q $$ for every $\ \epsilon\ $ with $\ \|\epsilon\|_p\le\rho\ $. Thus, on the closed ball $\ \big\{\,\epsilon\,\big|\,\|\epsilon\|_p\le\rho\big\}\ $, the linear function $\ \epsilon^\mathbf{T}x\ $ of $\ \epsilon\ $ is bounded above by $\ \rho\|x\|_q\ $, and achieves that bound for $\ \epsilon=\hat{\epsilon}\ $. It follows that $$ \hat{\epsilon}=\arg\max_{\|\epsilon\|_p\le\rho}\epsilon^\mathbf{T}x\ . $$
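
The "little algebraic manipulation" can be written out as follows. This is only a sketch, and it assumes $\ 1<p<\infty\ $ and $\ x\ne0\ $ (so that $\ q\ $ is finite and the denominator is nonzero), which the answer does not state explicitly. From $\ p+q=pq\ $ we get $\ (q-1)p=q\ $ and $\ q-\frac{q}{p}=1\ $. Then $$ \|\hat{\epsilon}\|_p^p=\sum_i|\hat{\epsilon}_i|^p=\frac{\rho^p\sum_i|x_i|^{(q-1)p}}{\|x\|_q^{q}}=\frac{\rho^p\,\|x\|_q^q}{\|x\|_q^q}=\rho^p\ , $$ so $\ \|\hat{\epsilon}\|_p=\rho\ $, and, using $\ \mathbf{sign}(x_i)\,x_i=|x_i|\ $, $$ \hat{\epsilon}^\mathbf{T}x=\frac{\rho\sum_i\mathbf{sign}(x_i)\,x_i\,|x_i|^{q-1}}{\|x\|_q^\frac{q}{p}}=\frac{\rho\sum_i|x_i|^{q}}{\|x\|_q^\frac{q}{p}}=\rho\,\|x\|_q^{\,q-\frac{q}{p}}=\rho\|x\|_q\ . $$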

Thank you! It's a great proof. It took me some writing to fully understand what you showed. I really appreciate it. I have to say the solution they wrote looks a bit daunting with the sign(x), but I guess it doesn't matter when multiplying epsilon_hat with x :) — Việt Nguyễn, Aug 30 '22 at 09:15
Thank you, but there's not really much of anything original in the proof. If you look up any proof that an $\ L_q\ $ space is the dual of an $L_p$ space, it will follow much the same lines. — lonza leggiera, Aug 30 '22 at 10:48
Since lonza only showed how to verify that this is a solution and not how to actually derive it: the derivation can be done using the fact that we have equality in Hölder's inequality iff $|\hat \epsilon_i|^p = |x_i|^q \frac{\|\hat \epsilon\|_{p}^p}{\|x\|_{q}^q}$ for every component $i$. This can be solved for $|\hat \epsilon_i|$, which gives the absolute values of the components of $\hat \epsilon$. The $\operatorname{sign}(x_i)$ factor follows because Hölder's inequality bounds the sum of absolute values $\sum_i|\epsilon_i x_i|$; giving each $\hat \epsilon_i$ the sign of $x_i$ ensures that the scalar product equals that sum. — crush3dice, Oct 04 '22 at 09:03
The fact that equality holds in Hölder's inequality exactly when this condition holds is nicely shown here https://math.stackexchange.com/questions/87636/on-the-equality-case-of-the-h%C3%B6lder-and-minkowski-inequalities . — crush3dice, Oct 04 '22 at 09:04
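
For readers who want a numerical sanity check of the closed form, here is a minimal sketch in plain Python. It is not taken from the paper or from the answer: the helper names `lp_norm`, `dot`, `dual_norm_argmax` and `random_feasible` are hypothetical, the values of `rho`, `p` and the vector `x` are arbitrary, and the code assumes $1<p<\infty$ and $x\ne0$. It evaluates formula (2), then compares the attained value $\hat\epsilon^{T}x$ with the Hölder bound $\rho\|x\|_q$ and with randomly sampled feasible points.

```python
import math
import random


def lp_norm(v, p):
    """l_p norm of a list of floats, for finite p >= 1."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(s * t for s, t in zip(a, b))


def dual_norm_argmax(x, rho, p):
    """Evaluate formula (2): candidate maximizer of eps^T x over ||eps||_p <= rho.

    Assumes 1 < p < inf and x != 0 (assumptions of this sketch).
    """
    q = p / (p - 1.0)                 # conjugate exponent, 1/p + 1/q = 1
    scale = lp_norm(x, q) ** (q / p)  # denominator ||x||_q^(q/p)
    return [rho * math.copysign(1.0, xi) * abs(xi) ** (q - 1.0) / scale
            for xi in x]


def random_feasible(n, rho, p):
    """Random point inside the l_p ball of radius rho (not uniformly distributed)."""
    e = [random.gauss(0.0, 1.0) for _ in range(n)]
    r = rho * random.random() / lp_norm(e, p)
    return [r * t for t in e]


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(5)]  # stands in for the gradient
rho, p = 0.05, 3.0
q = p / (p - 1.0)

eps_hat = dual_norm_argmax(x, rho, p)
bound = rho * lp_norm(x, q)                      # Hoelder upper bound
best_sampled = max(dot(random_feasible(len(x), rho, p), x)
                   for _ in range(2000))

print("||eps_hat||_p =", lp_norm(eps_hat, p), " rho =", rho)
print("eps_hat^T x   =", dot(eps_hat, x), " bound =", bound)
print("best sampled  =", best_sampled)
```

In the special case $p=q=2$ the formula reduces to $\hat\epsilon=\rho\,x/\|x\|_2$, which is an easy case to check by hand. Random sampling only illustrates the bound; the actual proof is the Hölder argument in the answer.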
