8

I was following Stephen Boyd's convex optimisation course and came across the following slide:

[Slide: examples of gradients and Hessians, including the quadratic objective $f(x) = \tfrac{1}{2}x^TPx + q^Tx + r$ and the least-squares objective $f(x) = \|Ax-b\|_2^2$]

Can somebody explain to me how the gradient was calculated for the quadratic and least-squares objectives? Is there a general method to find the gradient of a matrix?

humble
    $f$ is not a matrix. It is a real-valued function. It takes in a vector $x$ and spits out the square of the length of some other vector. In theory, you find the gradient the same way you do with any other real-valued function. – Arthur Jul 14 '17 at 06:54
  • http://thousandfold.net/cz/2013/11/12/a-useful-trick-for-computing-gradients-w-r-t-matrix-arguments-with-some-examples/ – venrey Mar 31 '19 at 08:19

3 Answers

6

$f$ is a normal real-valued function. If you want, you can write it componentwise as

$$f(x) = {1\over 2}\sum_j\sum_k p_{jk}x_jx_k + \sum_j q_jx_j + r$$

Now the first double sum contains the $x_jx_k$ term twice if $j\ne k$ (once as $p_{jk}x_jx_k$ and once as $p_{kj}x_kx_j$, and since $P$ is symmetric, $p_{jk}=p_{kj}$), and if $j=k$ it becomes an $x_j^2$ term, so the derivative with respect to $x_j$ becomes:

$$f'_j(x) = \sum_k p_{jk}x_k + q_j$$

In matrix notation this becomes

$$\nabla f(x) = Px + q$$
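If you want a quick numerical sanity check of $\nabla f(x) = Px + q$, here is a minimal NumPy sketch (random $P$, $q$, $r$, $x$ are placeholder data, and $P$ is symmetrized, as the quadratic on the slide assumes) comparing the formula against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
P = rng.standard_normal((n, n))
P = P + P.T                      # symmetrize P, as assumed above
q = rng.standard_normal(n)
r = 1.3
x = rng.standard_normal(n)

def f(x):
    # f(x) = 1/2 x^T P x + q^T x + r
    return 0.5 * x @ P @ x + q @ x + r

grad = P @ x + q                 # the gradient derived above: grad f(x) = Px + q

# central finite-difference approximation of each partial derivative
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad, fd, atol=1e-5))   # True
```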

skyking
  • How would you do the same for the case of least squares objective function in the picture above? Is there a general method to get the answer for any function? – humble Jul 14 '17 at 08:56
  • This does not answer the question, which is about the least-squares objective, i.e. why a transposed matrix appears. – MartinKondor Jan 30 '22 at 08:04
3

I simply would use the Gâteaux derivative. That derivative is the natural extension of the 1D derivative $$\frac{d}{dx}f(x) = \lim_{δx→0}\frac{f(x+δx)-f(x)}{δx}$$ to higher dimensions. Since your function maps $f:ℝ^n→ℝ$, we need an arbitrary direction $δx∈ℝ^n$ and a small increment $ε>0$. Using that "$|_{ε=0}$" formulation, the Gâteaux derivative for your function reads \begin{align*} d(\|Ax-b\|²;[x,δx]) = \Big(\frac{d}{dε}\|A(x+εδx) - b\|²\Big)\Big|_{ε=0}. \end{align*}

First it is \begin{align*} \frac{d}{dε}\|A(x+εδx) - b\|² =& \frac{d}{dε}[(A(x+εδx) - b, A(x+εδx) - b)] \\ =&\frac{d}{dε}[\{(Ax, Ax)+ (Ax,Aεδx) + (Ax, -b)\} \\ &+ \{(Aεδx, Ax) + (Aεδx, Aεδx) + (Aεδx, -b)\} \\ &+ \{(-b, Ax) + (-b, Aεδx) + (-b, -b)\} ] \\ =¹&\frac{d}{dε}[\{\|Ax\|²+ \|b\|²+ 2(Ax, -b)\} \\ &+ ε\{2(Ax,Aδx) + 2(-b, Aδx)\} \\ &+ ε²\|Aδx\|² ]\\ =& \{2(Ax,Aδx) + 2(-b, Aδx)\} + 2ε\|Aδx\|². \end{align*} ¹Sorting by powers of ε.

Setting $ε=0$ yields \begin{align*} \Big(\frac{d}{dε}\|A(x+εδx) - b\|²\Big)\Big|_{ε=0} &= 2(Ax,Aδx) + 2(-b, Aδx) \\ &= 2(Ax-b, Aδx)= (2A^\top[Ax-b], δx). \end{align*}

Hence, the gradient is $∇f(x) = 2A^\top[Ax-b]$.

That is because $∇f = (∂_{e_1}f, ∂_{e_2}f, …)^T$. So replacing $δx$ with $e_i$ gives $$∂_{e_i}f = \big(2A^\top[Ax-b]\big)_i.$$

Higher derivatives can be calculated in the same way: \begin{align*} \frac{d}{dε}\big(2A^\top[A(x+εδx) - b]\big)\Big|_{ε=0} &= (2A^\top Aδx)\big|_{ε=0} \\ &= 2A^\top Aδx \end{align*} $⇒ ∇^2f(x) = 2A^\top A.$
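Both formulas are easy to check numerically. Here is a minimal NumPy sketch (random $A$, $b$, $x$, $δx$ are placeholder data) that compares the Gâteaux derivative with $(2A^\top[Ax-b], δx)$ and the Hessian with $2A^\top A$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
dx = rng.standard_normal(n)              # arbitrary direction dx

def f(x):
    # f(x) = ||Ax - b||^2
    return np.sum((A @ x - b) ** 2)

grad = 2 * A.T @ (A @ x - b)             # grad f(x) = 2 A^T (Ax - b)
hess = 2 * A.T @ A                       # Hessian   = 2 A^T A

eps = 1e-6
# Gateaux derivative (d/de) f(x + e*dx) at e=0, via central differences in e
gateaux = (f(x + eps * dx) - f(x - eps * dx)) / (2 * eps)
print(np.isclose(gateaux, grad @ dx))    # True: it equals (grad f(x), dx)

# finite-difference Hessian: column i is approx. (grad f(x + eps*e_i) - grad f(x)) / eps
fd_hess = np.column_stack([
    (2 * A.T @ (A @ (x + eps * e) - b) - grad) / eps for e in np.eye(n)
])
print(np.allclose(hess, fd_hess, atol=1e-4))   # True
```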

P. Siehr
1

It is common to define $$ \nabla^2 f=\nabla\cdot\nabla f=\sum_{k=1}^N\partial_k^2 f = \Delta f, $$ where $\Delta$ is called the Laplace operator. But that is not the case here.

It seems that here we have $$ \nabla^2f=(\nabla\nabla^T)f=\begin{pmatrix}\partial_1\partial_1f & \partial_1\partial_2f & \cdots &\partial_1\partial_Nf\\\partial_2\partial_1f & \partial_2\partial_2f & \cdots&\partial_2\partial_Nf\\ \vdots & & \ddots & \vdots\\ \partial_N\partial_1f & \cdots & \cdots & \partial_N\partial_Nf \end{pmatrix}=Hess_f $$ where $Hess_f$ is called the Hessian matrix of $f$.

Edit:

It seems that $\nabla^2=\nabla\nabla^T$ is common in optimization, as Surb wrote in the comments.

Therefore it is best to check how the operator is defined if it isn't obvious from the context. Some books have an explanation of their notation at the end.
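To make the difference concrete, here is a minimal NumPy sketch (random symmetric $P$ as placeholder data) for the quadratic $f(x)=\tfrac12 x^TPx+q^Tx+r$ from the first answer: the optimization convention $\nabla^2 f$ gives the Hessian matrix $P$, while the Laplacian $\Delta f$ is only its trace.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
P = rng.standard_normal((n, n))
P = P + P.T                      # symmetric P

# For f(x) = 1/2 x^T P x + q^T x + r, second derivatives depend only on P:
hessian = P                      # Hessian sense of "nabla^2 f": an n-by-n matrix
laplacian = np.trace(P)          # Laplacian sense (nabla . nabla f): a scalar

print(hessian.shape, laplacian)  # (3, 3) versus a single number (= trace of the Hessian)
```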