8

I was following Stephen Boyd's convex optimisation course and came across the following slide:

[Slide: examples of gradients and Hessians, including the quadratic objective $f(x) = \tfrac{1}{2}x^TPx + q^Tx + r$ and the least-squares objective $f(x) = \|Ax-b\|_2^2$]

Can somebody explain to me how the gradient was calculated for the quadratic and least-squares objectives? Is there a general method to find the gradient of a matrix?

humble
    $f$ is not a matrix. It is a real-valued function. It takes in a vector $x$ and spits out the square of the length of some other vector. In theory, you find the gradient the same way you do with any other real-valued function. – Arthur Jul 14 '17 at 06:54
  • http://thousandfold.net/cz/2013/11/12/a-useful-trick-for-computing-gradients-w-r-t-matrix-arguments-with-some-examples/ – venrey Mar 31 '19 at 08:19

3 Answers

6

$f$ is a normal real-valued function. If you want, you can write it componentwise as

$$f(x) = {1\over 2}\sum_j\sum_k p_{jk}x_jx_k + \sum_j q_jx_j + r$$

Now the first double sum contains the $x_jx_k$ term twice if $j\ne k$ (once as $p_{jk}x_jx_k$ and once as $p_{kj}x_kx_j$, and since $P$ is symmetric, $p_{jk}=p_{kj}$), and if $j=k$ it becomes an $x_j^2$ term, so the derivative with respect to $x_j$ becomes:

$$f'_j(x) = \sum_k p_{jk}x_k + q_j$$

In matrix notation this becomes

$$\nabla f(x) = Px + q$$
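If you want a quick numerical sanity check of $\nabla f(x) = Px + q$, here is a minimal NumPy sketch (random $P$, $q$, $r$, $x$ are placeholder data, and $P$ is symmetrized, as the quadratic on the slide assumes) comparing the formula against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
P = rng.standard_normal((n, n))
P = P + P.T                      # symmetrize P, as assumed above
q = rng.standard_normal(n)
r = 1.3
x = rng.standard_normal(n)

def f(x):
    # f(x) = 1/2 x^T P x + q^T x + r
    return 0.5 * x @ P @ x + q @ x + r

grad = P @ x + q                 # the gradient derived above: grad f(x) = Px + q

# central finite-difference approximation of each partial derivative
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad, fd, atol=1e-5))   # True
```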

skyking
  • How would you do the same for the case of least squares objective function in the picture above? Is there a general method to get the answer for any function? – humble Jul 14 '17 at 08:56
  • This does not answer the question, which is about the least-squares objective, i.e. why a transposed matrix appears. – MartinKondor Jan 30 '22 at 08:04
3

I simply would use the Gâteaux derivative. That derivative is the natural extension of the 1D derivative $$\frac{d}{dx}f(x) = \lim_{δx→0}\frac{f(x+δx)-f(x)}{δx}$$ to higher dimensions. Since your function maps $f:ℝ^n→ℝ$, we need an arbitrary direction $δx∈ℝ^n$ and a small increment $ε>0$. Using that "$|_{ε=0}$" formulation, the Gâteaux derivative for your function reads \begin{align*} d(\|Ax-b\|²;[x,δx]) = \Big(\frac{d}{dε}\|A(x+εδx) - b\|²\Big)\Big|_{ε=0}. \end{align*}

First it is \begin{align*} \frac{d}{dε}\|A(x+εδx) - b\|² =& \frac{d}{dε}[(A(x+εδx) - b, A(x+εδx) - b)] \\ =&\frac{d}{dε}[\{(Ax, Ax)+ (Ax,Aεδx) + (Ax, -b)\} \\ &+ \{(Aεδx, Ax) + (Aεδx, Aεδx) + (Aεδx, -b)\} \\ &+ \{(-b, Ax) + (-b, Aεδx) + (-b, -b)\} ] \\ =¹&\frac{d}{dε}[\{\|Ax\|²+ \|b\|²+ 2(Ax, -b)\} \\ &+ ε\{2(Ax,Aδx) + 2(-b, Aδx)\} \\ &+ ε²\|Aδx\|² ]\\ =& \{2(Ax,Aδx) + 2(-b, Aδx)\} + 2ε\|Aδx\|². \end{align*} ¹Sorting by powers of ε.

Setting $ε=0$ yields \begin{align*} \Big(\frac{d}{dε}\|A(x+εδx) - b\|²\Big)\Big|_{ε=0} &= 2(Ax,Aδx) + 2(-b, Aδx) \\ &= 2(Ax-b, Aδx)= (2A^\top[Ax-b], δx). \end{align*}

Hence, the gradient is $∇f(x) = 2A^\top[Ax-b]$.

That is because $∇f = (∂_{e_1}f, ∂_{e_2}f, …)^T$. So replacing $δx$ with $e_i$ gives $$∂_{e_i}f = \big(2A^\top[Ax-b]\big)_i.$$

Higher derivatives can be calculated in the same way: \begin{align*} \frac{d}{dε}\big(2A^\top[A(x+εδx) - b]\big)\Big|_{ε=0} &= (2A^\top Aδx)\big|_{ε=0} \\ &= 2A^\top Aδx \end{align*} $⇒ ∇^2f(x) = 2A^\top A.$
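Both formulas are easy to check numerically. Here is a minimal NumPy sketch (random $A$, $b$, $x$, $δx$ are placeholder data) that compares the Gâteaux derivative with $(2A^\top[Ax-b], δx)$ and the Hessian with $2A^\top A$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
dx = rng.standard_normal(n)              # arbitrary direction dx

def f(x):
    # f(x) = ||Ax - b||^2
    return np.sum((A @ x - b) ** 2)

grad = 2 * A.T @ (A @ x - b)             # grad f(x) = 2 A^T (Ax - b)
hess = 2 * A.T @ A                       # Hessian   = 2 A^T A

eps = 1e-6
# Gateaux derivative (d/de) f(x + e*dx) at e=0, via central differences in e
gateaux = (f(x + eps * dx) - f(x - eps * dx)) / (2 * eps)
print(np.isclose(gateaux, grad @ dx))    # True: it equals (grad f(x), dx)

# finite-difference Hessian: column i is approx. (grad f(x + eps*e_i) - grad f(x)) / eps
fd_hess = np.column_stack([
    (2 * A.T @ (A @ (x + eps * e) - b) - grad) / eps for e in np.eye(n)
])
print(np.allclose(hess, fd_hess, atol=1e-4))   # True
```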

P. Siehr
1

It is common to define $$ \nabla^2 f=\nabla\cdot\nabla f=\sum_{k=1}^N\partial_k^2 f = \Delta f, $$ where $\Delta$ is called the Laplace operator. But that is not the case here.

It seems that here we have $$ \nabla^2f=(\nabla\nabla^T)f=\begin{pmatrix}\partial_1\partial_1f & \partial_1\partial_2f & \cdots &\partial_1\partial_Nf\\\partial_2\partial_1f & \partial_2\partial_2f & \cdots&\partial_2\partial_Nf\\ \vdots & & \ddots & \vdots\\ \partial_N\partial_1f & \cdots & \cdots & \partial_N\partial_Nf \end{pmatrix}=Hess_f $$ where $Hess_f$ is called the Hessian matrix of $f$.

Edit:

It seems that $\nabla^2=\nabla\nabla^T$ is common in optimization, as Surb wrote in the comments.

Therefore it is best to check how the operator is defined if it isn't obvious from the context. Some books have an explanation of their notation at the end.
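To make the difference concrete, here is a minimal NumPy sketch (random symmetric $P$ as placeholder data) for the quadratic $f(x)=\tfrac12 x^TPx+q^Tx+r$ from the first answer: the optimization convention $\nabla^2 f$ gives the Hessian matrix $P$, while the Laplacian $\Delta f$ is only its trace.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
P = rng.standard_normal((n, n))
P = P + P.T                      # symmetric P

# For f(x) = 1/2 x^T P x + q^T x + r, second derivatives depend only on P:
hessian = P                      # Hessian sense of "nabla^2 f": an n-by-n matrix
laplacian = np.trace(P)          # Laplacian sense (nabla . nabla f): a scalar

print(hessian.shape, laplacian)  # (3, 3) versus a single number (= trace of the Hessian)
```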