
Section 4.5 of the textbook *Deep Learning* by Goodfellow, Bengio, and Courville says that the gradient of

$$f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$$

is

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$$

My understanding is that $f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$ is the square of the Euclidean norm. So we have that

$$\begin{align} f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2 &= \dfrac{1}{2} \left( \sqrt{(\mathbf{A} \mathbf{x} - \mathbf{b})^2} \right)^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})(\mathbf{A} \mathbf{x} - \mathbf{b}) \\ &= \dfrac{1}{2} [ (\mathbf{A}\mathbf{x})(\mathbf{A} \mathbf{x}) - (\mathbf{A} \mathbf{x})\mathbf{b} - (\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2 ] \ \ \text{(Since matrix multiplication is distributive.)} \\ &= \dfrac{1}{2} [(\mathbf{A} \mathbf{x})^2 - 2(\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2] \ \ \text{(Note: Matrix multiplication is not commutative.)} \end{align}$$

It's at this point that I realised that, since we're working with matrices, I'm not really sure how to take the gradient of this. Taking the gradient of $f(\mathbf{x})$ with respect to $\mathbf{x}$, we get something like

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \dfrac{1}{2} [2 (\mathbf{A} \mathbf{x}) \mathbf{A}] - \dfrac{1}{2}[2(\mathbf{A} \mathbf{A} \mathbf{x})\mathbf{b}]$$

So what is the reasoning that leads us to get $\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$? Where did the transposed matrices come from?

I would greatly appreciate it if people would please take the time to clarify this.

The Pointer
  • 4,182
  • $Ax-b$ is a vector. In your link the Euclidean norm of a vector is defined. It doesn't match up with what you did. (How is the square of a vector defined?) – ViktorStein Jan 14 '20 at 06:33
  • Have a look at this wiki page, under the subsection "Derivatives with vectors". – ViktorStein Jan 14 '20 at 06:36
  • Also there are a lot of similar questions on this site. Have you had a look at them (on approach0.xyz)? – ViktorStein Jan 14 '20 at 06:37
  • Related: https://math.stackexchange.com/q/3424167/339790 and https://math.stackexchange.com/q/3493984/339790 and https://math.stackexchange.com/q/3420325/339790 – Rodrigo de Azevedo Jan 15 '20 at 10:27

1 Answer


We must take the derivative with finesse, and that means we use the chain rule. Note that $f = g \circ h$, where $h(x) = Ax-b$ and $g(u) = \frac{1}{2} \|u\|^2$. The derivatives of $h$ and $g$ are $h'(x) = A$ and $g'(u) = u^T$. So by the chain rule

$$f'(x) = g'(h(x)) h'(x) = (Ax-b)^T A.$$

The gradient of $f$ is

$$\nabla f(x) = f'(x)^T = A^T(Ax-b).$$
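
As a quick numerical sanity check, one can compare this closed-form gradient with a finite-difference approximation. The sketch below is illustrative only: it assumes NumPy and uses arbitrary small random shapes for $A$, $b$, and $x$.

```python
import numpy as np

# Illustrative shapes: A is m x n, b has length m, x has length n.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

def f(x):
    """f(x) = (1/2) ||Ax - b||^2"""
    r = A @ x - b
    return 0.5 * np.dot(r, r)

# Closed-form gradient: A^T (Ax - b)
grad_closed = A.T @ (A @ x - b)

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad_fd[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-5))  # expected: True
```

Note also that the transposes are forced by the shapes: when $A$ is $m \times n$ with $m \neq n$, an expression like $AAx$ is not even defined, whereas $A^T(Ax-b)$ always yields a vector of the same length as $x$.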

littleO
  • 51,938