
Section 4.5 of the textbook *Deep Learning* by Goodfellow, Bengio, and Courville says that the gradient of

$$f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$$

is

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$$

My understanding is that $f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2$ is the square of the Euclidean norm. So we have that

$$\begin{align} f(\mathbf{x}) = \dfrac{1}{2}\|\mathbf{A} \mathbf{x} - \mathbf{b}\|_2^2 &= \dfrac{1}{2} \left( \sqrt{(\mathbf{A} \mathbf{x} - \mathbf{b})^2} \right)^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})^2 \\ &= \dfrac{1}{2} (\mathbf{A} \mathbf{x} - \mathbf{b})(\mathbf{A} \mathbf{x} - \mathbf{b}) \\ &= \dfrac{1}{2} [ (\mathbf{A}\mathbf{x})(\mathbf{A} \mathbf{x}) - (\mathbf{A} \mathbf{x})\mathbf{b} - (\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2 ] \ \ \text{(Since matrix multiplication is distributive.)} \\ &= \dfrac{1}{2} [(\mathbf{A} \mathbf{x})^2 - 2(\mathbf{A} \mathbf{x})\mathbf{b} + \mathbf{b}^2] \ \ \text{(Note: Matrix multiplication is not commutative.)} \end{align}$$

It's at this point that I realised that, since we're working with matrices, I'm not really sure how to take the gradient of this. Taking the gradient of $f(\mathbf{x})$ with respect to $\mathbf{x}$, we get something like

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \dfrac{1}{2} [2 (\mathbf{A} \mathbf{x}) \mathbf{A}] - \dfrac{1}{2}[2(\mathbf{A} \mathbf{A} \mathbf{x})\mathbf{b}]$$

So what is the reasoning that leads us to get $\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}$? Where did the transposed matrices come from?

I would greatly appreciate it if people would please take the time to clarify this.

The Pointer
  • 4,182
  • $Ax-b$ is a vector. In your link the Euclidean norm of a vector is defined. It doesn't match up with what you did. (How is the square of a vector defined?) – ViktorStein Jan 14 '20 at 06:33
  • Have a look at this wiki page, under the subsection "Derivatives with vectors". – ViktorStein Jan 14 '20 at 06:36
  • Also there are a lot of similar questions on this site. Have you had a look at them (on approach0.xyz)? – ViktorStein Jan 14 '20 at 06:37
  • Related: https://math.stackexchange.com/q/3424167/339790 and https://math.stackexchange.com/q/3493984/339790 and https://math.stackexchange.com/q/3420325/339790 – Rodrigo de Azevedo Jan 15 '20 at 10:27

1 Answer


We must take the derivative with finesse, and that means we use the chain rule. Note that $f = g \circ h$, where $h(x) = Ax-b$ and $g(u) = \frac{1}{2} \|u\|^2$. The derivatives of $h$ and $g$ are $h'(x) = A$ and $g'(u) = u^T$. So by the chain rule

$$f'(x) = g'(h(x)) h'(x) = (Ax-b)^T A.$$

The gradient of $f$ is

$$\nabla f(x) = f'(x)^T = A^T(Ax-b).$$
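
As a quick numerical sanity check, one can compare this closed-form gradient with a finite-difference approximation. The sketch below is illustrative only: it assumes NumPy and uses arbitrary small random shapes for $A$, $b$, and $x$.

```python
import numpy as np

# Illustrative shapes: A is m x n, b has length m, x has length n.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

def f(x):
    """f(x) = (1/2) ||Ax - b||^2"""
    r = A @ x - b
    return 0.5 * np.dot(r, r)

# Closed-form gradient: A^T (Ax - b)
grad_closed = A.T @ (A @ x - b)

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    grad_fd[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-5))  # expected: True
```

Note also that the transposes are forced by the shapes: when $A$ is $m \times n$ with $m \neq n$, an expression like $AAx$ is not even defined, whereas $A^T(Ax-b)$ always yields a vector of the same length as $x$.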

littleO
  • 51,938