
Given matrix $A \in \mathbb R^{m \times n}$ and vector $y \in \mathbb R^m$, I want to take the gradient of the following scalar field with respect to $x\in \mathbb R^n$.

$$x \mapsto \big((Ax - y)^T(Ax - y) \big),$$


$\textbf{Attempt}.$ \begin{align} \frac{\partial}{\partial x} \big((Ax - y)^T(Ax - y) \big) &= \frac{\partial}{\partial x} \big( (x^TA^TAx - x^TA^Ty - y^TAx+ y^Ty )\big)\\ &= \frac{\partial}{\partial x}x^TA^TAx - \frac{\partial}{\partial x}x^TA^Ty - \frac{\partial}{\partial x}y^TAx+ \frac{\partial}{\partial x}y^Ty \\ &= 2 A^TAx - A^Ty - y^TA\qquad\,\,\mathbf{(1*)}\\ &= 2 A^TAx - 2A^Ty. \qquad\qquad\,\mathbf{(2*)}\\ \end{align}

$\textbf{Question}.$ There are two expressions above marked by $(*)$. I don't understand the justification in going from $(1*)$ to $(2*)$ (in fact, the dimensions don't make sense...), which makes me think that there is a mistake in $(1*)$. Can someone explain the basics involved in these matrix manipulations?

  • Yes, those dimensions are all messed up. Working on figuring out how transposing interacts with derivatives. $\partial /\partial x$ is a derivative with respect to a column vector. – RobertTheTutor Apr 07 '21 at 19:55
  • This is actually fascinating. I'm writing things out explicitly for a $3 \times 3$ case. The first term is correct: $\partial (x^TA^TAx)/ \partial \vec{x} = 2 A^TAx$ – RobertTheTutor Apr 07 '21 at 20:05
  • By the way, this step shows up in a bunch of simple optimization/least-squares contexts, but the core mathematics involved is always just brushed over...which makes it so that I never know what's going on! Thanks for the help! Edit: Yes, sorry, I should have mentioned more clearly: the first term makes total sense to me...it's the other two that don't make sense to me! Sorry about that. – apologies Apr 07 '21 at 20:07
  • Ok, here's more: $\partial_x (y^T A x) = \partial_x\, x^T (y^TA)^T = (y^T A)^T = A^T y$. Which is what we want! But...I'm still not satisfied with why these steps follow... – apologies Apr 07 '21 at 20:20
  • There are dozens of duplicates of this question. Here is a recent one. – Rodrigo de Azevedo Apr 07 '21 at 21:15
  • You're using the wrong approach. I believe the right approach is to use directional derivatives. – Rodrigo de Azevedo Apr 08 '21 at 12:47

4 Answers


I have explicitly written out some cases by hand; here are the results, starting with the obvious:

$x$ is $n \times 1$, $y$ is $m \times 1$,

$x^T$ is $1 \times n$, $y^T$ is $1 \times m$,

$A$ is $m \times n$, $A^T$ is $n \times m$,

$A^TA$ is $n \times n$, $A^TAx$ is $n \times 1$, and $x^TA^TAx$ is a scalar.

$A^Ty$ is $n \times 1$, $x^TA^Ty$ is a scalar, $y^TA$ is $1 \times n$, $y^TAx$ is a scalar.

Taking the partial derivative of a scalar $s$ with respect to the vector $x$ means creating the column vector $$\begin{bmatrix}\frac{\partial s}{\partial x_1} \\ \frac{\partial s}{\partial x_2} \\ \vdots \\ \frac{\partial s}{\partial x_n}\end{bmatrix}$$

Writing them out explicitly, finding the various scalars, taking their partial derivatives, and recognizing the results, I find $$\partial/\partial x\,(x^TA^TAx) = 2A^TAx,$$ $$\partial/\partial x\, (x^TA^Ty) = A^Ty,$$ which surprised me, and $$\partial/\partial x\, (y^TAx) = (y^TA)^T = A^Ty.$$ So the pieces do match up.

Trying to find general rules, I am using the matrix calculus entry in Wikipedia (denominator layout). Writing out a case, I find that $$\partial/\partial x\, (Ax) = A^T,$$ while $$\partial/\partial x\, (x^TB) = B.$$

Applying those rules gives the last two results immediately, and also $$\partial/\partial x\, (x^TA^TAx) = A^TAx + (x^TA^TA)^T = A^TAx + A^TAx = 2A^TAx,$$ as expected.
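The three gradients above can be checked numerically with central finite differences. A minimal sketch, assuming NumPy is available; `num_grad` is a hypothetical helper defined here, and $A$, $x$, $y$ are random:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

def num_grad(f, x, eps=1e-6):
    # Central finite differences, one coordinate at a time:
    # g[i] ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# d/dx (x^T A^T A x) = 2 A^T A x
assert np.allclose(num_grad(lambda v: v @ A.T @ A @ v, x), 2 * A.T @ A @ x, atol=1e-4)
# d/dx (x^T A^T y) = A^T y
assert np.allclose(num_grad(lambda v: v @ A.T @ y, x), A.T @ y, atol=1e-4)
# d/dx (y^T A x) = (y^T A)^T = A^T y
assert np.allclose(num_grad(lambda v: y @ A @ v, x), A.T @ y, atol=1e-4)
```

Since all three expressions are at most quadratic in $x$, the central difference is exact up to roundoff, so the tolerances are loose.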

RobertTheTutor

Computation of the derivative

Denote $f(x)=(Ax-y)^T(Ax-y)$.

$f(x) = (g \circ h)(x)$ with

$$h: x \mapsto (Ax-y, Ax-y)$$ (where the parentheses denote an ordered pair) and

$$g: (u,v) \mapsto u^T v .$$

$h$ is an affine map whose derivative is given by

$$h^\prime(x)(k) = (Ak, Ak).$$ $g$ is a bilinear map whose derivative is

$$g^\prime(u,v)(k,l)= u^Tl+k^T v$$

Applying the chain rule, you get

$$\begin{aligned} f^\prime(x)(k) &=(Ax-y)^T Ak + (Ak)^T(Ax-y)\\ &=x^T A^TAk-y^TAk+k^TA^TAx-k^TA^Ty \end{aligned}$$

Now, the important thing to notice is that $k^TA^TAx, k^TA^Ty$ are real numbers.

Hence those are equal to their transpose and

$$\begin{aligned} f^\prime(x)(k) &=x^T A^TAk-y^TAk+k^TA^TAx-k^TA^Ty\\ &=x^T A^TAk-y^TAk+x^TA^TAk -y^TAk\\ &=2x^T A^TAk -2y^TAk\\ &=2(x^T A^TA -y^TA)k\\ &=2(A^TAx -A^Ty)^Tk\\ \end{aligned}$$

which means, in terms of matrix calculus, that indeed

$$\frac{\partial f}{\partial x}=2(A^TAx -A^Ty)=2A^T(Ax-y).$$ This is consistent with formula (84) of the Matrix Cookbook, taking $W = I$.

Some comments

A difficulty with doing matrix calculus directly, without keeping track of the underlying derivative as a map, is that you lose track of which side of the product each factor sits on.

As an example, consider the two maps

$$l(x) = x^TA, \, r(x) = Ax$$ where $x \in \mathbb R^n$ and $A \in M_n(\mathbb R)$. The derivatives at $x$ are the maps

$$l^\prime(x)(k) = k^TA, \, r^\prime(x)(k) = Ak.$$ Those maps can't just be written as $A$.

This is related to layout conventions, which you need to take into consideration if you want to use matrix calculus formulas. Personally, I prefer to go back to the derivative definitions and use those with the chain rule rather than using "already cooked" formulas, where layout conventions (that I never remember!) are essential for the formulas to be valid.
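The final formula can be sanity-checked by comparing a directional difference quotient of $f$ against $f^\prime(x)(k)=2(A^TAx - A^Ty)^Tk$. A quick sketch, assuming NumPy is available and using random data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
k = rng.standard_normal(n)   # an arbitrary direction
y = rng.standard_normal(m)

# f(x) = (Ax - y)^T (Ax - y)
f = lambda v: (A @ v - y) @ (A @ v - y)

# Central difference quotient along k approximates f'(x)(k)
t = 1e-6
directional = (f(x + t * k) - f(x - t * k)) / (2 * t)

# Compare against 2(A^T A x - A^T y)^T k from the derivation above
assert np.isclose(directional, 2 * (A.T @ A @ x - A.T @ y) @ k, atol=1e-4)
```

Because $f$ is quadratic, the central difference along $k$ recovers the directional derivative exactly up to roundoff.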


I'm on a mission in life to show how to compute this gradient with finesse. The least squares cost function is $$ f(x) = \| Ax - y \|_2^2. $$ Notice that $f(x) = g(h(x))$ where $h(x) = Ax - y$ and $g(u) = \| u \|_2^2$. The derivatives of $g$ and $h$ are $g'(u) = 2 u^T$ and $h'(x) = A$. By the chain rule, $$ f'(x) = g'(h(x))h'(x) = 2(Ax - y)^T A. $$ If we use the convention that the gradient is a column vector, then $$ \nabla f(x) = f'(x)^T = 2 A^T(Ax - y). $$


Background info: If $F: \mathbb R^n \to \mathbb R^m$ is differentiable at a point $x \in \mathbb R^n$, then $F'(x)$ is an $m \times n$ matrix. The best way to think about this matrix is as follows: $$ \tag{1} F(x + \Delta x) \approx F(x) + F'(x) \Delta x. $$ The approximation is good when $\Delta x$ is small. This approximation is sometimes called "Newton's approximation" (for example in Tao's books Analysis I and II). Newton's approximation is the key idea of calculus.

If $m = 1$, so $F: \mathbb R^n \to \mathbb R$, then $F'(x)$ is a $1 \times n$ matrix (a row vector). In this case, if we use the convention that the gradient is a column vector, then $$ \nabla F(x) = F'(x)^T. $$

Most formulas of calculus can be derived easily by using this approximation. For example, the chain rule can be understood as follows: if $f(x) = g(h(x))$ then \begin{align*} f(x + \Delta x) &= g(h(x + \Delta x)) \\ &\approx g(h(x) + h'(x) \Delta x) \\ &\approx \underbrace{g(h(x))}_{f(x)} + g'(h(x)) h'(x) \Delta x. \end{align*} Comparing this approximation with $f(x + \Delta x) \approx f(x) + f'(x) \Delta x$ reveals (or at least suggests) that $f'(x) = g'(h(x)) h'(x)$.

The derivative of the function $h(x) = Ax - y$ (where $A$ is an $m \times n$ matrix) can also be derived quickly using (1). Notice that $$ h(x + \Delta x) = A(x + \Delta x) - y = \underbrace{Ax - y}_{h(x)} + A \Delta x. $$ Comparing this with $h(x + \Delta x) \approx h(x) + h'(x) \Delta x$ reveals (or at least suggests) that $h'(x) = A$.

Likewise, the derivative of the function $g(u) = \| u \|_2^2$ can be understood using the approximation (1). Notice that $$ g(u + \Delta u) = \|u \|_2^2 + 2 u^T \Delta u + \underbrace{\| \Delta u \|_2^2}_{\text{negligible}} \approx \| u \|_2^2 + 2 u^T \Delta u. $$ Comparing this with the approximation $g(u + \Delta u) \approx g(u) + g'(u) \Delta u$ reveals that $g'(u) = 2 u^T$.
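Newton's approximation (1) can also be illustrated numerically: for $g(u) = \|u\|_2^2$ the remainder $g(u + \Delta u) - g(u) - g'(u)\Delta u$ equals $\|\Delta u\|_2^2$ exactly, so it shrinks quadratically as $\Delta u \to 0$. A small sketch, assuming NumPy is available and using random data:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.standard_normal(4)
du = rng.standard_normal(4)

g = lambda v: v @ v          # g(u) = ||u||_2^2
gprime = lambda v: 2 * v     # g'(u) = 2u^T, stored as a vector

# The remainder in Newton's approximation is exactly ||t*du||^2 here,
# i.e. it scales like t^2 -- quadratic, hence "negligible" for small t.
for t in [1e-1, 1e-2, 1e-3]:
    r = g(u + t * du) - g(u) - gprime(u) @ (t * du)
    assert np.isclose(r, (t * du) @ (t * du))
```

For a general smooth $F$ the remainder is only $o(\|\Delta u\|)$; the quadratic case is special in that the error term is computable in closed form.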

littleO

$ \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} $Rather than expanding the expression as your first step, it is often better to simplify the expression as much as possible before differentiating, and then expand things afterwards.

Towards that end, define a new vector $$w=Ax-y$$ which simplifies the function and makes the differentiation very easy $$\eqalign{ \phi &= w^Tw \\ d{\phi} &= 2w^Tdw = 2w^TA\,dx = (2A^Tw)^Tdx \\ \grad{\phi}{x} &= 2A^Tw = 2A^T(Ax-y) \\ }$$ and completely eliminates any confusion about transposes.
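One consequence of this gradient worth checking numerically: at the least-squares minimizer (e.g. as returned by `np.linalg.lstsq`), $2A^T(Ax-y)$ must vanish, since that is exactly the normal equations. A sketch assuming NumPy and random full-rank data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)

# Least-squares solution minimizes phi(x) = ||Ax - y||^2
x_star, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gradient from the answer above: 2 A^T (A x - y); zero at the minimizer
grad = 2 * A.T @ (A @ x_star - y)
assert np.allclose(grad, 0, atol=1e-10)
```

Setting this gradient to zero and solving gives $A^TAx = A^Ty$, the familiar normal equations.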

lynn