
I have a system $\mathbf A x \approx b$ where the vector $b$ is not actually in the column space of the matrix $\mathbf A$. I want to use a least squares approach to minimize the distance between $\mathbf Ax$ and $b$.

$$\begin{aligned} \min_x \Vert \mathbf Ax - b\Vert ^2 &= (\mathbf Ax-b)^T (\mathbf Ax-b) \\ &= x^T \mathbf A^T \mathbf A x - (\mathbf Ax)^Tb - b^T \mathbf Ax + b^Tb \\ &= x^T \mathbf A^T \mathbf A x - 2b^T \mathbf Ax + b^Tb \end{aligned}$$
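(In the last step I have used the fact that $(\mathbf Ax)^T b$ is a scalar, i.e. a $1 \times 1$ matrix, so it equals its own transpose,

$$(\mathbf Ax)^T b = \left( (\mathbf Ax)^T b \right)^T = b^T \mathbf A x,$$

which is why the two cross terms combine into $-2 b^T \mathbf A x$.)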

I'm struggling to understand how to arrive at

$$0 = 2 \mathbf A^T \mathbf Ax -2 \mathbf A^T b$$
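As a purely numerical sanity check (the matrices below are just a random example, made up only for illustration), the $x$ that solves this equation does agree with NumPy's built-in least-squares solver, so I'm confident the equation itself is right:

```python
import numpy as np

# Small over-determined example: 5 equations, 2 unknowns.
# (These particular numbers are arbitrary; they are only here for illustration.)
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

# Solve the normal equations  A^T A x = A^T b  directly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference least-squares solution from NumPy.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))  # prints True
```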

I understand that some sort of derivative with respect to $x$ has been taken in order to minimize the distance, but what does it mean to take the derivative of a matrix product with respect to a vector? I have never come across such a calculation.

Thanks in advance.

S H
  • 105
  • I think you have a few Ts missing there, mainly in the first term. – ConMan Mar 23 '21 at 22:58
  • Thank you for noticing; I have now fixed this. – S H Mar 23 '21 at 23:04
  • What you are looking at is the derivative of a scalar with respect to a vector: https://m.youtube.com/watch?v=iWxY7VdcSH8 – player100 Mar 23 '21 at 23:14
  • @RodrigodeAzevedo The similar question you posted was useful. I'm still struggling to understand where the '2' in the first term comes from after taking the derivative. – S H Mar 24 '21 at 11:38
  • @SH I would say that it is similar to taking the derivative of a real function $x \mapsto a x^2$. One thing to keep in mind is that a quadratic form can be written in terms of a symmetric matrix, and that the skew-symmetric part contributes nothing to the quadratic form. Take a look at my answer to the linked question. There are 70 questions also linking to that question; if you search that list, you may find other good material. – Rodrigo de Azevedo Mar 24 '21 at 11:40

1 Answer


Yes, you can indeed perform matrix calculus, which is exactly what it sounds like. In this case we're differentiating a scalar value (the squared norm is just a number) with respect to the vector $x$, and the way you do that is surprisingly simple: you differentiate with respect to each of the vector's components and collect the results into a vector. In other words, if $x = \left( x_1, x_2, \ldots, x_n \right)$ then $\frac{\partial}{\partial x} = \left(\frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \ldots, \frac{\partial}{\partial x_n} \right)$.
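If you want to see that component-by-component picture concretely, here is a small NumPy sketch (the matrices and sizes are arbitrary, purely for illustration) that builds the gradient of $f(x) = \Vert \mathbf Ax - b\Vert^2$ one component at a time by finite differences and compares it with the closed form $2\mathbf A^T \mathbf A x - 2\mathbf A^T b$ that you're trying to reach:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))   # arbitrary example matrix, not from the question
b = rng.normal(size=6)
x = rng.normal(size=3)        # point at which we evaluate the gradient

def f(x):
    # The scalar being minimized: ||Ax - b||^2
    r = A @ x - b
    return r @ r

# "Differentiate by each of the vector's components, and make a vector out of that":
# central finite differences along each coordinate direction e_i.
eps = 1e-6
grad_componentwise = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

# Closed form that the identities below give: 2 A^T A x - 2 A^T b
grad_closed_form = 2 * A.T @ A @ x - 2 * A.T @ b

print(np.allclose(grad_componentwise, grad_closed_form))  # prints True
```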

It might not be entirely clear how that's going to work when you've got matrix operations happening under the hood, but because matrices and derivatives are both linear operators (i.e. they behave nicely when you add things together or multiply them by constants), things mostly follow sensible rules. In particular, if you go to the "Identities" section of the Wikipedia article on matrix calculus, you'll find the bit you're looking for under "Scalar-by-vector identities":

When $\mathbf{A}$ is not a function of $\mathbf{x}$ and $\mathbf{A}$ is symmetric: $\frac{\partial \mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{x}^\top\mathbf{A}$.
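In your problem the quadratic term is $\mathbf x^T (\mathbf A^T \mathbf A) \mathbf x$, and $\mathbf A^T \mathbf A$ is symmetric, so that identity applies directly; the linear term uses the simpler identity from the same section, $\frac{\partial\, (b^T \mathbf A \mathbf x)}{\partial \mathbf x} = b^T \mathbf A$, and the constant $b^T b$ differentiates to zero. Applying all of that to your expansion gives

$$\frac{\partial}{\partial \mathbf x} \left( \mathbf x^T \mathbf A^T \mathbf A \mathbf x - 2 b^T \mathbf A \mathbf x + b^T b \right) = 2 \mathbf x^T \mathbf A^T \mathbf A - 2 b^T \mathbf A,$$

and setting that to zero and transposing gives exactly $2 \mathbf A^T \mathbf A \mathbf x - 2 \mathbf A^T b = 0$.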

ConMan
  • 24,300