2

How can I compute the gradient of $f(x) = ||Ax-b||_{R^{-1}}^2$? I'm also confused about how to compute the gradient of $g(x) = ||y-Ax||$ using the chain rule.

I think the first step in taking the gradient of $g(x)$ would be $\nabla(g(x)) = \frac{d(y-Ax)^\top}{dx}(y-Ax)+ \frac{d(y-Ax)}{dx}(y-Ax)^\top$. However, I'm unsure how to take the derivative of $(y-Ax)$ or the derivative of $(y-Ax)^\top$. Any help is highly appreciated.

  • I derived the gradient in the case $R = I$ here: https://math.stackexchange.com/a/4093964/40119 The derivation can be adapted to your problem by taking $g(u) = ||u||_{R^{-1}}^2$. – littleO Apr 08 '21 at 23:29

3 Answers

2

It is easy to see that $D(||x||^2)(x) = 2x^T$, where $D$ denotes the (total) derivative. The gradient is the transpose of the derivative. Also $D(Ax + b)(x) = A$. By the chain rule, $Df(x) = 2(Ax - b)^TA$. Thus $\nabla f(x) = Df(x)^T = 2A^T(Ax - b)$.
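If it helps, this formula is easy to sanity-check numerically. Here is a minimal numpy sketch (my addition, not part of the original answer) comparing $2A^T(Ax - b)$ against central finite differences on random data:

```python
import numpy as np

# Not in the original answer: finite-difference check of grad f(x) = 2 A^T (Ax - b)
# for f(x) = ||Ax - b||^2, on random data.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)
grad = 2 * A.T @ (A @ x - b)

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-5))  # True
```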

To compute $Dg(x)$, it will be helpful to first compute $D(||x||)(x)$. By the chain rule, \begin{align} D(||x||)(x) &= D(\sqrt{||x||^2})(x) \\ &= \frac{1}{2}(||x||^2)^{-1/2}2x^T \\ &= \frac{1}{||x||}x^T. \end{align} Now the derivative of $g$ is easy to obtain using the chain rule: $Dg(x) = \frac{1}{||y - Ax||}(y - Ax)^T(-A)$. So $\nabla g(x) = -\frac{1}{||y - Ax||}A^T(y - Ax)$.
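The same kind of check works for $\nabla g$ (again my addition, not part of the original answer); note the extra $1/||y - Ax||$ factor compared to the squared norm:

```python
import numpy as np

# Not in the original answer: finite-difference check of
# grad g(x) = -A^T (y - Ax) / ||y - Ax|| for g(x) = ||y - Ax||.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
x = rng.standard_normal(3)

g = lambda x: np.linalg.norm(y - A @ x)
r = y - A @ x
grad = -A.T @ r / np.linalg.norm(r)

eps = 1e-6
fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-5))  # True
```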

Edit: I guess you meant $f(x) = (Ax - b)^TR^{-1}(Ax - b)$. To compute the derivative of $f$, it is convenient to first compute the derivative of $q(x) = x^TBx$, where $B$ is any matrix. For any vector $y$ we have \begin{align} q(x + y) &= (x + y)^TB(x + y) \\ &= (x^T + y^T)(Bx + By) \\ &= x^TBx + x^TBy + y^TBx + y^TBy \\ &= q(x) + x^TBy + (Bx)^Ty + O(|y|^2) \\ &= q(x) + x^T(B + B^T)y + O(|y|^2). \end{align} Hence $Dq(x) = x^T(B + B^T)$. Note that with $B = R^{-1}$ we have $f(x) = q(Ax - b)$. Hence, by the chain rule, $$Df(x) = (Ax - b)^T(R^{-1} + (R^{-1})^T)A.$$ Taking the transpose gives $$\nabla f(x) = A^T((R^{-1})^T + R^{-1})(Ax - b).$$
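One can also verify the weighted formula numerically, using a deliberately nonsymmetric $R$ so that the $(R^{-1})^T + R^{-1}$ symmetrization actually matters (my addition, not part of the original answer):

```python
import numpy as np

# Not in the original answer: check grad f(x) = A^T (R^{-T} + R^{-1}) (Ax - b)
# for f(x) = (Ax - b)^T R^{-1} (Ax - b), with a nonsymmetric invertible R.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
b = rng.standard_normal(4)
R = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # generic, nonsymmetric, invertible
Rinv = np.linalg.inv(R)
x = rng.standard_normal(3)

f = lambda x: (A @ x - b) @ Rinv @ (A @ x - b)
grad = A.T @ (Rinv + Rinv.T) @ (A @ x - b)

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```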

Mason
  • 10,415
  • Thank you! I think for the gradient of $f(x)$ we need to take the $R^{-1}$ into account. If the derivative of $Ax+b$ is $A$, does that mean the gradient of $Ax+b$ is $A^{\top}$? Since $||x||_{2}^{2} = x^{\top}x$, wouldn't the derivative be a scalar, since $x^{\top}x$ results in a $1\times 1$ matrix, in other words a scalar? – Mush Mush Apr 08 '21 at 20:21
  • @MushMush 1. What does the $R^{-1}$ subscript mean? 2. The derivative $Df$ (aka Jacobian) applies for $f \colon \mathbb{R}^n \to \mathbb{R}^m$. At each $x$ in the domain of $f$, $Df(x) \colon \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, which is represented by an $m \times n$ matrix. The gradient only applies when $f \colon \mathbb{R}^n \to \mathbb{R}$, and is defined as $\nabla f(x) := Df(x)^T$. So the gradient of $f(x) = Ax + b$ isn't defined unless $A = v^T$ for some vector $v$ and $b \in \mathbb{R}$, and in this case the gradient is indeed $v$. – Mason Apr 08 '21 at 22:47
  • $||Ax-b||_{R^{-1}}^{2} = (Ax-b)^{\top}R^{-1}(Ax-b)$. $R$ is an $n\times n$ matrix, $A$ is an $n\times m$ matrix, and $b$ is an $n\times 1$ vector. Are you saying it's not possible to find the gradient of this norm? I know the least squares problem is supposed to correspond to normal equations, and I was told that I could find the normal equations that the least squares problem corresponded to by taking the gradient. – Mush Mush Apr 09 '21 at 19:28
  • @MushMush The function you wrote is a scalar valued function, so it has a gradient. I updated the answer to include the computation for $\nabla f$. – Mason Apr 09 '21 at 19:59
0

The gradient of the norm squared is just \begin{align} \nabla||x||^2 &= \nabla(x_1^2+\dots+x_n^2) \\ &= \left(\partial/\partial x_1\,(x_1^2+\dots+x_n^2),\dots,\partial/\partial x_n\,(x_1^2+\dots+x_n^2)\right) \\ &= (2x_1,\dots,2x_n) = 2(x_1,\dots,x_n) = 2x. \end{align}

So substitute $y-Ax$ for $x$, and then apply the chain rule: the inner map $x \mapsto y - Ax$ has Jacobian $-A$, so $\nabla||y-Ax||^2 = -2A^T(y-Ax)$.
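To make the need for that chain-rule factor concrete (my addition, not part of the original answer): without the $-A^T$, the "gradient" does not even have the right dimension whenever $A$ is not square. A minimal numpy sketch:

```python
import numpy as np

# Not in the original answer: why the chain-rule factor -A^T matters.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))  # x lives in R^3, y - Ax lives in R^5
y = rng.standard_normal(5)
x = rng.standard_normal(3)

naive = 2 * (y - A @ x)         # blind substitution: shape (5,), wrong space
grad = -2 * A.T @ (y - A @ x)   # with the -A^T factor: shape (3,), matches x
print(naive.shape, grad.shape)  # (5,) (3,)
```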

0

For a vector $a$,

$$\nabla(a\cdot x)=a$$

because if you expand the dot product and differentiate on a component $x_i$, what remains is the component $a_i$ of $a$.

Now, seeing the matrix $A$ as a stack of rows $a_i^T$ (so that the $i$-th component of $Ax$ is $a_i\cdot x$), applying the identity above componentwise gives $$\nabla(Ax)=A.$$
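As a quick check of this identity (my addition, not part of the original answer), one can build the finite-difference Jacobian of $x \mapsto Ax$ column by column and compare it to $A$:

```python
import numpy as np

# Not in the original answer: the finite-difference Jacobian of x -> Ax equals A.
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

eps = 1e-6
J = np.column_stack([(A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(J, A))  # True
```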

From this,

$$\nabla\|Ax-b\|^2=\nabla((Ax-b)^T(Ax-b)) \\=\nabla((Ax)^TAx-b^T(Ax)-(Ax)^Tb+b^Tb) \\=\nabla(Ax)^TAx+(Ax)^T\nabla(Ax)-2A^Tb \\=A^TAx+A^TAx-2A^Tb \\=2A^T(Ax)-2A^Tb.$$

Here $b^T(Ax) = (Ax)^Tb$ because a scalar equals its own transpose, which accounts for the $-2A^Tb$ term. Likewise, the product-rule term $(Ax)^T\nabla(Ax) = (Ax)^TA$ is a row vector; transposing it to the column $A^TAx$ keeps every term in gradient (column) form.
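To see concretely that $x^TA^TA$ (a row) and $A^TAx$ (a column) are transposes of each other, here is a small numpy illustration with explicit row/column shapes (my addition, not part of the original answer):

```python
import numpy as np

# Not in the original answer: x^T A^T A is the transpose of A^T A x.
rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))
x = rng.standard_normal((3, 1))  # explicit column vector

row = x.T @ A.T @ A  # shape (1, 3)
col = A.T @ A @ x    # shape (3, 1)
print(np.allclose(row, col.T))  # True
```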

  • Thank you! I was wondering how you got that this equation is equal: $\nabla((Ax)^\top Ax-b^\top(Ax)-(Ax)^\top b+b^\top b) =\nabla(Ax)^\top Ax+(Ax)^\top\nabla(Ax)-2A^\top b$. This equation $A^\top Ax+(Ax)^\top A-2A^\top b=2A^\top(Ax)-2A^\top b$ indicates that $x^{\top}A^{\top}A = A^{\top}Ax$, correct? How do you know $x^{\top}A^{\top}A = A^{\top}Ax$? – Mush Mush Apr 08 '21 at 20:13