30

I am watching the following video lecture:

https://www.youtube.com/watch?v=G_p4QJrjdOw

In there, he talks about calculating the gradient of $ x^{T}Ax $, and he does that using the concept of the exterior derivative. The proof goes as follows:

  1. $ y = x^{T}Ax$
  2. $ dy = dx^{T}Ax + x^{T}Adx = x^{T}(A+A^{T})dx$ (using trace property of matrices)
  3. $ dy = (\nabla y)^{T} dx $ and because the rule is true for all $dx$
  4. $ \nabla y = x^{T}(A+A^{T})$

It seems that in step 2, some form of product rule for differentials is applied. I am familiar with the product rule from single-variable calculus, but I do not understand how the product rule was applied to a multivariate function expressed in matrix form.

It would be great if somebody could point me to a mathematical theorem that allows Step 2 in the above proof.

Thanks! Ajay

user855
  • 461
  • Step 2 is using product rule rather than chain rule. – Shuhao Cao Sep 03 '13 at 05:16
  • Thanks. It's the product rule indeed. But where can I find a proof of the product rule for multivariable functions using differentials? I am not finding http://en.wikipedia.org/wiki/Product_rule convincing because it uses the term differential very loosely. Is there a proof of the product rule using differential-form properties? – user855 Sep 03 '13 at 05:40
  • 2
    $d(fg)=\sum_{i=1}^{n}\frac{\partial(fg)}{\partial x_{i}}dx_{i}=\sum_{i=1}^{n}\left(\frac{\partial f}{\partial x_{i}}g+f\frac{\partial g}{\partial x_{i}}\right)dx_{i}=\left(\sum_{i=1}^{n}\frac{\partial f}{\partial x_{i}}dx_{i}\right)g+f\left(\sum_{i=1}^{n}\frac{\partial g}{\partial x_{i}}dx_{i}\right)=g\,df+f\,dg$ – user71352 Sep 03 '13 at 05:56
  • http://thousandfold.net/cz/2013/11/12/a-useful-trick-for-computing-gradients-w-r-t-matrix-arguments-with-some-examples/ – venrey Mar 31 '19 at 08:20
  • Related: https://math.stackexchange.com/q/222894/339790 – Rodrigo de Azevedo Sep 15 '19 at 16:43
  • Check out the resources on matrix differentials listed at the bottom of this question: https://math.stackexchange.com/questions/4105372/clean-simple-way-of-computing-nabla-fu-u-mathsft-with-respect-to-u/4105679#4105679 – kanso37 Apr 18 '21 at 03:31

7 Answers

22

\begin{align*} dy & = d(x^{T}Ax) = d(Ax\cdot x) = d\left(\sum_{i=1}^{n}(Ax)_{i}x_{i}\right) \\ & = d \left(\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i,j}x_{j}x_{i}\right) =\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i,j}x_{i}\,dx_{j}+\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i,j}x_{j}\,dx_{i} \\ & =\sum_{i=1}^{n}(Ax)_{i}\,dx_{i}+\sum_{i=1}^{n}(A\,dx)_{i}\,x_{i} =(dx)^{T}Ax+x^{T}A\,dx \\ & =(dx)^{T}Ax+(dx)^{T}A^{T}x =(dx)^{T}(A+A^{T})x. \end{align*}
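
For a quick numerical sanity check of the final identity $dy=(dx)^{T}(A+A^{T})x$, here is a minimal NumPy sketch; the dimension $n$, the random seed, and the scale of the perturbation $dx$ are arbitrary choices made only for illustration.

```python
import numpy as np

# Minimal sketch: check dy ≈ (dx)^T (A + A^T) x for a small perturbation dx of x.
# The dimension n, the seed, and the scale of dx are arbitrary illustration choices.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
dx = 1e-6 * rng.standard_normal(n)

y = x @ A @ x                                # y = x^T A x
y_perturbed = (x + dx) @ A @ (x + dx)

dy_actual = y_perturbed - y                  # actual change in y
dy_linear = dx @ (A + A.T) @ x               # first-order term (dx)^T (A + A^T) x

print(dy_actual, dy_linear)                  # agree up to O(||dx||^2)
```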

ViktorStein
  • 4,838
user71352
  • 13,038
  • The lecture says that using the properties of the exterior derivative, we can avoid representing it as "sums", and this simplifies the whole calculation – user855 Sep 03 '13 at 05:46
  • The above comments by others indicate that he applied the product rule for differentials. Unfortunately, I am unable to find a proof of the product rule for differentials. Any help there would be great. – user855 Sep 03 '13 at 05:47
  • I wonder why $$d(x^{T}Ax)=d(Ax\cdot x)$$ – Allan Ruin Nov 21 '15 at 01:38
  • 12
    @AllanRuin I know this is rather late but: by definition $x^T Ax$ is exactly the same as taking the dot product of $x$ with $Ax$. This can be written as $x^TAx = x \cdot Ax = Ax\cdot x$. – NoseKnowsAll Oct 13 '16 at 06:20
17

Step 2 might be the result of a simple computation. Consider $u(x)=x^TAx$, then $$ u(x+h)=(x+h)^TA(x+h)=x^TAx+h^TAx+x^TAh+h^TAh, $$ that is, $u(x+h)=u(x)+x^T(A+A^T)h+r_x(h)$ where $r_x(h)=h^TAh$ (this uses the fact that $h^TAx=x^TA^Th$, which holds because $m=h^TAx$ is a $1\times1$ matrix hence $m^T=m$).

One sees that $r_x(h)=o(\|h\|)$ when $h\to0$. This proves that the differential of $u$ at $x$ is the linear function $\nabla u(x):\mathbb R^n\to\mathbb R$, $h\mapsto x^T(A+A^T)h$, which can be identified with the unique vector $z$ such that $\nabla u(x)(h)=z^Th$ for every $h$ in $\mathbb R^n$, that is, $z=(A+A^T)x$.
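
As a numerical illustration of this expansion, the following minimal NumPy sketch checks that $u(x+h)-u(x)$ splits exactly into the linear term $x^T(A+A^T)h$ plus the remainder $r_x(h)=h^TAh$; the dimension and random seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch: u(x + h) - u(x) equals the linear term x^T (A + A^T) h
# plus the remainder r_x(h) = h^T A h, exactly (no approximation).
# The dimension and seed are arbitrary illustration choices.
rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
h = rng.standard_normal(n)

u = lambda v: v @ A @ v
linear_term = x @ (A + A.T) @ h
remainder = h @ A @ h

print(np.isclose(u(x + h) - u(x), linear_term + remainder))   # True
```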

Did
  • 279,727
10

Here's a method which calculates the gradient of $x^TAx$ without using the exterior derivative. I know that this is not what you are after, but it is worth noting how to prove it without the exterior derivative. This also allows for comparison with the exterior derivative method to see how much easier it is.

Let $A$ be $n\times n$, $A = [a_{ij}]$. If $x \in \mathbb{R}^n$, $x = (x_1, \dots, x_n)^T$, then $y = \displaystyle\sum_{i=1}^n\sum_{j=1}^na_{ij}x_ix_j$.

Then we have

\begin{align*} \dfrac{\partial y}{\partial x_k} &= \sum_{i\neq k}\dfrac{\partial}{\partial x_k}\left(\sum_{j=1}^na_{ij}x_ix_j\right) + \dfrac{\partial}{\partial x_k}\left(\sum_{j=1}^na_{kj}x_kx_j\right)\\ &=\sum_{i\neq k}\left(\dfrac{\partial}{\partial x_k}\left(\sum_{j\neq k}a_{ij}x_ix_j\right) + \dfrac{\partial}{\partial x_k}(a_{ik}x_ix_k)\right) + \sum_{j\neq k}\dfrac{\partial}{\partial x_k}(a_{kj}x_kx_j) + \dfrac{\partial}{\partial x_k}(a_{kk}x_k^2)\\ &= \sum_{i\neq k}a_{ik}x_i + \sum_{j\neq k}a_{kj}x_j + 2a_{kk}x_k\\ &= \sum_{i = 1}^na_{ik}x_i + \sum_{j=1}^na_{kj}x_j\\ &= (x^TA)_k + (Ax)_k \end{align*}

where $(x^TA)_k$ is the $k^{\text{th}}$ component of the row vector $x^TA$ and $(Ax)_k$ is the $k^{\text{th}}$ component of the column vector $Ax$. By taking the transpose of $Ax$ we obtain the row vector $x^TA^T$ which has the same $k^{\text{th}}$ component as $Ax$ does. Therefore $\dfrac{\partial y}{\partial x_k} = (x^TA)_k + (x^TA^T)_k$. Therefore

$$\nabla y = x^TA + x^TA^T = x^T(A + A^T).$$
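
A minimal NumPy sketch comparing this componentwise formula with central finite differences; the dimension, step size, and random seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch: compare dy/dx_k = (x^T A)_k + (x^T A^T)_k with central
# finite differences. The dimension, step size, and seed are arbitrary.
rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

grad_formula = x @ A + x @ A.T               # the row vector x^T (A + A^T)

grad_fd = np.empty(n)
for k in range(n):
    e = np.zeros(n)
    e[k] = eps
    grad_fd[k] = ((x + e) @ A @ (x + e) - (x - e) @ A @ (x - e)) / (2 * eps)

print(np.allclose(grad_formula, grad_fd, atol=1e-4))   # True
```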

4

Another approach is to use a multivariable product rule. Suppose $g$ and $h$ are differentiable functions from $\mathbb R^n$ to $\mathbb R^m$, and $f(x) = \langle g(x), h(x) \rangle$ for all $x \in \mathbb R^n$. Then if $\Delta x \in \mathbb R^n$ is small we have \begin{align} f(x+\Delta x) &= \langle g(x+\Delta x), h(x+\Delta x) \rangle\\ &\approx \langle g(x) + g'(x) \Delta x, h(x) + h'(x) \Delta x \rangle \\ &= \langle g(x), h(x) \rangle + \langle h(x), g'(x) \Delta x \rangle + \langle g(x), h'(x) \Delta x \rangle + \text{small term} \\ &\approx f(x) + \langle g'(x)^T h(x),\Delta x \rangle + \langle h'(x)^T g(x), \Delta x \rangle \\ &= f(x) + \langle g'(x)^T h(x) + h'(x)^T g(x),\Delta x \rangle. \end{align} This suggests that \begin{equation} \nabla f(x) = g'(x)^T h(x) + h'(x)^T g(x). \end{equation} This is our multivariable product rule. (This derivation could be made into a rigorous proof by keeping track of error terms.)

In the case where $g(x) = x$ and $h(x) = Ax$, we see that \begin{align} \nabla f(x) &= Ax + A^T x \\ &= (A + A^T) x. \end{align}
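
Here is a minimal NumPy sketch of this special case, assuming $g(x)=x$ and $h(x)=Ax$ so that $g'(x)=I$ and $h'(x)=A$; the dimension, step size, and random seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch of the product rule with g(x) = x and h(x) = Ax, so g'(x) = I
# and h'(x) = A, giving grad f(x) = I^T (Ax) + A^T x = (A + A^T) x.
# The dimension, step size, and seed are arbitrary illustration choices.
rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda v: v @ (A @ v)                    # f(x) = <g(x), h(x)> = x^T A x
grad_product_rule = np.eye(n).T @ (A @ x) + A.T @ x

# Central finite-difference check of the gradient.
eps = 1e-6
grad_fd = np.array([(f(x + eps * np.eye(n)[k]) - f(x - eps * np.eye(n)[k])) / (2 * eps)
                    for k in range(n)])

print(np.allclose(grad_product_rule, grad_fd, atol=1e-4))   # True
```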

(Edit) Explanation of notation:

Let $f:\mathbb R^n \to \mathbb R^m$ be differentiable at $x \in \mathbb R^n$. Then $f'(x)$ is the $m \times n$ matrix defined informally by \begin{equation} f(\underbrace{x}_{n \times 1} + \underbrace{\Delta x}_{n \times 1}) \approx \underbrace{f(x)}_{m \times 1} + \underbrace{f'(x)}_{m \times n} \underbrace{\Delta x}_{n \times 1}. \end{equation} (The approximation is good when $\Delta x$ is small.)

When $f:\mathbb R^n \to \mathbb R$, $f'(x)$ is a $1 \times n$ row vector. The gradient of $f$ at $x$, defined by $\nabla f(x) = f'(x)^T$, is an $n \times 1$ column vector. Notice that, in this case, \begin{equation} f(x + \Delta x) \approx f(x) + \underbrace{f'(x)}_{1 \times n} \underbrace{\Delta x}_{n \times 1} = f(x) + \langle \nabla f(x), \Delta x \rangle. \end{equation}

littleO
  • 51,938
  • I have some difficulty understanding the derivation below the row that ends with "+ small term" in the equation array; could you give me some hints? Thank you! – craftsman.don Oct 27 '13 at 13:34
  • 1
    I am confused by the ' in $$\nabla f(x) = g'(x)^T h(x) + h'(x)^T g(x)$$, is it $$ ' = \nabla $$ ? – Allan Ruin Nov 21 '15 at 01:58
  • @AllanRuin I added an explanation of notation at the end, hopefully it helps. – littleO Nov 21 '15 at 09:37
  • can you clarify what $f$ is? Also is $x$ a row or a column vector? Same with $x^\top$ is that a row or a column vector? What are the dimensions of $A$? – Charlie Parker May 09 '19 at 16:26
  • @CharlieParker $x$ is an $n \times 1$ column vector, $x^T$ is a $1 \times n$ row vector, $A$ is an $n \times n$ matrix, and $f(x) = x^T A x = \langle x, Ax \rangle$. – littleO Apr 19 '21 at 20:12
3

The exterior derivative plays no role here. How could a student understand such a proof! "Did" gave a good answer.

The gradient $\nabla(f)$ of a function $f: E\rightarrow \mathbb{R}$ is defined, with respect to a dot product $\langle \cdot, \cdot \rangle$ on the vector space $E$, by the formula $$ \langle \nabla(f)(x), h \rangle = Df_x(h), $$ where $Df_x$ is the derivative of $f$ at $x$.

Example 1: Let $f:x\in\mathbb{R}^n \rightarrow x^TAx\in\mathbb{R}$. Then $Df_x(h)=h^TAx+x^TAh=x^T(A+A^T)h$ (it's the derivative of a non-commutative product!); we consider the dot product $\langle u, v \rangle = u^Tv$.

Thus, $Df_x(h)= \langle ((A+A^T)x), h \rangle $ and $\nabla(f)(x)=(A+A^T)x$, that is $\nabla(f)=A+A^T$.

Example 2: Let $f:X\in\mathcal{M}_n(\mathbb{R}) \rightarrow \operatorname{Trace}(X^TAX)\in\mathbb{R}$, where $\mathcal{M}_n(\mathbb{R})$ is the set of all $n \times n$ matrices over $\mathbb{R}$.

Since Trace is a linear function, we have $$ Df_X(H) = \operatorname{Trace}(H^TAX+X^TAH) = \operatorname{Trace}(X^T(A+A^T)H); $$ we consider the dot product $\langle U,V \rangle = \operatorname{Trace}(U^TV)$.

Thus, $Df_X(H)=\langle ((A+A^T)X), H \rangle$ and $\nabla(f)(X)=(A+A^T)X$, that is $\nabla(f)=(A+A^T)\otimes I$. (Kronecker product).
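
A minimal NumPy sketch of Example 2, checking $\nabla(f)(X)=(A+A^T)X$ entrywise against finite differences; the size, step, and random seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch for Example 2: f(X) = Trace(X^T A X) has gradient (A + A^T) X
# with respect to <U, V> = Trace(U^T V). The size, step, and seed are arbitrary.
rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))

f = lambda M: np.trace(M.T @ A @ M)
grad_formula = (A + A.T) @ X

grad_fd = np.empty((n, n))                   # entrywise central finite differences
eps = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(grad_formula, grad_fd, atol=1e-4))   # True
```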

Example 3 (more difficult): Let $f:X\in\mathcal{M}_n(\mathbb{R}) \rightarrow \det(X)\in\mathbb{R}$.

Then we have $$ Df_X(H) = \operatorname{Trace}(\operatorname{adjoint}(X)H) = \langle \operatorname{adjoint}(X)^T, H \rangle \quad \text{and} \quad \nabla(f)(X) = \operatorname{adjoint}(X)^T. $$
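
A minimal NumPy sketch of Example 3 (Jacobi's formula), using the fact that $\operatorname{adjoint}(X)^T=\det(X)\,X^{-T}$ for invertible $X$; the size, step, and random seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch for Example 3: for invertible X, adjoint(X)^T = det(X) * X^{-T},
# so the gradient of det at X is det(X) * inv(X).T. Size, step, and seed are
# arbitrary; a random X is invertible with probability 1.
rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n))

grad_formula = np.linalg.det(X) * np.linalg.inv(X).T

grad_fd = np.empty((n, n))                   # entrywise central finite differences
eps = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        grad_fd[i, j] = (np.linalg.det(X + E) - np.linalg.det(X - E)) / (2 * eps)

print(np.allclose(grad_formula, grad_fd, atol=1e-4))   # True
```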

Example 4: Let $f:X\in\mathcal{M}_n(\mathbb{R}) \rightarrow X^TAX\in\mathcal{M}_n(\mathbb{R})$.

Then we have $Df_X(H)=H^TAX+X^TAH$. Here the gradient of $f$ does not exist. In a pinch, we can define $n^2$ gradients, the $\nabla(f_{i,j})$ (componentwise), but these functions have no geometric meaning.

ViktorStein
  • 4,838
2

I think it's pretty straightforward. First consider $x^TAx = (x^T)(Ax)$, so that you now have $x^TAx$ as a product of two functions.

And it's easy to see that $\dot{(x^T)} = \dot x^T$ and $\dot{(Ax)} = A\dot x$.

So now you have $d(x^TAx) = (dx)^TAx + x^TA(dx)$, which is essentially what you want.

hzh
  • 290
0

First consider the two ways an unknown vector can be dotted with a constant vector to get a scalar $k$.

$$k = \vec x^T\vec a = \vec a^T\vec x = [x_1a_1 +x_2a_2 + ... + x_na_n]_{1\times 1}$$

Define the derivative of a scalar w.r.t. a vector in the most natural manner possible.

$$\frac{dk}{d\vec x} = \left [\frac{dk}{dx_1}, \frac{dk}{dx_2}, \cdots, \frac{dk}{dx_n} \right ] = \vec a^T_{1 \times n}$$

Now apply the above to find the gradient/derivative of $\vec x^TA\vec x$. Apply the product rule, noting that the terms in parentheses are treated as constants when differentiating. $$\frac{d(\vec x^TA\vec x)}{d\vec x} = \frac{d((\vec x^TA)_{1\times n}\vec x_{n \times 1})}{d\vec x} + \frac{d(\vec x^T(A\vec x))}{d\vec x}$$

Note how $\vec x^TA$ and $A\vec x$ can now be treated as constant vectors, and apply the above formula for the derivative of a scalar w.r.t. a vector.

$$\frac{d(\vec x^TA\vec x)}{d\vec x} = (\vec x^TA) + (A\vec x)^T = \vec x^T(A+A^T)$$


Here we can better understand the product rule at work if we elaborate on what is happening. Let $A$ be thought of as having $\vec r_1, \cdots, \vec r_n$ as its row vectors, each of size $1 \times n$, or as being made of $\vec c_1, \cdots, \vec c_n$ as column vectors, each of size $n \times 1$.

$$ \vec x^TA\vec x = \left [ \begin{matrix} x_1 & x_2 & \cdots & x_n \\ \end{matrix} \right ] \left [ \begin{matrix} \vec r_1 \\ \vec r_2 \\ \vdots \\ \vec r_n \\ \end{matrix} \right ]_{n\times n} \left [ \begin{matrix} x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{matrix} \right ] $$

From this we can multiply either $\vec x^T$ into $A$ first or $A$ into $\vec x$ first, giving us two ways of seeing this scalar quantity:

$$= \left [ \sum_{i=1}^n x_i\vec r_i \right ]_{1 \times n} \vec x_{n \times 1} = \vec x^T_{1 \times n} \left [ \sum_{i=1}^n x_i\vec c_i \right ]_{n \times 1} $$

Thus we see how the two terms involving $A$ and $A^T$ appear: in one term we combine the rows of $A$ and in the other the columns.
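
A minimal NumPy sketch of this row/column view, checking that $\vec x^TA=\sum_i x_i\vec r_i$ and $A\vec x=\sum_i x_i\vec c_i$; the dimension and random seed are arbitrary choices.

```python
import numpy as np

# Minimal sketch: x^T A is the x_i-weighted sum of the rows of A, and A x is the
# x_i-weighted sum of the columns. The dimension and seed are arbitrary choices.
rng = np.random.default_rng(6)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

row_combination = sum(x[i] * A[i, :] for i in range(n))   # sum_i x_i r_i
col_combination = sum(x[i] * A[:, i] for i in range(n))   # sum_i x_i c_i

print(np.allclose(row_combination, x @ A))   # True
print(np.allclose(col_combination, A @ x))   # True
```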

Aditya P
  • 578