3

I'm not very familiar with multivariable calculus as it relates to matrices. Could someone explain, in detail, why $$\frac{\partial}{\partial x} \left[ x^T A x \right] = (A + A^T)x$$ in the case of a symmetric matrix, and $$\frac{\partial}{\partial x} \left[ x^T A x \right] = 2Ax$$ if the matrix is not symmetric? I'm mainly confused about how we even arrive at the first derivative. However, I understand how the first derivative simplifies to the second in the case that $A$ is symmetric.

  • The derivative of a function $f:\Bbb{R^n}\to\Bbb{R^m}$ is always an $m\times n$ linear map (matrix). $f(x) = x^TAx$ is a function $f:\Bbb{R^n}\to\Bbb{R}$, so its derivative should be a $1\times n$ matrix, a row vector. Machine learning books always take the transpose of the real derivative. – Ninad Munshi Jan 21 '21 at 06:55
  • I think your examples are the wrong way round. Of course, for a symmetric matrix $A$, $A^T + A = 2A$, so it should probably be $(A+A^T)x$ in the general case, from which you easily deduce that in the symmetric case the answer is $2Ax$. – preferred_anon Jan 21 '21 at 15:08
  • https://math.stackexchange.com/q/312077/321264, https://math.stackexchange.com/q/222894/321264 – StubbornAtom Jan 21 '21 at 19:32

4 Answers

3

Some facts and notations:

  • Trace and Frobenius product relation $$\left\langle A, B C\right\rangle={\rm tr}(A^TBC) := A : B C$$
  • Cyclic property of the trace/Frobenius product \begin{align} A : B C &= BC : A \\ &= B^T A : C \\ &= \text{etc.} \end{align}

Let $f := x^T A x = x:Ax$.

Compute the differential first, and then the gradient can be obtained from it. \begin{align} df &= dx:Ax + x: A dx \\ &= Ax:dx + A^Tx:dx \\ &= (A + A^T)x:dx \end{align}

Thus, the gradient is \begin{align} \frac{\partial }{\partial x} \left( x^T Ax \right)= (A + A^T)x. \end{align}

When $A$ is symmetric, i.e., $A^T = A$, then the gradient is $\frac{\partial }{\partial x} \left( x^T Ax \right)= 2Ax$.
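For a concrete sanity check of this formula, here is a minimal numpy sketch (the random test data and the name `f` are purely illustrative, not part of the derivation above): it compares a central-difference approximation of the gradient with $(A+A^T)x$ for a generic, non-symmetric $A$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # generic (non-symmetric) matrix
x = rng.standard_normal(n)

f = lambda y: y @ A @ y           # f(y) = y^T A y

# Central-difference approximation of each partial derivative
eps = 1e-6
num_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)            # e runs over the standard basis vectors
])

print(np.allclose(num_grad, (A + A.T) @ x))  # True
```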

user550103
  • 2,688
3

An alternative approach, though similar: we have

$$\begin{align}f(x+h)&=(x+h)^TA(x+h)\\ &=x^TAx+x^TAh+h^TAx+h^TAh\\ &=x^TAx+x^T(A+A^T)h+h^TAh\\ &=f(x)+\mathrm Df(x)h+o(\vert h\vert), \end{align}$$

where $\mathrm Df(x):h\mapsto x^T(A+A^T)h$ is linear and $h^TAh\in o(\vert h\vert)$. And thus the linear map $\mathrm Df(x)$ is the derivative of $f$ at $x$. The version you were given is the transpose of its matrix representation. If $A$ is symmetric, $A+A^T=2A$.
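As a small illustration of this expansion, the following numpy sketch (test data chosen at random; not part of the argument) checks that the remainder $f(x+h)-f(x)-x^T(A+A^T)h$, which equals $h^TAh$, shrinks faster than $\vert h\vert$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
h0 = rng.standard_normal(n)

f = lambda y: y @ A @ y

# The remainder of the first-order expansion is h^T A h, so the
# ratio remainder / |h| should tend to 0 as |h| -> 0.
for t in [1e-1, 1e-2, 1e-3]:
    h = t * h0
    remainder = f(x + h) - f(x) - x @ (A + A.T) @ h
    print(t, remainder / np.linalg.norm(h))
```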

Vercassivelaunos
  • 13,226
2

We have $$ \begin{split} \frac{\partial}{\partial x} \left[ x^T A x \right]v &= \lim_{h\to0}\frac{(x+hv)^T A (x+hv)-x^T A x}{h} \\ &= \lim_{h\to0}\frac{x^T A hv+(hv)^T A x+(hv)^T A hv}{h} \\ &= x^T A v+v^T A x \\ &= x^T A v+x^T A^T v \\ &= x^T(A + A^T)v \end{split} $$ and so $$ \frac{\partial}{\partial x} \left[ x^T A x \right]= x^T(A + A^T). $$ You can see the row vector $x^T(A + A^T)$ as the column vector $(x^T(A + A^T))^T=(A + A^T)x$, but strictly speaking it is not the same.
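To see both points numerically, here is a short numpy sketch (the test vectors are arbitrary, and the names are mine): it checks the directional derivative against $x^T(A+A^T)v$ and shows that $x^T(A+A^T)$ is a row vector whose transpose is the column vector $(A+A^T)x$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)

f = lambda y: y @ A @ y

# Directional derivative along v via a small finite difference
h = 1e-7
numeric = (f(x + h * v) - f(x)) / h
print(np.isclose(numeric, x @ (A + A.T) @ v, atol=1e-4))  # True

row = x[None, :] @ (A + A.T)     # shape (1, n): the derivative as a row vector
col = (A + A.T) @ x[:, None]     # shape (n, 1): its transpose
print(row.shape, col.shape, np.allclose(row.T, col))
```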

John B
  • 16,854
1

Suppose $x=(x_1,\ldots,x_n)^T$ and $A=(a_{ij})$. Computing the partial derivative with respect to the $k$-th component, we have $$ \begin{split} \frac {\partial x^T A x}{\partial x_k} &= \frac {\partial \left(\sum_{ij} x_i a_{ij}x_j\right)}{\partial x_k} \\ &= \sum_j a_{kj}x_j +\sum_i x_i a_{ik} \\ &=\sum_j (a_{kj} + a_{jk})x_j \\ &=[(A+A^T)x]_k. \end{split} $$ Hence, $\frac {\partial x^T A x}{\partial x}=(A+A^T)x$.

The case when $A$ is not symmetric can be understood in index (tensor) notation:

If $A$ is not symmetric, $a_{ij}$ is not equal to $a_{ji}$ in general. In the expression $x_i a_{ij} x_j$, we differentiate with respect to $x_i$ and $x_j$ separately, which gives the terms $a_{kj}x_j$ and $x_i a_{ik}$ above. Since $i$ and $j$ are dummy indices that can be renamed freely, we relabel $i$ as $j$ in the second sum and combine the two terms. This yields $(a_{kj} + a_{jk})x_j$, where $k$ appears exactly once in the left index and once in the right index of $a$. When $A$ is not symmetric, $a_{kj}+a_{jk}$ is not in general $2a_{kj}$, which is where the difference arises.
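For completeness, here is a small numpy sketch (random test data, illustrative names only) of the index computation above, building the gradient entry by entry from $\sum_j (a_{kj}+a_{jk})x_j$ and comparing it with $(A+A^T)x$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

# k-th entry of the gradient from the index formula sum_j (a_kj + a_jk) x_j
grad_from_indices = np.array([
    sum((A[k, j] + A[j, k]) * x[j] for j in range(n))
    for k in range(n)
])

print(np.allclose(grad_from_indices, (A + A.T) @ x))  # True
```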