
I am referring to Tom Minka's Old and New Matrix Algebra Useful for Statistics. I don't have the book by Magnus & Neudecker so I can't refer to the details of the theory.

I am not clear on how to apply rule (6), $d(XY) = (dX)Y + X(dY)$, and rule (12), $dX^*=(dX)^*$. I use numerator layout, i.e. $\dfrac{dx}{dx} = I$.

Question 1.

For $f(x)=x^Tx$, the expected result is $\dfrac{df}{dx}=2x^T$.

However, suppose I write $\dfrac{df}{dx}= x^T\dfrac{dx}{dx} + \dfrac{dx^T}{dx}x$. First, is $\dfrac{dx^T}{dx}$ equal to $1^T$? Second, does rule (12) give $\dfrac{dx^T}{dx} = \left(\dfrac{dx}{dx}\right)^T = I^T = I$?

Question 2.

$f(x) = x^TAx$

$\dfrac{df}{dx}=x^T\dfrac{dAx}{dx}+ \dfrac{dx^T}{dx}(Ax) = x^TA + ???$

$???$ is supposed to be $x^TA^T$. However, it seems to me that whether $\dfrac{dx^T}{dx}$ equals $1^T$ or $I$, it does not give the expected result.

loct
  • The link is useful, but it does not entirely match my question. I still have no idea what $\dfrac{dx^T}{dx}$ is (in numerator-layout notation, which I insist on using). – loct Feb 20 '21 at 12:07
  • I believe it is unwise to use the same notation for scalar derivatives, gradients and Jacobians. I prefer to compute things via directional derivatives, rather than resorting to extremely terse cookbooks. – Rodrigo de Azevedo Feb 20 '21 at 12:10
  • If that is the case, what do you think $\dfrac{dx^T}{dx}$ is? – loct Feb 20 '21 at 12:14
  • Frankly, I don't. And I prefer it that way. I go via directional derivatives and scalar products and avoid such weird things. This is how I think about it. – Rodrigo de Azevedo Feb 20 '21 at 12:15

2 Answers

  • Let's use the numerator-layout notation. First note that $\frac{dx}{dx}=I$ but $\frac{dx^T}{dx}=\begin{bmatrix}\begin{pmatrix}1&0&\dots&0\end{pmatrix},\begin{pmatrix}0&1&0&\dots&0\end{pmatrix},\dots,\begin{pmatrix}0&0&\dots&0&1\end{pmatrix}\end{bmatrix}$, a tensor, technically $1 \times n \times n$. In denominator layout, $\frac{dx^T}{dx}=\left[\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix},\begin{pmatrix}0\\1\\0\\\vdots\\0\end{pmatrix},\dots,\begin{pmatrix}0\\\vdots\\0\\1\end{pmatrix}\right]$, an $n \times 1 \times n$ tensor. You can picture it as a 3D array with the entries stacked behind one another rather than listed side by side like this.

  • The inner product is symmetric, i.e. $x^Ty=y^Tx=\langle x, y\rangle$. Applying the derivative directly to $x^Tx$ and $x^TAx$ gives the following four cases:

$$\begin{matrix}&\text{denominator layout}&\text{numerator layout}\\ \frac{d}{dx}x^Tx&\frac{dx}{dx}x+\frac{dx}{dx}x=2x&x^T\frac{dx}{dx}+x^T\frac{dx}{dx}=2x^T\\ \frac{d}{dx}x^TAx&\frac{dAx}{dx}x+\frac{dx}{dx}Ax=(A^T+A)x&x^T\frac{dAx}{dx}+x^TA^T\frac{dx}{dx}=x^T(A+A^T)\end{matrix}$$

Therefore the rule is $\frac{d}{dx}\langle x, y\rangle=\frac{dx}{dx}y+\frac{dy}{dx}x$ in denominator layout and $\frac{d}{dx}\langle x, y\rangle=x^T\frac{dy}{dx}+y^T\frac{dx}{dx}$ in numerator layout. Multiplying through by $dx$ suggests that $d\langle x, y\rangle=(dx)y+(dy)x$ in denominator layout but $d\langle x, y\rangle=x^T(dy)+y^T(dx)$ in numerator layout. It is therefore doubtful that $d\langle x, y\rangle$ can be treated as a scalar whose transpose you can take freely.
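The numerator-layout rule above can be checked numerically. The following is only a minimal sketch in plain Python: the helper names (`matvec`, `vecmat`, `jacobian`) and the particular $A$ and $x$ are illustrative assumptions, not part of the original answer, and the Jacobians are approximated by central differences.

```python
def matvec(M, v):
    # Column result: (M v)_i = sum_j M_ij v_j
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def vecmat(v, M):
    # Row result: (v^T M)_j = sum_i v_i M_ij
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def jacobian(func, x, h=1e-6):
    # Numerator layout: entry (i, j) approximates d func_i / d x_j
    # using central differences with step h.
    m = len(func(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

# Arbitrary (assumed) test data; A is deliberately non-symmetric.
A = [[1.0, 2.0, 0.0], [-1.0, 3.0, 4.0], [0.5, 0.0, 2.0]]
x = [0.3, -1.2, 2.0]

def y(v):
    return matvec(A, v)          # y(x) = A x

def f(v):
    return [dot(v, y(v))]        # scalar x^T A x, wrapped as a length-1 vector

# Product rule in numerator layout: x^T (dy/dx) + y^T (dx/dx)
Jy = jacobian(y, x)                      # approximately A
Jx = jacobian(lambda v: list(v), x)      # approximately I
row_rule = [a + b for a, b in zip(vecmat(x, Jy), vecmat(y(x), Jx))]

# Direct finite-difference derivative of f: a 1 x n row
row_direct = jacobian(f, x)[0]

# Closed form from the table: x^T (A + A^T)
A_plus_AT = [[A[i][j] + A[j][i] for j in range(3)] for i in range(3)]
row_closed = vecmat(x, A_plus_AT)

print(row_rule)
print(row_direct)
print(row_closed)
```

The three printed rows should agree up to finite-difference error, which is consistent with the numerator-layout entry $x^T(A+A^T)$ in the table.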

Conclusion

You should directly use the product rule.

Vons
  • It seems we are not using the same notation. To me, $\dfrac{dAx}{dx} = A$ rather than $A^T$. – loct Feb 20 '21 at 04:50
  • @Stacker Yes, the notation matters. See Matrix calculus – Jackozee Hakkiuz Feb 20 '21 at 07:08
  • @JackozeeHakkiuz Thx for the link. – Vons Feb 20 '21 at 17:53
  • @loct I included how to take the derivative in both denominator layout and numerator layout. This suggests that it would be strange to take the transpose of $d\langle x, y\rangle$. – Vons Feb 20 '21 at 17:55
  • @Stacker For now I have reservations about $\dfrac{dx^T}{dx}$ , because it seems to me people have not agreed on its uses (or if it makes sense), but I will take your idea as a reference. Thank you. – loct Feb 21 '21 at 01:26

These rules pertain to differentials, not to gradients.

Let's use them properly, starting with your second example function. Each term of $df_2$ is a scalar and therefore equals its own transpose, which lets us move $dx$ to the right:

$$\eqalign{ f_2 &= x^TAx \\ df_2 &= (dx)^TAx+x^TA\,dx \\ &= (Ax)^Tdx+(A^Tx)^Tdx \\ &= (Ax+A^Tx)^Tdx \\ \frac{\partial f_2}{\partial x} &= (Ax+A^Tx) \\ }$$

Setting $A=I$ turns this into your first function. Therefore

$$\eqalign{ \frac{\partial f_1}{\partial x} &= (Ix+I^Tx) \;=\; 2x \\ }$$

There are no corresponding rules for gradients, because a gradient operation changes a vector into a matrix, and matrix multiplication is not commutative. Trying to apply the rules to gradients produces nonsense, as you have discovered.
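As a sanity check on the differential $df_2 = (Ax+A^Tx)^Tdx$, here is a small sketch in plain Python. The function names (`f`, `grad_f`) and the values of $A$, $x$, and $dx$ are assumptions chosen for illustration. Because $f_2$ is quadratic, the gap between the actual change and the first-order prediction should be exactly the second-order term $dx^TA\,dx$.

```python
def matvec(M, v):
    # (M v)_i = sum_j M_ij v_j
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(A, x):
    # f(x) = x^T A x
    return dot(x, matvec(A, x))

def grad_f(A, x):
    # Column-vector gradient g = A x + A^T x, so that df = g^T dx
    Ax = matvec(A, x)
    ATx = matvec(transpose(A), x)
    return [a + b for a, b in zip(Ax, ATx)]

# Arbitrary (assumed) test data; A is deliberately non-symmetric.
A = [[1.0, 2.0, 0.0], [-1.0, 3.0, 4.0], [0.5, 0.0, 2.0]]
x = [0.3, -1.2, 2.0]
dx = [1e-4, -2e-4, 3e-4]    # a small perturbation

g = grad_f(A, x)
x_new = [xi + di for xi, di in zip(x, dx)]

actual = f(A, x_new) - f(A, x)   # true change in f
predicted = dot(g, dx)           # first-order change g^T dx
second_order = f(A, dx)          # dx^T A dx, the neglected term

print(actual, predicted, second_order)

# Special case A = I recovers the gradient 2x of x^T x.
I = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
g_identity = grad_f(I, x)
print(g_identity)
```

The first-order prediction matches the actual change up to the second-order term, and the $A=I$ case gives $2x$, in line with the derivation above.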

greg
  • You have identified the core problem in my derivation. However, I think there is a small mistake in your final answer: the transpose is missing. – loct Feb 20 '21 at 12:04
  • Given the relationship $\,df = g^Tdx,\;$ the term gradient refers to the column vector $g$. Some people use the opposite layout convention and insist that the row vector $g^T$ is the gradient, but that choice is inconsistent and problematic. – greg Feb 20 '21 at 13:27
  • Do I understand correctly that the "opposite layout convention" you are talking about is the numerator-layout convention? I am confused: when $f$ is $\mathbb{R}^n\to\mathbb{R}^m$, the derivative (Jacobian) is $m$ by $n$. When $f$ is $\mathbb{R}^n\to\mathbb{R}$, the derivative is $n$ by 1? – loct Feb 20 '21 at 14:48
  • The convention nonsense is a shortcoming of the matrix notation used by mathematicians. Engineers and physicists use explicit dot products (or index notation) which makes differentiation utterly trivial. For a scalar function $$\eqalign{ \phi&=g\cdot x&\implies d\phi=g\cdot dx\\ \phi&=g_kx_k&\implies d\phi=g_k\,dx_k }$$ and for a vector function $$\eqalign{ y&=A\cdot x&\implies dy=A\cdot dx \\ y_n&=A_{nk}x_k&\implies dy_n=A_{nk}\,dx_k \\ }$$ – greg Feb 20 '21 at 15:57