
The initial problem was the following: let $\mathbf A = (a_{ij})_{1\leq i,j \leq n}$ be an arbitrary square matrix with complex entries and $f(z) = \sum_{m=0}^\infty b_m z^m$ an entire function; show that $$\frac\partial{\partial a_{ij}} \mathrm{tr}\, f(\mathbf A) = \big(f'(\mathbf A)\big)_{ji}.$$

Using e.g. Notions of Matrix Differentiation, Differential and derivative of the trace of a matrix, and Derivative of the trace of matrix product $(X^TX)^p$, I tried to understand the notions of derivatives of a matrix. So I started with $$\frac\partial{\partial \mathbf A} \mathrm{tr}\, \mathbf A^p = p\big(\mathbf A^T\big)^{p-1} \tag{$*$}$$ But there seem to be different notions; at least, I found two notions to relate:

Let $\mathbf A$ be an $m \times n$ matrix with columns $\mathbf a_1,\dots,\mathbf a_n$; then $\mathrm{vec}\ \mathbf A = \begin{pmatrix} \mathbf a_1\\ \vdots \\ \mathbf a_n\end{pmatrix}$ is an $mn\times 1$ column vector. And we use Fréchet differentiability, $$f(x+h) = f(x) + \mathrm Df(x)h + r_x(h),$$ where $\mathrm Df(x)$ is the differential, $\mathrm d f(x,h) = \mathrm Df(x)h = \langle \nabla f(x), h\rangle$, and $\nabla f(x) = \mathrm Df(x)^T$ is the gradient. So the differential makes sense if the original function is defined on a ball $B(x,r)$ around $x$ with radius $r$, and $x + h \in B(x,r)$. Then the differential is $$\mathrm Df(\mathbf A) = \frac{\partial f(\mathbf A)}{\partial(\mathrm{vec}\ \mathbf A)^T}.$$ The differential is linear and obeys the product rule. Since the trace is linear, we get $\mathrm d \,\mathrm{tr}\, f = \mathrm{tr}(\mathrm df)$, where $$\mathrm{tr}(\mathbf A^T \mathbf B) = \sum_{j=1}^n\sum_{i=1}^m a_{ij}b_{ij} = (\mathrm{vec}\ \mathbf A)^T \mathrm{vec}\ \mathbf B.$$
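To make sure I am reading the trace/vec identity correctly, here is a quick numerical sketch (NumPy, with a column-major `vec`; the helper name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

def vec(M):
    return M.flatten(order='F')     # stack the columns of M into one long vector

lhs = np.trace(A.T @ B)             # tr(A^T B)
rhs = vec(A) @ vec(B)               # (vec A)^T vec B
print(np.isclose(lhs, rhs))         # True
```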

  1. Can we therefore conclude $\mathrm d \,\mathrm{tr}\, f(\mathbf A) = \mathrm{tr}\big(f'(\mathbf A) \,\mathrm d\mathbf A\big)$ from the formalism, since $\mathrm d f(\mathbf A) = f'(\mathbf A)\,\mathrm d\mathbf A$? If we simply use this formula, why do we need the transpose $\mathbf A^T$ of $\mathbf A$ in $(*)$?
  2. How does the notation in 1. (found at Notions of Matrix Differentiation) correspond to the notation I used?

Using the formalism from above we can show that $\mathrm D\,\mathrm{tr}\,\mathbf A^p = p \,\big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T$, since $$\begin{align} \mathrm d\,\mathrm{tr}\,\mathbf A^p &= \mathrm{tr}\,\mathrm d(\mathbf A^p)\\ &= \mathrm{tr} \big( (\mathrm d \mathbf A)\mathbf A^{p-1} + \mathbf A(\mathrm d\mathbf A)\mathbf A^{p-2}+ \dots + \mathbf A^{p-1}(\mathrm d\mathbf A)\big)\\ &= p \,\mathrm{tr}\big(\mathbf A^{p-1}\,\mathrm d \mathbf A\big) && \text{(linearity and cyclic permutation)}\\ &= p \,\big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T \mathrm d\,\mathrm{vec}\,\mathbf A. \end{align}$$ Thus we have $$\begin{align} \mathrm d\,\mathrm{tr}\,\mathbf A^p &= p \,\big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T \mathrm d\,\mathrm{vec}\,\mathbf A,\\ \mathrm D\,\mathrm{tr}\,\mathbf A^p &= p \,\big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T. \end{align}$$
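As a sanity check that this unconstrained formula is not itself the problem, here is a finite-difference sketch for a generic (non-symmetric) matrix, treating all $n^2$ entries as independent (NumPy; the choice of $n$ and $p$ is mine):

```python
import numpy as np
from numpy.linalg import matrix_power

def f(A, p):
    return np.trace(matrix_power(A, p))       # f(A) = tr(A^p)

rng = np.random.default_rng(1)
n, p, eps = 3, 4, 1e-6
A = rng.standard_normal((n, n))               # generic, *not* symmetric

G = p * matrix_power(A.T, p - 1)              # claimed gradient p (A^T)^(p-1)

G_fd = np.zeros_like(A)                       # central differences over all n^2 entries
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A); E[i, j] = eps
        G_fd[i, j] = (f(A + E, p) - f(A - E, p)) / (2 * eps)

print(np.allclose(G, G_fd, atol=1e-4))        # True for a generic matrix
```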

Now an easy example: Let $$\mathbf A = \begin{pmatrix} x & z\\ z & y\end{pmatrix}, \qquad \mathbf B = \begin{pmatrix} x & v\\ w & y\end{pmatrix},$$ so that $\mathbf A$ has three independent entries while $\mathbf B$ has four. Then $$\mathbf A^2 = \begin{pmatrix} x^2+z^2 & (x+y)z\\ (x+y)z & y^2+z^2\end{pmatrix}, \qquad \mathbf B^2 = \begin{pmatrix} x^2+vw & (x+y)v\\ (x+y)w & y^2+vw\end{pmatrix},$$ $$\mathrm{tr}\ \mathbf A^2 = x^2+y^2+2z^2, \qquad \mathrm{tr}\ \mathbf B^2 = x^2+y^2+2vw,$$ and hence, differentiating with respect to the independent variables and arranging the results in matrix form, $$\frac\partial{\partial \mathbf A}\mathrm{tr}\ \mathbf A^2 = \begin{pmatrix} 2x & 4z\\ 4z & 2y\end{pmatrix} \neq 2(\mathbf A^T)^{2-1}, \qquad \frac\partial{\partial \mathbf B}\mathrm{tr}\ \mathbf B^2 = \begin{pmatrix} 2x & 2w\\ 2v & 2y\end{pmatrix} = 2(\mathbf B^T)^{2-1}.$$
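The factor $4z$ can be reproduced symbolically by differentiating the trace with respect to the three independent variables of the symmetric matrix (a SymPy sketch):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')
A = sp.Matrix([[x, z], [z, y]])               # symmetric: only three independent entries

tr_A2 = (A**2).trace()                        # x**2 + y**2 + 2*z**2
grad = sp.Matrix([[sp.diff(tr_A2, x), sp.diff(tr_A2, z)],
                  [sp.diff(tr_A2, z), sp.diff(tr_A2, y)]])

print(grad)        # Matrix([[2*x, 4*z], [4*z, 2*y]])
print(2 * A.T)     # Matrix([[2*x, 2*z], [2*z, 2*y]])  -> differs in the off-diagonal
```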

  1. Where is the problem? The formula should hold for any square matrix.
  2. Can the initial problem be solved using Einstein/index notation?
  3. Can the initial problem be solved by using that $$\mathrm{tr}\,\mathbf A^p = \sum_{i_1,\dots,i_p=1}^n a_{i_1i_2}\cdots a_{i_{p-1}i_p}a_{i_pi_1}\;?$$
wueb
  • I have written an explanation elsewhere regarding this. https://math.stackexchange.com/a/4520705/66158 – me10240 Aug 29 '22 at 03:22

1 Answer


Congratulations, you've discovered something very subtle about matrix calculus! In section 2.8 of the Matrix Cookbook, there is a discussion of "Structured Matrices" which addresses situations like this.

Let $G$ denote the gradient as calculated by the trace formula, without regard to any special structure that the matrix may have. Now you wish to enforce a symmetry constraint.

The recipe for the constrained gradient in this case is $$\eqalign{ S &= G + G^T - I\circ G \cr }$$ where $(\circ)$ denotes the Hadamard (aka elementwise) product.


Note however that you should still use $G$, and not $S$, to calculate the differential of the function $$\eqalign{ df &= \sum_{i,j} G_{ij}\,dA_{ij} \neq \sum_{i,j} S_{ij}\,dA_{ij} \cr }$$ because the off-diagonal terms will be counted twice by a naive summation.

If you want to use $S$ to calculate the differential, then you must do the sum more carefully $$\eqalign{ df &= \sum_{i\geq j} S_{ij}\,dA_{ij} \cr }$$
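Here is a small numerical sketch of both points, using $f(A)={\rm tr}(A^2)$ (so that $G=2A^T$) at a symmetric point with a symmetric perturbation; the helper names are mine, not from the Cookbook:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 4, 1e-6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2              # symmetric point
dA = rng.standard_normal((n, n)); dA = eps * (dA + dA.T) / 2    # symmetric perturbation

f = lambda M: np.trace(M @ M)                 # f(A) = tr(A^2)
G = 2 * A.T                                   # gradient ignoring the symmetry structure
S = G + G.T - np.eye(n) * G                   # structured gradient S = G + G^T - I o G

df_exact = f(A + dA) - f(A)
df_G = np.sum(G * dA)                         # sum over all (i, j)
df_S = np.sum(np.tril(S * dA))                # sum over i >= j only
print(np.isclose(df_exact, df_G, rtol=1e-4),
      np.isclose(df_exact, df_S, rtol=1e-4))  # True True
```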

Update

The paper linked by Albus in the comments proves a very interesting identity.
Any matrix, whether symmetric or not, satisfies the following $$\eqalign{ {\rm vech}\big(X+X^T-I\circ X\big) &= D^T {\rm vec}(X) \\ }$$ where $D$ is the Duplication matrix, which was originally defined to recover the full vectorization of a symmetric matrix from its half-vectorized form $$\eqalign{ {\rm vec}(A) &= D\;{\rm vech}(A) \\ }$$ Using these results, we have three ways of writing the differential of a function. $$\eqalign{ df &= G:dA \qquad&\big({\rm Matrix\,form}\big) \\ &= {\rm vec}(G):{\rm vec}(dA) \qquad&\big({\rm Vec\,form}\big) \\ &= {\rm vech}(S):{\rm vech}(dA) \qquad&\big({\rm Half\,vec\,form}\big) \\ }$$ The last expression is only valid when $A=A^T$; the others are valid for all matrices.
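For a concrete check of this identity, you can build $D$ explicitly for small $n$; the following sketch assumes the usual column-major ${\rm vec}/{\rm vech}$ conventions, and the helper functions are my own:

```python
import numpy as np

def vec(M):
    return M.flatten(order='F')               # stack columns

def vech(M):
    n = M.shape[0]
    return np.concatenate([M[j:, j] for j in range(n)])   # lower triangle, column by column

def duplication_matrix(n):
    """D such that vec(A) = D vech(A) for symmetric A."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1             # position (i, j) in vec
            if i != j:
                D[j + i * n, col] = 1         # mirrored position (j, i)
            col += 1
    return D

rng = np.random.default_rng(3)
n = 4
X = rng.standard_normal((n, n))               # arbitrary, not symmetric
D = duplication_matrix(n)

lhs = vech(X + X.T - np.eye(n) * X)
rhs = D.T @ vec(X)
print(np.allclose(lhs, rhs))                  # True
```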

The derivatives, with respect to the vector of fully independent components, can be calculated in the form of a half-vec, and then reshaped into a matrix. $$\eqalign{ g_{s} &= \frac{\partial f}{\partial {\rm vech}(A)} = {\rm vech}(S) \\ S &= {\rm vech}^{-1}\big(g_{s}\big) \\ }$$ The question comes down to terminology: in what sense can $S$ be called the gradient?
It certainly behaves like a gradient in the half-vec space.

NB:   The colon product used above is defined as $$A:B = {\rm Tr}(A^TB) = {\rm Tr}(AB^T)$$ and is applicable to vectors as well as matrices.


Update #2

This update is to answer another question raised in the comments:

Given a function $f=f(A)$ what is the "best" way to calculate the gradient?

IMHO, the best way to perform such an analysis is to introduce an unconstrained matrix $X$ and use it to construct the matrix $A$ so as to satisfy any constraints.

For example, the construction for an SPD constraint might be $A = XX^T$
in which case the gradient calculation would be $$\eqalign{ df &= G_a:dA \\ &= G_a:\big(dX\,X^T+X\,dX^T\big) \\ &= \big(G_a+G_a^T\big)\,X:dX \\ G_x = \frac{\partial f}{\partial X} &= \big(G_a+G_a^T\big)\,X \\ }$$ where $G_a$ is a well-known gradient for an arbitrary matrix from a trusted reference.
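As a sketch of this recipe, take $f(A)={\rm tr}(A^3)$ (so $G_a=3(A^T)^2$) and compare $G_x$ with finite differences over the unconstrained entries of $X$; the example function is my own choice:

```python
import numpy as np
from numpy.linalg import matrix_power

def f_of_A(A):
    return np.trace(matrix_power(A, 3))       # example objective f(A) = tr(A^3)

def f_of_X(X):
    return f_of_A(X @ X.T)                    # impose the constraint A = X X^T

rng = np.random.default_rng(4)
n, eps = 3, 1e-6
X = rng.standard_normal((n, n))
A = X @ X.T

G_a = 3 * matrix_power(A.T, 2)                # unconstrained gradient of tr(A^3)
G_x = (G_a + G_a.T) @ X                       # chain rule through A = X X^T

G_fd = np.zeros_like(X)                       # finite differences in the free entries of X
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X); E[i, j] = eps
        G_fd[i, j] = (f_of_X(X + E) - f_of_X(X - E)) / (2 * eps)

print(np.allclose(G_x, G_fd, atol=1e-4))      # True
```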

But now $G_x$ is a gradient which you can use to calculate (via gradient descent, conjugate gradients, etc.) a solution to your problem $X=X_s$, after which the corresponding constrained matrix can be constructed as $A_s = X_s X_s^T$.

Some other useful constructions are $$\eqalign{ A &= I\circ X \qquad&\big(A{\rm \;is\,diagonal}) \\ A &= P\circ X \qquad&\big(A{\rm \;is\,patterned}) \\ A &= X-X^T \qquad&\big(A{\rm \;is\,skew\,symmetric}) \\ A &= \left(\frac{2I+X-X^T}{2I-X+X^T}\right) \qquad&\big(A{\rm \;is\,orthogonal}) \\ }$$ In the case of a symmetric constraint, you can use the obvious construction $$A=\tfrac{1}{2}(X+X^T) \;\doteq\; {\rm sym}(X)$$ and calculate the gradient as $$\eqalign{ df &= G_a:dA \\ &= G_a:{\rm sym}(dX) \\ &= {\rm sym}(G_a):dX \\ G_x = \frac{\partial f}{\partial X} &= \tfrac{1}{2}\big(G_a+G_a^T\big) \\ }$$ and this is precisely the result of Panda et al.

Now consider an alternate construction based on the unconstrained vector $$x = {\rm vech}(A) \quad\iff\quad A={\rm vech}^{-1}(x)$$ whose gradient calculation is $$\eqalign{ df &= G:dA \\ &= {\rm vec}(G):{\rm vec}(dA) \\ &= {\rm vec}(G):D\,dx \\ &= D^T{\rm vec}(G):dx \\ &= {\rm vech}(G+G^T-I\circ G):dx \\ &= {\rm vech}(S):dx \\ g_x = \frac{\partial f}{\partial x} &= {\rm vech}(S) \\ &= E\;{\rm vec}(S) \\ &= E\,(g+Kg-{\rm vec}(I)\circ g) \\ &= E(I+K-Y)\,g \\ G_x &= {\rm vech}^{-1}(g_x) \\ }$$ where $(D,E,K)$ are the (duplication, elimination, commutation) matrices associated with Kronecker products, $\,g={\rm vec}(G),\,$ and $\,Y={\rm Diag}\big({\rm vec}(I)\big).$
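Here is a small sketch of this calculation, building $E$, $K$, and $Y$ explicitly for small $n$ and checking that $E(I+K-Y)\,g$ reproduces ${\rm vech}(S)$; column-major conventions are assumed and the helper functions are mine:

```python
import numpy as np

def vec(M):
    return M.flatten(order='F')

def vech(M):
    n = M.shape[0]
    return np.concatenate([M[j:, j] for j in range(n)])

def elimination_matrix(n):
    """E such that vech(A) = E vec(A)."""
    E = np.zeros((n * (n + 1) // 2, n * n))
    row = 0
    for j in range(n):
        for i in range(j, n):
            E[row, i + j * n] = 1
            row += 1
    return E

def commutation_matrix(n):
    """K such that K vec(A) = vec(A^T)."""
    K = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            K[j + i * n, i + j * n] = 1
    return K

rng = np.random.default_rng(6)
n = 4
G = rng.standard_normal((n, n))               # arbitrary unconstrained gradient
S = G + G.T - np.eye(n) * G

E, K = elimination_matrix(n), commutation_matrix(n)
Y = np.diag(vec(np.eye(n)))
g = vec(G)

g_x = E @ (np.eye(n * n) + K - Y) @ g
print(np.allclose(g_x, vech(S)))              # True
```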

This is the gradient that other authors have in mind, although they shouldn't write it as a matrix; instead, they should work with the underlying unconstrained $g_x$ vector.

greg
  • It seems that the gradient may be different, as proved in this new paper https://arxiv.org/pdf/1911.06491.pdf – yes Jul 27 '20 at 16:37
  • @Albus Looks like a very interesting paper. Thanks. – greg Jul 27 '20 at 18:00
  • @Albus I updated the answer using a formula from the paper, but I'm not sure it's an interpretation that its authors would endorse. – greg Jul 27 '20 at 21:50
  • The authors claim that the correct gradient is $(G+G')/2$ though. Moreover, according to you, if I am interested in finding the gradient of the function, not the differential, what is the "best" way? Try to perturb the function and see what varies linearly in the perturbation (and then use the formula for the constrained gradient) or another way? – yes Jul 28 '20 at 14:12
  • @Albus I updated the answer again. In my opinion, differentials provide the most fool-proof approach to matrix calculus, and they are entirely equivalent to first-order perturbation theory. – greg Jul 28 '20 at 18:13
  • clear. Suppose that now I have both the gradient of a function and the relative first-order differential. I want to find the second order differential (and the Hessian). Should I perturb now the gradient or the differential? – yes Jul 29 '20 at 14:26
  • I will be more clear. I have the expression for the gradient and for the differential of a function ( precisely $f(X)=logdet(X)$), and we know that the gradient contained in the differential is different from the "real" constrained gradient that one would use for gradient descent for instance. Now I want to find the Hessian. Is it correct to perturb the first-order differential (with the unconstrained gradient) ? And in such a case I will find a scalar quadratic form, but then to extrapolate the "real" constrained Hessian I may need the same reasoning that we are now applying to the gradient – yes Jul 29 '20 at 15:10
  • I'd be happy to answer this as a new question. Comments are not meant for prolonged discussions. – greg Jul 29 '20 at 15:35
  • Ok. I will create a new question today or tomorrow. I will let you know here when I complete it. – yes Jul 29 '20 at 15:40
  • Here we are, I created a new question at this link: https://math.stackexchange.com/questions/3773532/hessian-of-fx-when-x-is-a-symmetric-matrix Thank you. – yes Jul 29 '20 at 16:21