The initial problem was the following: let $\mathbf A = (a_{ij})_{1\leq i,j \leq n}$ be an arbitrary square matrix with complex entries and $f(z) = \sum_{m=0}^\infty b_m z^m$ an entire function. Then $$\frac\partial{\partial a_{ij}} \mathrm{tr}\ f(\mathbf A) = \big(f'(\mathbf A)\big)_{ji}.$$
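As a sanity check of this claimed identity, here is a small numerical experiment with $f = \exp$ (so $f' = \exp$ as well), comparing a finite-difference approximation of $\partial_{a_{ij}} \mathrm{tr}\, f(\mathbf A)$ against $\big(f'(\mathbf A)\big)_{ji}$. This is only an illustration (real entries, NumPy/SciPy assumed), not a proof:

```python
# Finite-difference check of  d/da_ij tr f(A) = (f'(A))_ji  for f = exp,
# where f'(A) = exp(A) as well.  Illustration only, with real entries.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
eps = 1e-6

grad_fd = np.zeros((n, n))          # entry (i, j) approximates d/da_ij tr exp(A)
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        grad_fd[i, j] = (np.trace(expm(A + eps * E)) - np.trace(expm(A))) / eps

# (f'(A))_{ji} is the (i, j) entry of f'(A)^T:
print(np.allclose(grad_fd, expm(A).T, atol=1e-4))   # expected: True
```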
Using e.g. Notions of Matrix Differentiation, Differential and derivative of the trace of a matrix and Derivative of the trace of matrix product $(X^TX)^p$, I tried to understand the notion of derivatives with respect to a matrix. So I started with $$\frac\partial{\partial \mathbf A} \mathrm{tr}\ \mathbf A^p = p\big(\mathbf A^T\big)^{p-1}. \tag{$*$}$$ But there seem to be different notions in use; so far I have found two that I would like to relate:
Let $\mathbf A$ be an $m \times n$ matrix with columns $\mathbf a_1, \dots, \mathbf a_n$; then $\mathrm{vec}\ \mathbf A = \begin{pmatrix} \mathbf a_1\\ \vdots \\ \mathbf a_n\end{pmatrix}$ is an $mn\times 1$ column vector. We use Fréchet differentiability, $$f(x+h) = f(x) + \mathrm Df(x)h + r_x(h),$$ where $\mathrm Df(x)$ is the differential, $r_x(h) = o(\lVert h\rVert)$, $\mathrm d f(x,h) = \mathrm Df(x)h = \langle \nabla f(x), h\rangle$, and $\nabla f(x) = \mathrm Df(x)^T$ is the gradient. The differential makes sense if the original function is defined on a ball $B(x,r)$ around $x$ with radius $r$, and $x + h \in B(x,r)$. The differential can then be written as $$\mathrm Df(\mathbf A) = \frac{\partial f(\mathbf A)}{\partial(\mathrm{vec}\ \mathbf A)^T}.$$ The differential is linear and obeys the product rule. Since the trace is linear, we get $\mathrm d \ \mathrm{tr}\ f = \mathrm{tr}(\mathrm df)$, and furthermore $$\mathrm{tr}(\mathbf A^T \mathbf B) = \sum_{j=1}^n\sum_{i=1}^n a_{ij}b_{ij} = (\mathrm{vec}\ \mathbf A)^T \mathrm{vec}\ \mathbf B.$$
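As a quick check of this last identity (with $\mathrm{vec}$ realised as column-major stacking; NumPy assumed, illustration only):

```python
# Check of  tr(A^T B) = (vec A)^T vec B,  where vec stacks the columns
# (column-major order, i.e. order='F' in NumPy).
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

vecA = A.flatten(order='F')
vecB = B.flatten(order='F')
print(np.isclose(np.trace(A.T @ B), vecA @ vecB))   # expected: True
```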
- Can we therefore conclude from the formalism that $\mathrm d \ \mathrm{tr}\ f(\mathbf A) = \mathrm{tr}\big(f'(\mathbf A) \ \mathrm d\mathbf A\big)$, as if $\mathrm d f(\mathbf A) = f'(\mathbf A)\ \mathrm d\mathbf A$? If we simply use this formula, why do we need the transpose $\mathbf A^T$ of $\mathbf A$ in $(*)$?
- How does the notation in 1. (found at Notions of Matrix Differentiation) correspond to the notation I used?
Using the formalism from above we can show that $\mathrm D\,\mathrm{tr}\,\mathbf A^p = p \ \big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T$, since $$\begin{align} \mathrm d\ \mathrm{tr}\,\mathbf A^p &= \mathrm{tr}\ \mathrm d(\mathbf A^p)\\ &= \mathrm{tr} \big( (\mathrm d \mathbf A)\mathbf A^{p-1} + \mathbf A(\mathrm d\mathbf A)\mathbf A^{p-2}+ \dots + \mathbf A^{p-1}(\mathrm d\mathbf A)\big)\\ &= p \ \mathrm{tr}\big(\mathbf A^{p-1}(\mathrm d \mathbf A)\big) && \text{(linearity and cyclic invariance of the trace)}\\ &= p \ \big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T \mathrm d\,\mathrm{vec}\ \mathbf A, \end{align}$$ where the last step uses $\mathrm{tr}\big(\mathbf A^{p-1}\,\mathrm d\mathbf A\big) = \mathrm{tr}\big(\big((\mathbf A^T)^{p-1}\big)^T\,\mathrm d\mathbf A\big) = \big(\mathrm{vec}((\mathbf A^T)^{p-1})\big)^T \mathrm{vec}(\mathrm d\mathbf A)$. Thus we have $$\begin{align} \mathrm d \ \mathrm{tr}\,\mathbf A^p &= p \ \big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T \mathrm d\,\mathrm{vec}\ \mathbf A\\ \mathrm D\ \mathrm{tr}\,\mathbf A^p &= p \ \big(\mathrm{vec}\big((\mathbf A^T)^{p-1}\big)\big)^T \end{align}$$
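A finite-difference check of this result for an unconstrained (non-symmetric) $\mathbf A$ (NumPy assumed, illustration only):

```python
# Finite-difference check of  d/da_ij tr A^p = p ((A^T)^{p-1})_{ij},
# i.e. the matrix form of  D tr A^p = p (vec((A^T)^{p-1}))^T.
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 3
A = rng.standard_normal((n, n))
eps = 1e-6

def tr_pow(M, p):
    return np.trace(np.linalg.matrix_power(M, p))

grad_fd = np.zeros((n, n))          # entry (i, j) approximates d/da_ij tr A^p
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        grad_fd[i, j] = (tr_pow(A + eps * E, p) - tr_pow(A, p)) / eps

print(np.allclose(grad_fd, p * np.linalg.matrix_power(A.T, p - 1), atol=1e-4))  # True
```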
Now an easy example: let $$\mathbf A = \begin{pmatrix} x & z\\ z & y\end{pmatrix}, \qquad \mathbf B = \begin{pmatrix} x & v\\ w & y\end{pmatrix},$$ then $$\mathbf A^2 = \begin{pmatrix} x^2+z^2 & z(x+y)\\ z(x+y) & y^2+z^2\end{pmatrix}, \qquad \mathbf B^2 = \begin{pmatrix} x^2+vw & v(x+y)\\ w(x+y) & y^2+vw\end{pmatrix},$$ $$\mathrm{tr}\ \mathbf A^2 = x^2+y^2+2z^2, \qquad \mathrm{tr}\ \mathbf B^2 = x^2+y^2+2vw,$$ and hence, differentiating with respect to the parameters, $$\frac\partial{\partial \mathbf A}\mathrm{tr}\ \mathbf A^2 = \begin{pmatrix} 2x & 4z\\ 4z & 2y\end{pmatrix} \neq 2(\mathbf A^T)^{2-1}, \qquad \frac\partial{\partial \mathbf B}\mathrm{tr}\ \mathbf B^2 = \begin{pmatrix} 2x & 2w\\ 2v & 2y\end{pmatrix} = 2(\mathbf B^T)^{2-1}.$$
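The same example can be reproduced symbolically (SymPy assumed, illustration only); note that $z$ parametrises two entries of $\mathbf A$ at once, while all four entries of $\mathbf B$ are independent variables:

```python
# Symbolic version of the 2x2 example: z sits in two entries of A,
# whereas the entries of B are independent variables.
import sympy as sp

x, y, z, v, w = sp.symbols('x y z v w')
A = sp.Matrix([[x, z], [z, y]])     # constrained: a12 = a21 = z
B = sp.Matrix([[x, v], [w, y]])     # unconstrained off-diagonal entries

trA2 = (A**2).trace()               # x**2 + y**2 + 2*z**2
trB2 = (B**2).trace()               # x**2 + y**2 + 2*v*w

print(sp.diff(trA2, z))                    # 4*z,  but 2*(A.T)[0, 1] is 2*z
print(sp.diff(trB2, w), 2 * (B.T)[1, 0])   # 2*v  2*v,  matching (*)
```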
- Where is the problem, given that the formula should hold for any square matrix?
- Can the initial problem be solved using Einstein/index notation?
- Can the initial problem be solved by using that $$\mathrm{tr}\,\mathbf A^p = \sum_{i_1,\dots,i_p=1}^n a_{i_1 i_2}\cdots a_{i_{p-1} i_p}\,a_{i_p i_1}\,?$$
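For reference, a direct numerical check of this index formula for the trace (NumPy assumed, illustration only):

```python
# Check of  tr A^p = sum over i_1,...,i_p of a_{i_1 i_2} a_{i_2 i_3} ... a_{i_p i_1}.
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 3, 4
A = rng.standard_normal((n, n))

total = 0.0
for idx in itertools.product(range(n), repeat=p):
    term = 1.0
    for k in range(p):
        term *= A[idx[k], idx[(k + 1) % p]]   # cyclic index chain i_p -> i_1
    total += term

print(np.isclose(total, np.trace(np.linalg.matrix_power(A, p))))   # expected: True
```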