
For the rule below:

$$ \frac{\partial J}{\partial \mathbf{A}}= -\mathbf{A}^{-T} \frac{\partial J}{\partial \mathbf{W}} \mathbf{A}^{-T} $$

where $\mathbf{A}$ is an invertible square matrix, $\mathbf{W}$ is the inverse of $\mathbf{A}$, and $J$ is a scalar function (see the end of Section 2.2 in the Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf).

Does this rule hold if $\mathbf{A}$ is a symmetric matrix?

mes

2 Answers


If $A$ is symmetric but not invertible, the rule won't hold, since the inverse of $A$ isn't even defined.

And if $A$ is symmetric and invertible... then it is invertible, and the formula holds just as it does for any invertible matrix.
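
To see this numerically, here is a quick finite-difference check (a NumPy sketch; the test function $J(W)={\rm Tr}(C^TW)$ and all the particular values are my own arbitrary choices): treat every entry of a symmetric invertible $A$ as an independent variable and compare against the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
S = rng.standard_normal((n, n))
A = S + S.T + 10 * np.eye(n)         # symmetric and safely invertible
C = rng.standard_normal((n, n))      # J(W) = Tr(C^T W), so dJ/dW = C

def J(A):
    return np.sum(C * np.linalg.inv(A))

Ainv = np.linalg.inv(A)
analytic = -Ainv.T @ C @ Ainv.T      # the rule: -A^{-T} (dJ/dW) A^{-T}

# central finite differences, perturbing each entry of A independently
# (i.e. the symmetry of A is *not* enforced during the perturbation)
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        numeric[i, j] = (J(A + E) - J(A - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))    # True
```

Note that this check treats all $n^2$ entries of $A$ as independent; what happens when the symmetry constraint itself is enforced is a different question, taken up in the other answer.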

  • My only worry was that when the Matrix Cookbook introduces derivatives of structured matrices (including symmetric matrices), it states that the results given earlier (including the rule above) do not hold in general if the matrix is structured. – mes Jan 29 '21 at 15:38

$ \def\b{\bullet} \def\e{\varepsilon} \def\m#1{\left[\begin{array}{c}#1\end{array}\right]} \def\p#1#2{\frac{\partial #1}{\partial #2}} $I really like The Matrix Cookbook but the section on structured matrices is not very good, so here's a different approach to the subject.

Given a vector of parameters $\{p\}$ and matrix basis $\{B_i\}$ $$\eqalign{ p &= \m{\alpha \\ \beta},\qquad B_1 = \m{1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0},\qquad B_2 = \m{0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0} \\ }$$ create a structured matrix $\{A\}$ and cost function $\{\phi\}$ $$\eqalign{ A &= \sum_{i=1}^2\;p_iB_i \;=\; \m{\alpha & \alpha & 0 & 0 \\ 0 & \beta & 0 & 0 \\ 0 & \beta & \beta & 0},\qquad &\phi = \tfrac 12\Big\|AX-Y\Big\|_F^2 \\ }$$ Note that $(\alpha,\beta)$ are the only independent variables in the entire problem.

When $A$ is unconstrained it's easy to calculate the gradient/differential of the cost $$\eqalign{ G = \p{\phi}{A} = (AX-Y)X^T \quad\implies\quad d\phi = G\b dA \\ }$$ where the bullet denotes the matrix inner product, i.e. $$\eqalign{ G\b dA &= \sum_{i=1}^3\sum_{j=1}^4 G_{ij}\;dA_{ij} \;=\; {\rm Tr}(G^TdA) \\ }$$ Because of the structure which was imposed on $A$, its differential is also structured $$dA = \sum_{i=1}^2 B_i\,dp_i$$ Substituting this expression leads to the parametric gradient $$\eqalign{ d\phi &= \sum_{i=1}^2\;G\b(B_i\,dp_i) = \sum_{i=1}^2\left(\p{\phi}{p_i}\right)dp_i \quad\implies\quad \p{\phi}{p_i} = G\b B_i \\ }$$ At this point, one would do all further calculations in terms of the $p$-vector.
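
Here is the calculation above as a quick numerical check (a NumPy sketch; the values of $X$, $Y$, and $p$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

B = [np.array([[1, 1, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]], float),
     np.array([[0, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 1, 1, 0]], float)]

X = rng.standard_normal((4, 5))
Y = rng.standard_normal((3, 5))

def make_A(p):
    # structured matrix A = sum_i p_i B_i
    return sum(pi * Bi for pi, Bi in zip(p, B))

def phi(p):
    return 0.5 * np.linalg.norm(make_A(p) @ X - Y, 'fro') ** 2

p = np.array([2.0, -3.0])                       # (alpha, beta)
G = (make_A(p) @ X - Y) @ X.T                   # unconstrained gradient

analytic = np.array([np.sum(G * Bi) for Bi in B])   # dphi/dp_i = G • B_i

eps = 1e-6
numeric = np.array([(phi(p + eps * e) - phi(p - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(analytic, numeric))           # True
```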

Now comes the weird part...

Every basis $\{B_i\}$ has a dual basis $\{B_i^\delta\}$ which spans the same subspace $\cal S$, but is biorthogonal to it with respect to the inner product $$B_i\b B_j^\delta \;=\; \delta_{ij}$$ Some bases are self-dual, such as the canonical vector basis $\{\e_i\}$, but in general determining the dual basis requires a pseudoinverse calculation $$\eqalign{ &\;b_k = {\rm vec}(B_k) \qquad &\;b_k^\delta = {\rm vec}(B_k^\delta) \\ &\m{b_1 & b_2 &\ldots & b_p}^+ = &\m{b_1^\delta & b_2^\delta &\ldots & b_p^\delta}^T \\ }$$ In the vector case, the gradient with respect to the $p$-vector can be written as the sum of each component multiplied by the corresponding vector from the dual basis, i.e. $$\eqalign{ \p{\phi}{p} &= \sum_{i=1}^2 \left(\p{\phi}{p_i}\right)\e_i \\ }$$ Many authors extend this idea and define the structured gradient as the matrix $$\eqalign{ \left(\p{\phi}{A}\right)_S &= \sum_{i=1}^2\left( \p{\phi}{p_i} \right) B_i^\delta \\ &= \sum_{i=1}^2\left(G\b B_i\right) B_i^\delta \\ &= G\b\left(\sum_{i=1}^2 B_i B_i^\delta \right) \\ &= G\b{\cal B} \\ }$$ where $\cal B$ is a fourth-order tensor (each product $B_i\,B_i^\delta$ above is an outer product) with components $${\cal B}_{jk\ell m} = \sum_{i=1}^2\;\left(B_i\right)_{jk}\,\left(B_i^\delta\right)_{\ell m}$$ The $\cal B$ tensor is a projector onto the subspace $\big(\,{\cal B}\b X\in{\cal S}\;\;{\rm for}\;X\in{\mathbb R}^{3\times 4}\big)$ where it also acts as an identity tensor for the subspace $\big({\cal B}\b M=M\b{\cal B} = M\;\;{\rm for}\;M\in{\cal S}\big)$.

If the basis spans the whole space $\,{\cal S}\equiv{\mathbb R}^{3\times 4}\,$ then $\cal B$ becomes the true identity tensor $\cal I$, and the structured gradient is identical to the full unstructured gradient $G$ (as expected). $$\eqalign{ {\cal B}_{jk\ell m} \;&\to\; {\cal I}_{jk\ell m} = \delta_{j\ell}\delta_{km} \\ (G\b{\cal B}) \;&\to\; (G\b{\cal I}) = G \\ }$$
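
For instance, flattening $\cal B$ to a matrix for the canonical (self-dual) basis of ${\mathbb R}^{2\times 2}$ recovers the identity:

```python
import numpy as np

# canonical basis of R^{2x2}: one matrix per entry, and it is self-dual
B = []
for k in range(4):
    Bk = np.zeros((2, 2))
    Bk.flat[k] = 1.0
    B.append(Bk)

M = np.column_stack([Bk.ravel() for Bk in B])
P = M @ np.linalg.pinv(M)            # flattened B tensor for a full basis
print(np.allclose(P, np.eye(4)))     # True: the identity, so G • B = G
```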


As a concrete example, let's examine a symmetrically constrained $2\times 2$ matrix. $$\eqalign{ p &= \m{\alpha \\ \beta \\ \lambda},\qquad B_1 = \m{1 & 0 \\ 0 & 0},\qquad B_2 = \m{0 & 0 \\ 0 & 1},\qquad B_3 = \m{0 & 1 \\ 1 & 0} \\ A &= \m{\alpha & \lambda \\ \lambda & \beta} \quad=\quad \alpha B_1 + \beta B_2 + \lambda B_3,\qquad B_k^\delta = \frac{B_k}{B_k\b B_k} \\ }$$ (this basis is mutually orthogonal, so each dual matrix is just a rescaled copy of the original). The structured gradient calculation then goes as follows $$\eqalign{ \left(\p{\phi}{A}\right)_S &= \frac{(G\b B_1)B_1}{B_1\b B_1} + \frac{(G\b B_2)B_2}{B_2\b B_2} + \frac{(G\b B_3)B_3}{B_3\b B_3} \\ &= G_{11}\,B_1 +G_{22}\,B_2 +\tfrac 12(G_{12}+G_{21})\,B_3 \\ &= \m{G_{11} & \tfrac 12(G_{12}+G_{21}) \\ \tfrac 12(G_{12}+G_{21}) & G_{22}} \\ &= \left(\frac{G+G^T}{2}\right) \;\doteq\; {\rm Sym}(G) \\ }$$ But The Matrix Cookbook uses the regular basis instead of the dual basis, which results in the following miscalculation $$\eqalign{ \left(\p{\phi}{A}\right)_{S^*} &= \left(G\b B_1\right)B_1 + \left(G\b B_2\right)B_2 + \left(G\b B_3\right)B_3 \\ &= G_{11}\,B_1 +G_{22}\,B_2 +(G_{12}+G_{21})\,B_3 \\ &= \m{G_{11} & (G_{12}+G_{21}) \\ (G_{12}+G_{21}) & G_{22}} \\ &= G+G^T-{\rm Diag}(G) \\ }$$
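
Here is this example checked numerically (a NumPy sketch with arbitrary random data): differentiating $\phi$ with respect to $(\alpha,\beta,\lambda)$ and mapping back through the dual basis reproduces ${\rm Sym}(G)$, not $G+G^T-{\rm Diag}(G)$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = [np.array([[1, 0], [0, 0]], float),
     np.array([[0, 0], [0, 1]], float),
     np.array([[0, 1], [1, 0]], float)]
Bdual = [Bk / np.sum(Bk * Bk) for Bk in B]      # dual basis: B_k / (B_k • B_k)

X = rng.standard_normal((2, 3))
Y = rng.standard_normal((2, 3))

def phi(p):
    A = sum(pi * Bi for pi, Bi in zip(p, B))    # A = alpha B1 + beta B2 + lambda B3
    return 0.5 * np.linalg.norm(A @ X - Y, 'fro') ** 2

p = np.array([1.0, 2.0, 0.5])                   # (alpha, beta, lambda)
A = sum(pi * Bi for pi, Bi in zip(p, B))
G = (A @ X - Y) @ X.T                           # unconstrained gradient

# finite-difference gradient w.r.t. the parameters, mapped back via the dual basis
eps = 1e-6
dphi = np.array([(phi(p + eps * e) - phi(p - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
structured = sum(g * Bd for g, Bd in zip(dphi, Bdual))

print(np.allclose(structured, 0.5 * (G + G.T)))                 # True:  Sym(G)
print(np.allclose(structured, G + G.T - np.diag(np.diag(G))))   # False (in general)
```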

The skew-symmetric case is similar but is seldom mentioned.
There is only one parameter and one matrix in the basis $$\eqalign{ p &= \m{\alpha},\qquad B = \m{0 & 1 \\ -1 & 0},\qquad B^\delta = \frac{B}{B\b B} \\ A &= \m{0 & \alpha \\ -\alpha & 0} \;\;=\;\; \alpha B \\ }$$ and the structured gradient is $$\eqalign{ \left(\p{\phi}{A}\right)_S &= \frac{(G\b B)B}{B\b B} \\ &= \tfrac 12(G_{12}-G_{21})\,B \\ &= \m{0 & \tfrac 12(G_{12}-G_{21}) \\ \tfrac 12(G_{21}-G_{12}) & 0} \\ &= \left(\frac{G-G^T}{2}\right) \;\doteq\; {\rm Skew}(G) \\ }$$ If you use $B$ instead of $B^\delta$ in this case, the gradient has the right direction but the wrong length, i.e. $$\eqalign{ \left(\p{\phi}{A}\right)_{S^*} &= \left(G-G^T\right) \;=\; 2\;{\rm Skew}(G) \\ }$$
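
And the same numerical check for the skew-symmetric case:

```python
import numpy as np

rng = np.random.default_rng(2)
B = np.array([[0, 1], [-1, 0]], float)
X = rng.standard_normal((2, 3))
Y = rng.standard_normal((2, 3))

def phi(a):
    return 0.5 * np.linalg.norm((a * B) @ X - Y, 'fro') ** 2

a = 0.7
G = ((a * B) @ X - Y) @ X.T                     # unconstrained gradient

eps = 1e-6
dphida = (phi(a + eps) - phi(a - eps)) / (2 * eps)
structured = dphida * B / np.sum(B * B)         # (G • B) B / (B • B)

print(np.allclose(structured, 0.5 * (G - G.T)))     # True: Skew(G)
```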

greg