
I am trying to learn how to replicate the matrix calculus done in the following paper: https://arxiv.org/pdf/1811.11433.pdf. To learn how to do this, I am using the following book I found (https://www.mobt3ath.com/uplode/book/book-33765.pdf), by Karim Abadir and Jan Magnus.

I attempted to start by finding the differential of the function $H$ given below. However, it does not look like I am on the right track. Can someone tell me if my calculations below are correct so far? Or at least whether I am using the right book to be able to understand the paper I listed? I noticed that the book uses the 'vec' operator to treat the Hessian of a matrix function as a matrix, while the paper uses an order-4 tensor, so I am not sure if I am using the right approach. Thanks for the help.

My work so far:

Let $H(B)=\log\det BCB^T$ where $B$ and $C$ are square matrices of dimension $n$ and $C$ is symmetric. Let $F(B)=BCB^T$ and $G(R)=\log\det R$ so that $H(B)=G(F(B))$.

\begin{align*}
dF &= (dB)CB^T + BC(dB)^T, \qquad dG(R) = \operatorname{Tr}[R^{-1}\,dR] \\[1ex]
dH &= \operatorname{Tr}\!\big[(BCB^T)^{-1}\big((dB)CB^T + BC(dB)^T\big)\big] && \text{take the transpose inside the trace} \\
&= \operatorname{Tr}\!\big[\big(BC(dB)^T + (dB)CB^T\big)(BCB^T)^{-1}\big] \\
&= \operatorname{Tr}\!\big[BC(dB)^T(BCB^T)^{-1}\big] + \operatorname{Tr}\!\big[(dB)CB^T(BCB^T)^{-1}\big] \\
&= \operatorname{Tr}\!\big[BC(dB)^T(B^T)^{-1}C^{-1}B^{-1}\big] + \operatorname{Tr}\!\big[(dB)CB^T(B^T)^{-1}C^{-1}B^{-1}\big] && \text{assuming $B$, $C$ invertible; use the cyclic property} \\
&= \operatorname{Tr}\!\big[(B^T)^{-1}(dB)^T\big] + \operatorname{Tr}\!\big[B^{-1}\,dB\big] = 2\operatorname{Tr}\!\big[B^{-1}\,dB\big]
\end{align*}

The corresponding total derivative is then $DH = 2\,(\operatorname{vec}(B^{-T}))^T$ by the book's notation (from $dH=\operatorname{Tr}[A\,dB]$ the book identifies $DH=(\operatorname{vec}A^T)^T$, here with $A=B^{-1}$). Then I assume I would just 'unvectorize' this to get the derivative in the paper's notation? Is this a good start to calculating the gradient of the loss function in the paper I listed? Thanks.
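A quick finite-difference sanity check of the result $dH = 2\operatorname{Tr}[B^{-1}\,dB]$ (a minimal NumPy sketch; the random test matrices are my own choices, with $C$ symmetric positive definite so that $\det BCB^T > 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Hypothetical test matrices: B invertible (almost surely), C symmetric
# positive definite, so det(B C B^T) > 0 and H is well defined.
B = rng.standard_normal((n, n))
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)

def H(B):
    # log det(B C B^T), computed stably via slogdet
    return np.linalg.slogdet(B @ C @ B.T)[1]

# Central finite difference of H along a random direction dB ...
dB = rng.standard_normal((n, n))
eps = 1e-6
fd = (H(B + eps * dB) - H(B - eps * dB)) / (2 * eps)

# ... compared with the closed form dH = 2 Tr[B^{-1} dB]
closed = 2.0 * np.trace(np.linalg.solve(B, dB))

print(fd, closed)  # the two values should agree to ~8 digits
```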


1 Answer


First, calculate the gradient for the full matrix.
$$\eqalign{ X &= BCB^T = X^T \\ \phi &= \log\det X \\ d\phi &= X^{-T}:dX \\ &= X^{-1}:2\operatorname{sym}(dB\,CB^T) \\ &= 2X^{-1}BC:dB \\ \frac{\partial\phi}{\partial B} &= 2X^{-1}BC \\ }$$
Repeat the calculation for the diagonalized matrix.
$$\eqalign{ Y &= (I\odot X) = Y^T \\ \psi &= \log\det(Y) \\ d\psi &= 2Y^{-1}BC:dB \\ \frac{\partial\psi}{\partial B} &= 2Y^{-1}BC \\ }$$
The Pham cost function is a linear combination of these functions.
$$\eqalign{ {\cal L} &= \frac{\psi - \phi}{2} \\ \frac{\partial{\cal L}}{\partial B} &= \Big(Y^{-1}-X^{-1}\Big)BC \;\doteq\; G_{std} \qquad&\big({\rm standard\;gradient}\big) \\ }$$
However, rather than the standard gradient, the linked paper utilizes the relative gradient, which is defined in terms of a small perturbation matrix $(E)$.
$$\eqalign{ d{\cal L} &= {\cal L}(B+EB) - {\cal L}(B) \\ &= G_{std}:EB \\ &= G_{std}B^T:E \\ &= G:E \\ \\ G &= \Big(Y^{-1}-X^{-1}\Big)BCB^T \\ &= \Big(Y^{-1}-X^{-1}\Big)X \\ &= Y^{-1}X-I \\ }$$
This is the content of the first part of Eq. (3) on the second page of the paper, except there it is written in component form, i.e.
$$G_{ab} = \frac{X_{ab}}{X_{aa}} - \delta_{ab}$$
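As a sanity check, here is a minimal NumPy sketch of the relative-gradient formula (the random test matrices are assumptions; $C$ is taken symmetric positive definite so that $X$ and $Y$ are invertible):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Hypothetical test data: B invertible, C symmetric positive definite
B = rng.standard_normal((n, n))
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)

def pham_loss(B):
    # L = (psi - phi)/2 with phi = log det X, psi = log det(I o X)
    X = B @ C @ B.T
    phi = np.linalg.slogdet(X)[1]
    psi = np.sum(np.log(np.diag(X)))  # X is SPD, so diag(X) > 0
    return 0.5 * (psi - phi)

# Relative gradient G = Y^{-1} X - I, i.e. G_ab = X_ab / X_aa - delta_ab
X = B @ C @ B.T
G = X / np.diag(X)[:, None] - np.eye(n)

# First-order check: L(B + E B) - L(B) ~ G:E for a small perturbation E
E = 1e-6 * rng.standard_normal((n, n))
print(pham_loss(B + E @ B) - pham_loss(B))  # finite difference
print(np.sum(G * E))                        # Frobenius product G:E
```

The finite difference and $G{:}E$ agree to first order in $E$.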


NB:   The paper uses bra-ket notation for the Frobenius product, whereas I use a colon, e.g. $$A:B = \langle A|B\rangle = {\rm Tr}(A^TB)$$ because it's a lot easier to type (and it looks better).
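For concreteness, the Frobenius product is just an elementwise multiply-and-sum; a one-line NumPy check (the random test matrices are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))

# A:B = Tr(A^T B) is the same as multiplying elementwise and summing
print(np.trace(A.T @ B), np.sum(A * B))  # identical (up to rounding)
```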

The Kronecker-vec operation can flatten a matrix expression into a vector
$${\rm vec}(AXB)=(B^T\otimes A)\,{\rm vec}(X) \;=\; Mx$$
Using the vec operation, a gradient matrix can be flattened to a long vector
$$\eqalign{ \frac{\partial\phi}{\partial X} &= G \quad&({\rm matrix}) \\ d\phi &= G:dX \\ &= {\rm vec}(G):{\rm vec}(dX) \\ &= g:dx \\ \frac{\partial\phi}{\partial x} &= g \quad&({\rm vector}) \\ \\ G,X &\in{\mathbb R}^{m\times n} \\ g,x &\in {\mathbb R}^{mn\times 1} \\ }$$
Similarly, a 4th-order Hessian tensor can be flattened into a large matrix
$$\eqalign{ {\cal H} &= \frac{\partial G}{\partial X} \in{\mathbb R}^{m\times n\times m\times n} \quad&({\rm tensor}) \\ H &= \frac{\partial g}{\partial x} \in {\mathbb R}^{mn\times mn} \quad&({\rm matrix}) \\ }$$
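A small NumPy sketch of the flattening identity, using the fact that $\operatorname{vec}$ stacks columns, i.e. column-major (Fortran) order (the dimensions and test matrices are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, q = 2, 3, 4, 5
A = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, q))

# vec() stacks columns: column-major (Fortran) order in NumPy
vec = lambda M: M.reshape(-1, order="F")

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))  # True
```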
  • Thank you! This is very helpful. I would like to now attempt computing the Hessian given in the paper. However, I noticed the paper gives the Hessian as an order-4 tensor, but the book I mentioned above treats a matrix function's Hessian as a bigger matrix. Since you are familiar with matrix calculus, could you recommend a book to learn a more standard approach to matrix calculus, or at least tell me what to search for to find one? Your approach seems to have some noticeable differences from the book's formalism. – xedg Jul 23 '20 at 17:52
  • Also, could you mention what your 'sym' operator means? – xedg Jul 23 '20 at 17:54
  • $${\rm sym}(X) = \frac{X+X^T}{2}$$ – greg Jul 23 '20 at 18:25
  • @xedg My approach corresponds to Chapter 13 of the book. In my experience, matrix calculus is best done using differentials. – greg Jul 24 '20 at 16:57