
I'm trying to compute the $\beta$-smoothness constant of the log-likelihood for Gaussian Discriminant Analysis with respect to the covariance matrix $\Sigma$.

We have
$$ \ell(\phi, \boldsymbol \mu_0, \boldsymbol \mu_1, \boldsymbol \Sigma) = \sum_{i=1}^m \left( y^{(i)} \log \phi + (1 - y^{(i)}) \log (1 - \phi) -\frac{1}{2} \log |\boldsymbol \Sigma| + C - \frac{1}{2}(\textbf{x}^{(i)}-\boldsymbol \mu_{y^{(i)}})^T \boldsymbol \Sigma^{-1} (\textbf{x}^{(i)} - \boldsymbol \mu_{y^{(i)}}) \right) $$
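For concreteness, here is a minimal numpy sketch of this likelihood (the function and variable names are my own, and the constant $C$ is dropped):

```python
import numpy as np

def gda_log_likelihood(X, y, phi, mu0, mu1, Sigma):
    """Evaluate the GDA log-likelihood above, omitting the constant C."""
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)        # stable log|Sigma|
    ll = 0.0
    for x_i, y_i in zip(X, y):
        w = x_i - (mu1 if y_i == 1 else mu0)    # x^{(i)} - mu_{y^{(i)}}
        ll += (y_i * np.log(phi) + (1 - y_i) * np.log(1 - phi)
               - 0.5 * logdet - 0.5 * w @ Sigma_inv @ w)
    return ll
```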

I used this StackExchange answer to deduce that $\nabla_A \log \lvert A \rvert = (A^{-1})^T$, and using that and the trace trick, I got the gradient as

$$ \nabla_\Sigma \ell = -\frac{m}{2}(\Sigma^{-1})^T + \frac{1}{2} \sum\limits_{i=1}^m \operatorname{tr}\left( (\textbf{x}^{(i)} - \boldsymbol \mu)(\textbf{x}^{(i)} - \boldsymbol \mu)^T (\Sigma^{-1})^2 \right) $$

Finally, I got the second gradient as

$$ \nabla_\Sigma^2 \ell = -\frac{m}{2}\left( (\Sigma^{-1})^2 \right)^T - \sum\limits_{i=1}^m \operatorname{tr}\left( (\textbf{x}^{(i)}-\boldsymbol \mu)(\textbf{x}^{(i)}-\boldsymbol \mu)^T (\Sigma^{-1})^3 \right) $$

And I can apply a norm and a max operator on both sides to get the $\beta$-smoothness expression. For both of these equations I used $d(X^{-1}) = -X^{-1}\,dX\,X^{-1}$, but I'm not sure whether that means $\nabla_X X^{-1} = -(X^{-1})^2$ (which is what I've used for the above two). Are my derivations correct, and if so, what's the correct way to use this identity?
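Here's a quick numerical check I ran on the differential identity itself (the setup is mine), confirming the minus sign:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well-conditioned
dX = 1e-6 * rng.standard_normal((n, n))           # small perturbation

lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)    # d(X^{-1}), to first order
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)
print(np.abs(lhs - rhs).max())                    # O(||dX||^2), ~1e-12
```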

Rahul

1 Answer


Let $w=\left(x-\mu\right)$ and keep only the terms which depend on $\Sigma$. Per sample, this leaves the function
$$\lambda = -\tfrac{1}{2}\left(ww^T:\Sigma^{-1} + \log\det(\Sigma)\right)$$
where $(:)$ denotes the Frobenius product, which is a concise notation for the trace
$$\begin{aligned}
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \operatorname{Tr}\left(A^TB\right) \\
A:A &= \|A\|^2_F \qquad \{{\rm Frobenius\ norm}\}
\end{aligned}$$
Calculate the differential and gradient, using $d\Sigma^{-1} = -\Sigma^{-1}\,d\Sigma\,\Sigma^{-1}$:
$$\begin{aligned}
d\lambda &= -\tfrac12\,ww^T:d\Sigma^{-1} - \tfrac12\,\Sigma^{-1}:d\Sigma \\
&= -\tfrac12\,ww^T:\left(-\Sigma^{-1}\,d\Sigma\,\Sigma^{-1}\right) - \tfrac12\,\Sigma^{-1}:d\Sigma \\
&= \tfrac12\left(\Sigma^{-1}ww^T\Sigma^{-1} - \Sigma^{-1}\right):d\Sigma \\
\frac{\partial\lambda}{\partial\Sigma} &= \tfrac12\,\Sigma^{-1}\left(ww^T - \Sigma\right)\Sigma^{-1} \;\doteq\; G
\end{aligned}$$
(summing $G$ over the $m$ samples recovers the full gradient $\nabla_\Sigma\ell$). Then calculate the differential and gradient of $G$
$$\begin{aligned}
dG &= \tfrac12\,d\Sigma^{-1}\left(ww^T-\Sigma\right)\Sigma^{-1} - \tfrac12\,\Sigma^{-1}\,d\Sigma\,\Sigma^{-1} + \tfrac12\,\Sigma^{-1}\left(ww^T-\Sigma\right)d\Sigma^{-1} \\
&= -\Sigma^{-1}\,d\Sigma\,G - \tfrac12\,\Sigma^{-1}\,d\Sigma\,\Sigma^{-1} - G\,d\Sigma\,\Sigma^{-1} \\
&= -\left(\Sigma^{-1}{\cal A}\,G + \tfrac12\,\Sigma^{-1}{\cal A}\,\Sigma^{-1} + G\,{\cal A}\,\Sigma^{-1}\right):d\Sigma \\
\frac{\partial G}{\partial\Sigma} &= -\left(\Sigma^{-1}{\cal A}\,G + \tfrac12\,\Sigma^{-1}{\cal A}\,\Sigma^{-1} + G\,{\cal A}\,\Sigma^{-1}\right) \;\doteq\; {\cal H}
\end{aligned}$$
where ${\cal H}$ is the Hessian and ${\cal A}$ is the fourth-order identity tensor
$$\begin{aligned}
\Sigma &= {\cal A}:\Sigma \;=\; \Sigma:{\cal A} \\
{\cal A} &= \frac{\partial\Sigma}{\partial\Sigma} \quad\implies\quad {\cal A}_{ijkl} = \frac{\partial\Sigma_{ij}}{\partial\Sigma_{kl}} = \delta_{ik}\,\delta_{jl}
\end{aligned}$$
In this derivation, the fact that $\{\Sigma,G\}$ are symmetric matrices was used in several steps. This also answers your question about the identity: the gradient of one matrix with respect to another is a fourth-order tensor, so $d(X^{-1})=-X^{-1}\,dX\,X^{-1}$ does not collapse to $\nabla_X X^{-1} = -(X^{-1})^2$; the identity should be applied at the level of differentials, as above.
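These formulas are easy to verify with finite differences. Below is a minimal numpy sketch (per-sample, with my own setup and names) that checks $d\lambda \approx G:d\Sigma$ and $dG$ against the expression derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
w = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                  # SPD covariance

def lam(S):
    """lambda = -1/2 (w^T S^{-1} w + log det S)"""
    return -0.5 * (w @ np.linalg.solve(S, w) + np.linalg.slogdet(S)[1])

def grad(S):
    """G = 1/2 S^{-1} (w w^T - S) S^{-1}"""
    Si = np.linalg.inv(S)
    return 0.5 * Si @ (np.outer(w, w) - S) @ Si

dS = rng.standard_normal((n, n))
dS = 1e-6 * (dS + dS.T)                          # symmetric perturbation

# first order: d(lambda) ~= G : dS  (two nearly equal numbers)
print(lam(Sigma + dS) - lam(Sigma), np.sum(grad(Sigma) * dS))

# second order: dG ~= -(Si dS G + 1/2 Si dS Si + G dS Si)
Si, G = np.linalg.inv(Sigma), grad(Sigma)
dG_pred = -(Si @ dS @ G + 0.5 * Si @ dS @ Si + G @ dS @ Si)
print(np.abs((grad(Sigma + dS) - grad(Sigma)) - dG_pred).max())
```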

An alternative to using tensors is to vectorize the matrices using Kronecker products
$$\begin{aligned}
dG &= -\Sigma^{-1}\,d\Sigma\,G - \tfrac12\,\Sigma^{-1}\,d\Sigma\,\Sigma^{-1} - G\,d\Sigma\,\Sigma^{-1} \\
\operatorname{vec}\left(dG\right) &= -\left(G\otimes\Sigma^{-1} + \tfrac12\,\Sigma^{-1}\otimes\Sigma^{-1} + \Sigma^{-1}\otimes G\right)\operatorname{vec}\left(d\Sigma\right) \\
\frac{\partial\operatorname{vec}(G)}{\partial\operatorname{vec}(\Sigma)} &= -\left(G\otimes\Sigma^{-1} + \tfrac12\,\Sigma^{-1}\otimes\Sigma^{-1} + \Sigma^{-1}\otimes G\right)
\end{aligned}$$
using $\operatorname{vec}\left(A\,X\,B\right)=\left(B^T\otimes A\right)\operatorname{vec}\left(X\right)$ and the symmetry of $\Sigma$ and $G$.
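In this vectorized form, a $\beta$-smoothness bound can be read off numerically: the spectral norm of the $n^2\times n^2$ Hessian matrix is a local Lipschitz constant for the (per-sample) gradient. A sketch under the same assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
w = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                  # SPD covariance

def grad(S):
    Si = np.linalg.inv(S)
    return 0.5 * Si @ (np.outer(w, w) - S) @ Si  # G

def vec(M):
    return M.flatten(order="F")                  # column-stacking vec

Si, G = np.linalg.inv(Sigma), grad(Sigma)
H = -(np.kron(G, Si) + 0.5 * np.kron(Si, Si) + np.kron(Si, G))

# check vec(dG) ~= H vec(dS) with a symmetric finite difference
dS = rng.standard_normal((n, n))
dS = 1e-6 * (dS + dS.T)
print(np.abs(vec(grad(Sigma + dS) - grad(Sigma)) - H @ vec(dS)).max())

# spectral norm of H: a local (per-sample) beta-smoothness constant
print(np.linalg.norm(H, 2))
```

Note that $\|{\cal H}\|$ blows up as $\Sigma$ approaches singularity, so taking the max over all of your domain yields a finite global $\beta$ only if the eigenvalues of $\Sigma$ are bounded away from zero.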

greg