I'm trying to compute the $\beta$-smoothness constant of the log-likelihood of Gaussian Discriminant Analysis with respect to the covariance matrix $\boldsymbol \Sigma$.
We have
$$
\ell(\phi, \boldsymbol \mu_0, \boldsymbol \mu_1, \boldsymbol \Sigma) = \sum_{i=1}^m \left( y^{(i)} \log \phi + (1 - y^{(i)}) \log (1 - \phi) -\frac{1}{2} \log |\boldsymbol \Sigma| - \frac{1}{2} (\textbf{x}^{(i)}-\boldsymbol \mu_{y^{(i)}})^T \boldsymbol \Sigma^{-1} (\textbf{x}^{(i)} - \boldsymbol \mu_{y^{(i)}}) + C \right)
$$
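To have something concrete to differentiate against, here's a minimal NumPy sketch of this log-likelihood on synthetic data (the data and all names here are my own invention, and I've dropped the additive constant $C$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))                  # rows are x^{(i)}
y = rng.integers(0, 2, size=m)               # labels y^{(i)}
phi = 0.5
mu = np.stack([X[y == 0].mean(axis=0),       # mu_0
               X[y == 1].mean(axis=0)])      # mu_1

def log_likelihood(Sigma):
    """l(phi, mu_0, mu_1, Sigma) as above, without the constant C."""
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)     # stable log|Sigma|
    D = X - mu[y]                            # rows are x^{(i)} - mu_{y^{(i)}}
    quad = np.einsum('ij,jk,ik->', D, Sinv, D)   # sum_i (x-mu)^T Sinv (x-mu)
    bern = np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))
    return bern - 0.5 * m * logdet - 0.5 * quad
```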
I used this StackExchange answer to deduce that $\nabla_A \log \lvert A \rvert = (A^{-1})^T$, and using that together with the trace trick $(\textbf{x}-\boldsymbol \mu)^T \boldsymbol \Sigma^{-1} (\textbf{x}-\boldsymbol \mu) = \mathrm{tr}\left( \boldsymbol \Sigma^{-1} (\textbf{x}-\boldsymbol \mu)(\textbf{x}-\boldsymbol \mu)^T \right)$, I got the gradient as
$$ \nabla_{\boldsymbol \Sigma} \ell = -\frac{m}{2}(\boldsymbol \Sigma^{-1})^T + \frac{1}{2} \sum\limits_{i=1}^m \mathrm{tr}\left( (\textbf{x}^{(i)} - \boldsymbol \mu_{y^{(i)}})(\textbf{x}^{(i)} - \boldsymbol \mu_{y^{(i)}})^T (\boldsymbol \Sigma^{-1})^2 \right) $$
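To sanity-check this expression, here is the finite-difference test I'd run, continuing the snippet above. (I've dropped the $\mathrm{tr}$ so that the candidate is matrix-valued, since a gradient with respect to a matrix must itself be a matrix; and the perturbations below treat the entries of $\boldsymbol \Sigma$ as unconstrained rather than symmetry-restricted.)

```python
def grad_candidate(Sigma):
    """My closed form: -(m/2) Sigma^{-1} + (1/2) sum_i S_i (Sigma^{-1})^2,
    with S_i = (x - mu)(x - mu)^T, read off from the expression above."""
    Sinv = np.linalg.inv(Sigma)
    D = X - mu[y]
    S = D.T @ D                              # sum_i S_i
    return -0.5 * m * Sinv + 0.5 * S @ Sinv @ Sinv

def grad_numeric(Sigma, eps=1e-6):
    """Central-difference estimate of d l / d Sigma_{jk}, entry by entry."""
    G = np.zeros_like(Sigma)
    for j in range(n):
        for k in range(n):
            E = np.zeros_like(Sigma)
            E[j, k] = eps
            G[j, k] = (log_likelihood(Sigma + E) - log_likelihood(Sigma - E)) / (2 * eps)
    return G

Sigma0 = np.cov(X.T) + 0.1 * np.eye(n)       # some SPD point to test at
print(np.abs(grad_candidate(Sigma0) - grad_numeric(Sigma0)).max())
```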
Finally, I got the second gradient as
$$ \nabla_{\boldsymbol \Sigma}^2 \ell = -\frac{m}{2}\left( (\boldsymbol \Sigma^{-1})^2 \right)^T - \sum\limits_{i=1}^m \mathrm{tr}\left( (\textbf{x}^{(i)}-\boldsymbol \mu_{y^{(i)}})(\textbf{x}^{(i)}-\boldsymbol \mu_{y^{(i)}})^T (\boldsymbol \Sigma^{-1})^3 \right) $$
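Since $\nabla_{\boldsymbol \Sigma}^2 \ell$ is really a fourth-order object, the only direct numerical probe I know of is along a fixed direction $V$: the scalar below (again continuing the snippet above) is what any closed-form second derivative has to reproduce after contracting twice with $V$.

```python
def second_dir_deriv(Sigma, V, h=1e-4):
    """Central second difference of t -> l(Sigma + t V), i.e. the second
    directional derivative of the log-likelihood along V."""
    return (log_likelihood(Sigma + h * V)
            - 2.0 * log_likelihood(Sigma)
            + log_likelihood(Sigma - h * V)) / h**2

V = rng.normal(size=(n, n))
V = (V + V.T) / 2                            # symmetric perturbation direction
print(second_dir_deriv(Sigma0, V))
```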
Taking a norm of both sides and then maximizing gives me the $\beta$-smoothness expression. For both of these equations I used the identity $d(X^{-1}) = -X^{-1}\, dX\, X^{-1}$, but I'm not sure whether that means $\nabla_X X^{-1} = -(X^{-1})^2$ (a commuted version of which is what I've used for the two results above). Are my derivations correct, and if so, what's the correct way to use this identity?
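For what it's worth, the identity itself is easy to probe numerically. This snippet (same setup as before) compares the first-order change in $X^{-1}$ against both $-X^{-1}\, dX\, X^{-1}$ and the commuted $-(X^{-1})^2\, dX$ form, which is exactly the substitution I'm unsure about:

```python
A = rng.normal(size=(n, n)) + n * np.eye(n)  # a well-conditioned test matrix
E = 1e-6 * rng.normal(size=(n, n))           # a small perturbation dX
Ainv = np.linalg.inv(A)
dAinv = np.linalg.inv(A + E) - Ainv          # actual change in X^{-1}

# d(X^{-1}) = -X^{-1} dX X^{-1}: residual should be O(|E|^2)
print(np.abs(dAinv - (-Ainv @ E @ Ainv)).max())
# commuted form -(X^{-1})^2 dX: only agrees when the factors commute
print(np.abs(dAinv - (-Ainv @ Ainv @ E)).max())
```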