How can I find the gradient

$$\nabla_{u} \left(x^T \left(A \, \mbox{diag}(u)\, A^T \right)^{-1} x \right)$$

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^d$, are vectors and $A \in \mathbb{R}^{n \times d}$ is a matrix?

I consulted The Matrix Cookbook but could not find a standard formula for this expression, and I also don't know how to apply a matrix chain rule to compute the derivative in multiple steps. I'd appreciate any pointers.


What I tried so far

Let $s=x^T M^{-1} x$, where $M=A \, \operatorname{diag}(u)\, A^T$. Then $$\frac{\partial s}{\partial u} = \frac{\partial s}{\partial M^{-1}} \cdot \frac{\partial M^{-1}}{\partial M} \cdot \frac{\partial M}{\partial u}$$

From The Matrix Cookbook, I get $\frac{\partial s}{\partial M^{-1}} = xx^T$ and $\frac{\partial M^{-1}}{\partial M} = -M^{-1}M^{-1}$, but I don't know how to get $\frac{\partial M}{\partial u}$, or how exactly to combine these expressions. Do I simply multiply the matrices together?

elexhobby

1 Answer

$ \def\p{\partial} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\grad#1#2{\frac{\p #1}{\p #2}} $For typing convenience, define the matrix variables $$\eqalign{ U &= U^T = \Diag{u} \quad&\implies\quad dU = \Diag{du} \\ B &= B^T = \LR{AUA^T}^{-1} \quad&\implies\quad dB = -B\LR{A\;dU\,A^T}B \\ }$$ and the Frobenius product, which is a convenient notation for the trace, i.e.
$$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{AB^T} \\ A:A &= \big\|A\big\|^2_F \\ }$$ Write the objective function using this notation. Then calculate the differential and gradient. $$\eqalign{ s &= x^TBx \\&= xx^T:B \\ ds &= xx^T:dB \\ &= -xx^T:B{A\;dU\,A^T}B \\ &= -{A^TBxx^TBA}:\Diag{du} \\ &= -\diag{A^TBxx^TBA}:{du} \\ \grad{s}{u} &= -\diag{A^TBxx^TBA} \\\\ }$$
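The closed-form gradient above can be sanity-checked numerically. The following is a minimal NumPy sketch, not part of the original answer: the function names `objective`, `gradient`, and `finite_difference` are hypothetical, and the random test data assumes $u > 0$ and $n \le d$ so that $AUA^T$ is invertible. It uses the fact that $\operatorname{diag}(A^TBxx^TBA)$ is the elementwise square of the vector $A^TBx$.

```python
import numpy as np

def objective(u, A, x):
    """s(u) = x^T (A diag(u) A^T)^{-1} x."""
    M = (A * u) @ A.T                 # A diag(u) A^T, via column scaling
    return x @ np.linalg.solve(M, x)

def gradient(u, A, x):
    """Closed form: -diag(A^T B x x^T B A) = -(A^T B x)**2, elementwise."""
    M = (A * u) @ A.T
    w = A.T @ np.linalg.solve(M, x)   # w = A^T B x, with B = M^{-1}
    return -w**2

def finite_difference(u, A, x, h=1e-6):
    """Central-difference approximation of ds/du, one coordinate at a time."""
    g = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u)
        e[i] = h
        g[i] = (objective(u + e, A, x) - objective(u - e, A, x)) / (2 * h)
    return g

# Assumed test data: positive u and n <= d keep A diag(u) A^T invertible.
rng = np.random.default_rng(0)
n, d = 3, 5
A = rng.standard_normal((n, d))
u = rng.uniform(0.5, 2.0, size=d)
x = rng.standard_normal(n)

g_exact = gradient(u, A, x)
g_fd = finite_difference(u, A, x)
print("max abs difference:", np.max(np.abs(g_exact - g_fd)))
```

If the formula is right, the printed difference should be small relative to the size of the gradient entries. Using `np.linalg.solve` instead of forming $B$ explicitly is a design choice for numerical stability, not something the derivation requires.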


The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many different ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ CA:B &= C:BA^T = A:C^TB \\ }$$ Unlike the chain rule approach, the differential approach does not require calculating higher-order tensors such as $\LR{\grad{M}{u},\;\grad{M^{-1}}{M},\;\text{etc.}}\;$ Further, the differential $dB$ obeys the same rules of algebra as the matrix $B$, making it easy to manipulate. Higher-order tensors do not.
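For comparison with the chain rule attempt in the question, the same result can be read off one component at a time, which keeps every intermediate quantity an ordinary matrix. This is a sketch using notation not in the original answer: $a_i$ denotes the $i$-th column of $A$, and $M = A\,\operatorname{diag}(u)\,A^T$ as in the question. $$\eqalign{ M &= \sum_{i=1}^d u_i\,a_i a_i^T \quad\implies\quad \frac{\partial M}{\partial u_i} = a_i a_i^T \\ \frac{\partial s}{\partial u_i} &= -x^T M^{-1}\,\frac{\partial M}{\partial u_i}\,M^{-1}x \;=\; -\left(a_i^T M^{-1} x\right)^2 \\ }$$ Stacking these $d$ scalars gives the elementwise square of $A^T M^{-1} x$ with a minus sign, which matches the diagonal of $A^T M^{-1} x x^T M^{-1} A$ entry by entry.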

greg