
I would like to differentiate the (squared) Mahalanobis distance:

$$D(\textbf{x}, \boldsymbol \mu, \Sigma) = (\textbf{x}-\boldsymbol \mu)^T\Sigma^{-1}(\textbf{x}-\boldsymbol \mu)$$

where $\textbf{x} = (x_1, ..., x_n) \in \mathbb R^n, \;\boldsymbol \mu = (\mu_1, ..., \mu_n) \in \mathbb R^n$ and $$\Sigma = \left( \begin{array}{ccc} E[(X_1-\mu_1)(X_1-\mu_1)] & \cdots & E[(X_1-\mu_1)(X_n-\mu_n)] \\ \vdots & \ddots & \vdots \\ E[(X_n-\mu_n)(X_1-\mu_1)] & \cdots & E[(X_n-\mu_n)(X_n-\mu_n)] \end{array} \right)$$

is the covariance matrix. I want to differentiate $D$ with respect to $\boldsymbol\mu$ and $\Sigma$. Can someone show me how to do this? In other words, how do I calculate:

$$\frac{\partial D}{\partial \boldsymbol \mu} \;\;\text{and}\;\;\frac{\partial D}{\partial \Sigma}\,?$$ Thanks for any help!

I got the motivation for my question from this source (page 13, EM-algorithm):

http://ptgmedia.pearsoncmg.com/images/0131478249/samplechapter/0131478249_ch03.pdf

jjepsuomi

2 Answers


For convenience, define the variables $$\eqalign{ \boldsymbol{z} &= \boldsymbol{x-\mu} \cr \boldsymbol{B} &= \boldsymbol{\Sigma}^{-1} \cr } $$

and note their differentials $$\eqalign{ \boldsymbol{dz} &= \boldsymbol{dx = -d\mu} \cr \boldsymbol{dB} &= \boldsymbol{-B \cdot dB^{-1} \cdot B} \cr &= \boldsymbol{-B \cdot d\Sigma \cdot B} \cr } $$

Next, re-cast your objective function (taking advantage of the symmetry of $\boldsymbol B$) in terms of these variables, where $\boldsymbol{zz}$ denotes the outer product $\boldsymbol{z}\boldsymbol{z}^T$ and $:$ denotes the Frobenius product $\boldsymbol{A:B} = \mathrm{tr}(\boldsymbol{A}\boldsymbol{B}^T)$ $$\eqalign{ D &= \boldsymbol{B:zz} \cr dD &= \boldsymbol{dB:zz + 2B:z\,dz} \cr &= \boldsymbol{zz:dB + 2(B\cdot z)\cdot dz} \cr } $$

and take derivatives $$\eqalign{ \frac{\partial D}{\partial \boldsymbol z} &= \boldsymbol{0 + 2(B\cdot z)} \cr \cr \frac{\partial D}{\partial \boldsymbol B} &= \boldsymbol{zz + 0} \cr } $$

Now use the chain rule to revert to the original variables.

For $\boldsymbol\mu$ we have $$\eqalign{ dD &= \frac{\partial D}{\partial \boldsymbol z}\cdot \boldsymbol{dz} \cr &= \boldsymbol{2(B\cdot z)\cdot (-d\mu)} \cr \cr \frac{\partial D}{\partial \boldsymbol \mu} &= \boldsymbol{-2(B\cdot z)} \cr &= \boldsymbol{-2\Sigma^{-1}\cdot (x-\mu)} \cr } $$

And for $\boldsymbol\Sigma$ $$\eqalign{ dD &= \frac{\partial D}{\partial \boldsymbol B}: \boldsymbol{dB} \cr &= \boldsymbol{zz:(-B\cdot d\Sigma\cdot B)} \cr &= \boldsymbol{(-B\cdot zz\cdot B):(d\Sigma)} \cr \cr \frac{\partial D}{\partial \boldsymbol \Sigma} &= \boldsymbol{-B\cdot zz\cdot B} \cr &= \boldsymbol{-\Sigma^{-1}\cdot (x-\mu)(x-\mu)^T\cdot \Sigma^{-1}} \cr } $$
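As a sanity check, the two closed-form gradients can be compared against central finite differences. The following is only a minimal NumPy sketch: the helper names (`mahalanobis_sq`, `grad_mu`, `grad_sigma`, `numeric_grad`) and the random test data are made up for illustration, and the entries of $\Sigma$ are perturbed independently (the symmetry constraint is ignored during differencing).

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """D = (x - mu)^T Sigma^{-1} (x - mu)."""
    z = x - mu
    return float(z @ np.linalg.solve(sigma, z))

def grad_mu(x, mu, sigma):
    """Closed form: dD/dmu = -2 Sigma^{-1} (x - mu)."""
    return -2.0 * np.linalg.solve(sigma, x - mu)

def grad_sigma(x, mu, sigma):
    """Closed form: dD/dSigma = -Sigma^{-1} z z^T Sigma^{-1} (Sigma symmetric)."""
    bz = np.linalg.solve(sigma, x - mu)  # B z with B = Sigma^{-1}
    return -np.outer(bz, bz)             # (B z)(B z)^T = B z z^T B for symmetric B

def numeric_grad(f, a, eps=1e-6):
    """Central finite differences, one entry of `a` at a time."""
    g = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        step = np.zeros_like(a)
        step[idx] = eps
        g[idx] = (f(a + step) - f(a - step)) / (2 * eps)
    return g

# Hypothetical test data: random x, mu and a symmetric positive definite Sigma.
rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
mu = rng.normal(size=n)
m = rng.normal(size=(n, n))
sigma = m @ m.T + n * np.eye(n)

num_mu = numeric_grad(lambda v: mahalanobis_sq(x, v, sigma), mu)
num_sigma = numeric_grad(lambda s: mahalanobis_sq(x, mu, s), sigma)

print(np.allclose(num_mu, grad_mu(x, mu, sigma), atol=1e-6))
print(np.allclose(num_sigma, grad_sigma(x, mu, sigma), atol=1e-6))
```

Using `np.linalg.solve` instead of forming $\Sigma^{-1}$ explicitly is a numerical convenience here, not part of the derivation.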

lynne
  • 1
    +1 Thank you for your help =) Could you perhaps elaborate more on why $dz = dx = -d\mu$ and $dB = -B\cdot d\Sigma\cdot B$? Also what does $:$ mean? =) is it division? – jjepsuomi Jan 01 '14 at 21:05
  • 2
    $\boldsymbol{A:B}$ in index notation is $A_{ik}B_{ik}$; it can also be expressed as $\mathrm{tr}(\boldsymbol{A\cdot B^T})$. Given the change-of-variable $\boldsymbol{z \equiv x - y}$, if you hold $\boldsymbol{y}$ constant and allow $\boldsymbol{x}$ to vary, then $\boldsymbol{dz = dx}$. If you take $\boldsymbol{x}$ as constant, then $\boldsymbol{dz = -dy}$. The formula for $\boldsymbol{dB = d\Sigma^{-1}}$ is well known; it can be derived by differentiating the equation $\boldsymbol{\Sigma\cdot \Sigma^{-1} = I}$. – lynne Jan 02 '14 at 00:23
  • 1
    +1 @lynne Thank you very much =) One last thing to wrap this up, could you give me a link etc. to a source where I can study about this formula $dB = d\Sigma^{-1}$ =) – jjepsuomi Jan 02 '14 at 05:58
  • 2
    http://en.wikipedia.org/wiki/Invertible_matrix#Derivative_of_the_matrix_inverse – lynne Jan 02 '14 at 06:27
  • 1
    @lynne I think that if you expand your answer with full details, it could be a very useful resource (and will be heavily upvoted). – MadHatter Jun 03 '17 at 14:53
  • 1
    @MadHatter agreed, I'm not familiar with : identities, and there are a lot of details here which are not clear to me. Good starting point though. – Jake Levi Oct 28 '21 at 16:24
  • 1
    Note that the step from $zz:(-B\cdot d\Sigma\cdot B) = (-B\cdot zz\cdot B):d\Sigma$ follows from the cyclic property of the trace and the associativity of matrix multiplication (note that $A:B$ is defined as $\mathrm{Tr}(A\cdot B^T)$). The statement $dD = \frac{\partial D}{\partial A}:dA$ (where $D$ is a scalar and $A$ is a matrix) follows from the total differential of $D$ in terms of every element of the matrix $A$. – Jake Levi Oct 28 '21 at 17:31
  • 1
    Following this Mathematics Stack Exchange answer, we can derive $d(A^{-1}) = -(A^{-1})\cdot dA\cdot(A^{-1})$ as follows: if the matrix $A$ is perturbed by a small matrix $dA$, the corresponding change in $A^{-1}$ can be expressed as $(A + dA)^{-1} - A^{-1} = ((A + dA)^{-1} \cdot A \cdot A^{-1}) - ((A + dA)^{-1}\cdot (A + dA) \cdot A^{-1}) = (A + dA)^{-1} \cdot (A - (A + dA)) \cdot A^{-1} = -(A + dA)^{-1} \cdot dA \cdot A^{-1} \approx -A^{-1} \cdot dA \cdot A^{-1}$ (the final step holds to first order in $dA$, because $dA$ is small) – Jake Levi Oct 28 '21 at 17:54

We have:

$$\eqalign{ D&=(x-\mu)^T\Sigma^{-1}(x-\mu) \cr \Sigma&=\Sigma^T \cr D&\in\mathbb{R} \cr x,\mu&\in\mathbb{R}^N \cr \Sigma&\in\mathbb{R}^{N\times N} \cr } $$

We want to find $\frac{\partial D}{\partial \mu}\in\mathbb{R}^N$ and $\frac{\partial D}{\partial \Sigma}\in\mathbb{R}^{N\times N}$, with:

$$\eqalign{ \left(\frac{\partial D}{\partial \mu}\right)_i&=\frac{\partial D}{\partial \mu_i} \cr \left(\frac{\partial D}{\partial \Sigma}\right)_{ij}&=\frac{\partial D}{\partial \Sigma_{ij}} \cr } $$

We'll start by finding $\frac{\partial D}{\partial \mu}$, by first noting that we can express $\frac{\partial D}{\partial \mu}$ using the total differential of a small change $dD$ in $D$ with respect to a small change $d\mu$ in the vector $\mu$ as follows (I'll abuse notation slightly and write $D(\mu)$ as a function of $\mu$):

$$\eqalign{ D(\mu) &= (x-\mu)^T\Sigma^{-1}(x-\mu) \cr D(\mu + d\mu) - D(\mu) &= \sum_{i=1}^N\left[ \frac{\partial D}{\partial \mu_i} d\mu_i \right] \cr &= \left( \frac{\partial D}{\partial \mu} \right)^T d\mu \cr &= (x-\mu - d\mu)^T\Sigma^{-1}(x-\mu - d\mu) - (x-\mu)^T\Sigma^{-1}(x-\mu) \cr &= (x-\mu)^T\Sigma^{-1}(x-\mu) - (x-\mu)^T\Sigma^{-1}d\mu \cr &\quad - d\mu^T\Sigma^{-1}(x-\mu) + d\mu^T\Sigma^{-1}d\mu \cr &\quad - (x-\mu)^T\Sigma^{-1}(x-\mu) \cr &= -2 (x-\mu)^T\Sigma^{-1}d\mu \cr &= 2 \Bigl(\Sigma^{-1} (\mu - x) \Bigr)^Td\mu \cr \Rightarrow \quad \frac{\partial D}{\partial \mu} &= 2 \Sigma^{-1} (\mu - x) }$$

The step to $-2 (x-\mu)^T\Sigma^{-1}d\mu$ drops the second-order term $d\mu^T\Sigma^{-1}d\mu$ and uses the symmetry of $\Sigma^{-1}$ to combine the two cross terms.
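Since $D$ is quadratic in $\mu$, the expansion above is exact once the second-order term is kept, which makes it easy to check numerically. A small NumPy sketch follows; the random data and the variable names (`first_order`, `second_order`, and so on) are assumptions made up for this illustration.

```python
import numpy as np

# Hypothetical test data: random x, mu and a symmetric positive definite Sigma.
rng = np.random.default_rng(1)
n = 3
x = rng.normal(size=n)
mu = rng.normal(size=n)
m = rng.normal(size=(n, n))
sigma = m @ m.T + n * np.eye(n)
sigma_inv = np.linalg.inv(sigma)

def D(mu_):
    """D as a function of mu only, with x and Sigma held fixed."""
    z = x - mu_
    return z @ sigma_inv @ z

d_mu = 1e-3 * rng.normal(size=n)                   # small perturbation of mu
change = D(mu + d_mu) - D(mu)                      # exact change in D
first_order = -2.0 * (x - mu) @ sigma_inv @ d_mu   # (dD/dmu)^T d_mu
second_order = d_mu @ sigma_inv @ d_mu             # term dropped in the derivation

# The expansion is exact when both terms are kept ...
print(np.isclose(change, first_order + second_order, rtol=0.0, atol=1e-12))
# ... and the dropped term is much smaller than the retained one.
print(abs(second_order), abs(first_order))
```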

To find $\frac{\partial D}{\partial \Sigma}$, we'll similarly note that the partial derivative $\frac{\partial D}{\partial A}$ of the scalar $D$ with respect to any matrix $A$ can be expressed in terms of the total differential of a small change $dD$ in $D$ with respect to a small change $dA$ in the matrix $A$, which we can express in terms of the trace operator $\mathrm{Tr}(A) = \sum_{i}{\Bigl[A_{ii}\Bigr]}$ as follows:

$$\eqalign{ dD&=\sum_{i,j}{\left[\frac{\partial D}{\partial A_{ij}}dA_{ij}\right]} \cr &=\sum_{i,j}{\left[\left(\frac{\partial D}{\partial A}\right)_{ij}dA_{ij}\right]} \cr &=\sum_{i,j}{\left[\left(\frac{\partial D}{\partial A}\right)^T_{ji}dA_{ij}\right]} \cr &=\sum_{j}{\left[\left(\frac{\partial D}{\partial A}^TdA\right)_{jj}\right]} \cr &=\mathrm{Tr}\left(\frac{\partial D}{\partial A}^TdA\right) \cr } $$

We'll now introduce the notation $A:B=\mathrm{Tr}(A^TB)$ (note that the $:$ operator can be thought of like a dot-product between two matrices, which takes two matrices as input and returns a scalar, by multiplying the two matrices element-wise and summing all the elements), and note the following identities of the trace operator:

$$\eqalign{ \mathrm{Tr}(AB) &= \sum_{i}{\left[(AB)_{ii}\right]}=\sum_{i,j}{\left[(A_{ij}B_{ji})\right]}=\sum_{i,j}{\left[B_{ji}A_{ij}\right]}=\sum_{j}{\left[(BA)_{jj}\right]}=\mathrm{Tr}(BA) \\ \mathrm{Tr}(A) &= \sum_{i}{\left[A_{ii}\right]} = \sum_{i}{\left[A^T_{ii}\right]} = \mathrm{Tr}(A^T) \\ \mathrm{Tr}(A^T B^T) &= \mathrm{Tr}((BA)^T) = \mathrm{Tr}(BA) = \mathrm{Tr}(AB) }$$

Therefore we have, for any matrix $A$:

$$\eqalign{ dD &= \frac{\partial D}{\partial A}:dA \\ } $$
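The trace identities and the $:$ product used above can be spot-checked on random matrices. This is just an illustrative NumPy sketch; the helper `frob` and the dictionary keys are names invented here, and the last check is the "move the factors across the $:$" step that the first answer relies on.

```python
import numpy as np

def frob(a, b):
    """Frobenius product A : B = Tr(A^T B)."""
    return np.trace(a.T @ b)

# Hypothetical random square matrices.
rng = np.random.default_rng(2)
a = rng.normal(size=(4, 4))
b = rng.normal(size=(4, 4))
c = rng.normal(size=(4, 4))

checks = {
    # A : B equals the sum of element-wise products.
    "elementwise": np.isclose(frob(a, b), np.sum(a * b)),
    # Tr(AB) = Tr(BA)
    "cyclic": np.isclose(np.trace(a @ b), np.trace(b @ a)),
    # Tr(A) = Tr(A^T)
    "transpose": np.isclose(np.trace(a), np.trace(a.T)),
    # Tr(A^T B^T) = Tr(AB)
    "both_transposed": np.isclose(np.trace(a.T @ b.T), np.trace(a @ b)),
    # A : (B C B) = (B^T A B^T) : C
    "move_factors": np.isclose(frob(a, b @ c @ b), frob(b.T @ a @ b.T, c)),
}

for name, ok in checks.items():
    print(name, bool(ok))
```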

Returning to $\frac{\partial D}{\partial \Sigma}$ (again I'll abuse notation slightly and write $D(\Sigma)$ as a function of $\Sigma$, and also introduce $z = x - \mu$ for brevity):

$$\eqalign{ D(\Sigma) &= z^T\Sigma^{-1}z \\ D(\Sigma + d\Sigma) - D(\Sigma) &= \frac{\partial D}{\partial \Sigma}:d\Sigma \\ &= z^T (\Sigma + d\Sigma)^{-1} z - z^T \Sigma^{-1} z \\ &= z^T \Bigl((\Sigma + d\Sigma)^{-1} - \Sigma^{-1} \Bigr) z \\ &= z^T \Bigl((\Sigma + d\Sigma)^{-1} \cdot \Sigma \cdot \Sigma^{-1} - (\Sigma + d\Sigma)^{-1} \cdot (\Sigma + d\Sigma) \cdot \Sigma^{-1} \Bigr) z \\ &= z^T (\Sigma + d\Sigma)^{-1} \cdot \Bigl(\Sigma - \Sigma - d\Sigma\Bigr) \cdot \Sigma^{-1} z \\ &= - z^T (\Sigma + d\Sigma)^{-1} \cdot d\Sigma \cdot \Sigma^{-1} z \\ &= - z^T \Sigma^{-1} \cdot d\Sigma \cdot \Sigma^{-1} z \\ &= \mathrm{Tr}\Bigl( - z^T \Sigma^{-1} \cdot d\Sigma \cdot \Sigma^{-1} z \Bigr) \\ &= \mathrm{Tr}\Bigl( - z z^T \Sigma^{-1} \cdot d\Sigma \cdot \Sigma^{-1} \Bigr) \\ &= \mathrm{Tr}\Bigl( - \Sigma^{-1} z z^T \Sigma^{-1} \cdot d\Sigma \Bigr) \\ &= - \Sigma^{-1} z z^T \Sigma^{-1} : d\Sigma \\ \Rightarrow \quad \frac{\partial D}{\partial \Sigma} &= - \Sigma^{-1} z z^T \Sigma^{-1} } $$
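The derivation replaces $(\Sigma + d\Sigma)^{-1}$ by $\Sigma^{-1}$, so the result is a first-order statement: the gap between the exact change in $D$ and $\frac{\partial D}{\partial \Sigma}:d\Sigma$ should shrink roughly like the square of the perturbation size. A rough NumPy sketch of that check, assuming random test data and a symmetric perturbation direction (all names here are invented for the example):

```python
import numpy as np

# Hypothetical test data: random z and a symmetric positive definite Sigma.
rng = np.random.default_rng(3)
n = 4
z = rng.normal(size=n)
m = rng.normal(size=(n, n))
sigma = m @ m.T + n * np.eye(n)
sigma_inv = np.linalg.inv(sigma)

# Closed-form gradient from the derivation: -Sigma^{-1} z z^T Sigma^{-1}
grad = -sigma_inv @ np.outer(z, z) @ sigma_inv

# Symmetric perturbation direction, so Sigma + dSigma stays symmetric.
p = rng.normal(size=(n, n))
direction = (p + p.T) / 2

def D(s):
    """D as a function of Sigma only, with z held fixed."""
    return z @ np.linalg.solve(s, z)

errors = []
for eps in (1e-2, 1e-3, 1e-4):
    d_sigma = eps * direction
    exact = D(sigma + d_sigma) - D(sigma)   # true change in D
    linear = np.sum(grad * d_sigma)         # (dD/dSigma) : dSigma
    errors.append(abs(exact - linear))

# Each tenfold reduction in eps should cut the error by about a factor of 100.
print(errors)
```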

Jake Levi