
According to Appendix A.4.1 of Boyd & Vandenberghe's Convex Optimization, the gradient of $f(X):=\log \det X$ is

$$\nabla f(X) = X^{-1}$$

The domain of $f$ here is the set of symmetric matrices $\mathbf S^n$. However, according to the book "Matrix Algebra from a Statistician's Perspective" by D. Harville, the gradient of $\log \det X$ for a symmetric $X$ must be (see eq. 8.12 of the book)

$$\nabla \log \det X = 2 X^{-1} - \text{diag} (y_{11}, y_{22}, \dots, y_{nn})$$

where $y_{ii}$ denotes the $i$th diagonal element of $X^{-1}$. Now, I'm not a mathematician, but to me Harville's formula seems correct, because he makes use of the fact that the entries of $X$ are not "independent". Indeed, for the case where the entries are "independent", Harville provides another formula (eq. 8.8 of his book), which matches that of Boyd & Vandenberghe.
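To make the contrast concrete, here is a quick finite-difference check (my own sketch in NumPy, not from either book): if the $n(n+1)/2$ entries $x_{ij}$, $i \le j$, are treated as the independent variables (so that perturbing an off-diagonal entry moves both the $(i,j)$ and $(j,i)$ slots), the entrywise partial derivatives reproduce Harville's formula rather than $X^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)          # random symmetric positive definite X
Xinv = np.linalg.inv(X)
f = lambda M: np.log(np.linalg.det(M))

# Central differences w.r.t. x_ij for i <= j, keeping X symmetric throughout
h = 1e-6
G = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0      # off-diagonal perturbations hit two entries at once
        G[i, j] = G[j, i] = (f(X + h * E) - f(X - h * E)) / (2 * h)

harville = 2 * Xinv - np.diag(np.diag(Xinv))
print(np.allclose(G, harville, atol=1e-6))   # True
print(np.allclose(G, Xinv, atol=1e-6))       # False: off-diagonals differ by a factor of 2
```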

Is this an error in the book of Boyd & Vandenberghe, or am I missing something here? To me it does seem like an error, but at the same time I find this extremely unlikely, as the book is very popular, and if it were an error it would already be in the errata; it's much more likely that I'm misunderstanding something. This formula has already been mentioned in many questions on this website, but no question or answer that I saw mentions the possibility of the $\log \det X$ gradient in Boyd & Vandenberghe being wrong.


Edit based on the response of Profs. Boyd & Vandenberghe

Prof. Boyd kindly responded to my email about this issue and provided an explanation that he and Lieven Vandenberghe think can explain the discrepancy between the two formulas. In essence, their reply suggests that the discrepancy can be due to the choice of inner product. To better explain why, I need to summarize their proof in Appendix A.4.1 of the Convex Optimization book.

The proof is based on the idea that the derivative of a function gives the first-order approximation of the function. That is, the derivative of $f(X)$ can be obtained by finding a matrix $D$ that satisfies

$$f(X+\Delta X) \approx f(X)+\langle D,\Delta X\rangle.$$

In the book, Boyd & Vandenberghe use the trace-based inner product $\langle A,B\rangle = \text{trace}(AB)$ and show that

$$f(X+\Delta X) \approx f(X)+\text{trace}(X^{-1}\Delta X).$$

The book is publicly available; how they arrive at this expression can be seen in Appendix A.4.1. In their reply, Prof. Boyd suggests that the discrepancy stems from the choice of inner product. While they used the trace inner product, he suggests that other people may use $\langle A,B\rangle = \sum_{i\le j} A_{ij}B_{ij}$. The authors claim that this can explain the discrepancy (although I'm not sure whether they looked at the proof of Harville or others for the implicit or explicit use of this inner product), because the trace inner product puts twice as much weight on the off-diagonal entries.
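To illustrate their point numerically, here is a small check (my own sketch in NumPy, not from the book or the email): for a random symmetric positive definite $X$ and a small symmetric $\Delta X$, the change in $\log\det$ is recovered both by $\text{trace}(X^{-1}\Delta X)$ and by pairing Harville's matrix with the inner product $\sum_{i\le j} A_{ij}B_{ij}$, which is consistent with the inner-product explanation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)          # symmetric positive definite
D = rng.standard_normal((n, n))
dX = 1e-6 * (D + D.T)                # small symmetric perturbation

f = lambda M: np.log(np.linalg.det(M))
Xinv = np.linalg.inv(X)

exact = f(X + dX) - f(X)
boyd = np.trace(Xinv @ dX)           # <X^{-1}, dX> with the trace inner product

H = 2 * Xinv - np.diag(np.diag(Xinv))
iu = np.triu_indices(n)              # indices with i <= j
harville = np.sum(H[iu] * dX[iu])    # <H, dX> with the sum_{i<=j} inner product

print(exact, boyd, harville)         # all three agree to first order
```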


Some questions where Boyd & Vandenberghe's formula is mentioned:

evangelos

3 Answers


This is a really well done paper that describes what is going on:

Shriram Srinivasan, Nishant Panda. (2020) "What is the gradient of a scalar function of a symmetric matrix?" https://arxiv.org/pdf/1911.06491.pdf

Their conclusion is that Boyd's formula is the correct one; it is obtained by restricting the Fréchet derivative (defined on $\mathbb{R}^{n \times n}$) to the subspace of symmetric $n \times n$ matrices, denoted $\mathbb{S}^{n \times n}$. Deriving the gradient in the reduced space of $n(n+1)/2$ dimensions and then mapping back to $\mathbb{S}^{n \times n}$ is subtle and can't be done naively, which leads to Harville's inconsistent result.
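A brief numerical illustration of that final mapping step (my own sketch, assuming the Frobenius inner product on $\mathbb{S}^{n \times n}$): the matrix that reproduces the directional derivative $\langle G, V\rangle_F$ for every symmetric $V$ is $X^{-1}$, and it is obtained from the reduced-coordinate partials by halving the off-diagonal components rather than by filling them in symmetrically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)
Xinv = np.linalg.inv(X)
f = lambda M: np.log(np.linalg.det(M))

H = 2 * Xinv - np.diag(np.diag(Xinv))   # reduced-coordinate partials filled in symmetrically (Harville)
G = H.copy()
G[~np.eye(n, dtype=bool)] /= 2          # the careful "map back": halve the off-diagonal components
print(np.allclose(G, Xinv))             # True: this recovers Boyd's X^{-1}

# Only X^{-1} reproduces directional derivatives along arbitrary symmetric directions
S = rng.standard_normal((n, n))
V = (S + S.T) / 2
h = 1e-6
dirder = (f(X + h * V) - f(X - h * V)) / (2 * h)
print(dirder, np.sum(Xinv * V), np.sum(H * V))   # first two agree; the third generally does not
```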

  • Very interesting! I'll read it, but it'll take time for me to understand all the nuances; this topic pushes the bounds of my knowledge! In the meanwhile, let me say that I don't think lcv's answer (and Harville) can be wrong; here is why. Let $x_{ij}$ be the $ij$th entry of $X$ and $[\nabla f]_{ij}$ the $ij$th entry of the gradient of the scalar-valued function $f(X)$. If we define the gradient as the matrix-valued function such that $[\nabla f]_{ij} = \frac{\partial f}{\partial x_{ij}}$ (Harville's definition), then Boyd's formula simply can't be right; this is obvious from the $2\times 2$ case in lcv's answer. – evangelos Jul 24 '20 at 00:11
  • From the conclusion of that paper -- "The other definition aims to eliminate the redundant degrees of freedom present in a symmetric matrix and perform the gradient calculation in the space of reduced dimension and finally map the result back into the space of matrices. We showed, both through an example and rigorously through a theorem, that the problem in the second approach lies in the final step as the gradient in the reduced-dimension space is mapped into a symmetric matrix." -- my understanding is that the mapping between $2 \times 2$ symmetric matrices and $\mathbb{R}^3$ is not so simple... – digbyterrell Jul 24 '20 at 18:08
  • This is an interesting preprint. I did not follow it until I had read the whole thing, so I wrote my own summary. Harville's answer appears to match the form that the preprint shows, in Section 2.2, to be incorrect. – Joe Mack Aug 04 '20 at 20:06
  • relevant? Prove $\frac{\partial \ln|X|}{\partial X} = 2X^{-1} - \operatorname{diag}(X^{-1})$. Here I say 'We first note that for the case where the elements of $X$ are independent, a constructive proof involving cofactor expansion and adjoint matrices can be made to show that $\frac{\partial \ln|X|}{\partial X} = X^{-T}$ (Harville). This is not always equal to $2X^{-1}-\operatorname{diag}(X^{-1})$. The fact alone that $X$ is positive definite is sufficient to conclude that $X$ is symmetric and thus its elements are not independent.' – BCLC Apr 16 '21 at 10:08
  • @JoeMack I read your summary and noticed that your definition of vec(X) is different than the one used in the Panda paper, i.e. you use row-stacking while Panda uses column-stacking. I believe that column-stacking is the more widely used definition. – greg Dec 23 '21 at 13:04

$ \def\p#1#2{\frac{\partial #1}{\partial #2}} \def\g#1#2{\p{#1}{#2}} \def\m#1{\left[\begin{array}{r}#1\end{array}\right]} $To summarize the example used in the accepted answer $$\eqalign{ \phi(A) &= \log\det(A) = \log(ab-x^2) \\ A &= \m{a&x\\x&b} \quad\implies\quad \g{\phi}{A} = \frac{1}{ab-x^2}\m{b&-2x\\-2x&a} \\ }$$ Let's use this in a first-order Taylor expansion (use a colon to denote the matrix inner product) $$\eqalign{ \phi(A+dA) &= \phi(A) + \g{\phi}{A}:dA \\ d\phi &= \g{\phi}{A}:dA \\ &= \frac{1}{ab-x^2}\m{b&-2x\\-2x&a}:\m{da&dx\\dx&db} \\ &= \frac{a\,db-4x\,dx+b\,da}{ab-x^2} \\ }$$ which disagrees with the direct (non-matrix) calculation $$\eqalign{ d\log(ab-x^2) &= \frac{a\,db-2x\,dx+b\,da}{ab-x^2} \\ }$$ On the other hand, using Boyd's result for the matrix calculation yields $$\eqalign{ d\phi &= \g{\phi}{A}:dA = \frac{1}{ab-x^2}\m{b&-x\\-x&a}:\m{da&dx\\dx&db} = \frac{a\,db-2x\,dx+b\,da}{ab-x^2} \\ }$$ which is correct.
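Plugging numbers into this comparison confirms it (a short check of my own in NumPy, using arbitrary values, not part of the original answer):

```python
import numpy as np

a, b, x = 3.0, 5.0, 1.0
A = np.array([[a, x], [x, b]])
Ainv = np.linalg.inv(A)
da, db, dx = 1e-6, 2e-6, 3e-6
dA = np.array([[da, dx], [dx, db]])

f = lambda M: np.log(np.linalg.det(M))
dphi = f(A + dA) - f(A)                              # true first-order change

print(dphi)
print((a*db - 2*x*dx + b*da) / (a*b - x**2))         # direct (non-matrix) calculation: agrees
print(np.sum(Ainv * dA))                             # Boyd's A^{-1} paired with dA: agrees
harville = 2 * Ainv - np.diag(np.diag(Ainv))
print(np.sum(harville * dA))                         # Harville's matrix paired with dA: off (4x dx term)
```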

Carefully read the Srinivasan-Panda paper (which has been mentioned in other answers) for an explanation of why Harville (and many other references) are mistaken.

Harville's quantity may be useful in certain contexts, but it is not a gradient.

greg

Let me call $X_0$ the symmetric matrix with entries $(X_0)_{i,j} = x_{i,j}$. We have by assumption $x_{i,j}=x_{j,i}$. Since $X_0$ is symmetric, it can be diagonalized (if it's real). Its determinant is the product of the eigenvalues $\lambda_k$. So for a symmetric matrix $X$

$$ \ln\det X = \sum_k \ln(\lambda_k ) $$

Assume $X$ depends on a parameter $t$. Its derivative would be

$$ \frac{d}{dt} \ln\det X(t) = \sum_k \frac{\dot{\lambda}_k}{\lambda_k} $$

Say we want the derivative of $\ln\det X_0$ with respect to $x_{i,j}$ for $i\neq j$. Then, defining

\begin{align} V &= |i\rangle \langle j | + |j\rangle \langle i | \\ X(t) &= X_0 +tV, \end{align}

($V$ is the matrix with all zeros except ones at position $(i,j)$ and $(j,i)$). We have

$$ \frac{\partial}{\partial x_{i,j}} \ln\det X_0 = \left . \frac{d}{dt} \ln\det X(t) \right \vert_{t=0}= \sum_k \frac{\dot{\lambda}_k}{\lambda_k} $$

Now

$$ \dot{\lambda}_k = \langle v_k | V| v_k \rangle $$

where $|v_k \rangle$ is the eigenvector of $X_0$ corresponding to $\lambda_k$. Hence (for $i\neq j$)

\begin{align} \frac{\partial}{\partial x_{i,j}} \ln\det X_0 & = \sum_k \frac{ \langle j| v_k \rangle \langle v_k |i \rangle }{\lambda_k} + i \leftrightarrow j \\ &= \left ( X^{-1} \right)_{j,i} +\left ( X^{-1} \right)_{i,j} \\ &= 2\left ( X^{-1} \right)_{i,j} \end{align}

Let us now compute the derivative with respect to $x_{i,i}$. We reason exactly as before with $V = |i\rangle \langle i |$ and we get

\begin{align} \frac{\partial}{\partial x_{i,i}} \ln\det X_0 & = \sum_k \frac{ \langle i| v_k \rangle \langle v_k |i \rangle }{\lambda_k} \\ &= \left ( X^{-1} \right)_{i,i}. \end{align}

Hence the second formula is the correct one for a symmetric matrix. The first formula is correct for a non-symmetric matrix (with independent entries). All formulae of course require the matrix to be non-singular.
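A quick numerical check of this derivation (my own sketch in NumPy, not part of the original answer): for the symmetric perturbation $V$ with ones at $(i,j)$ and $(j,i)$, the first-order eigenvalue rates $\dot\lambda_k = \langle v_k|V|v_k\rangle$ and the resulting sum $\sum_k \dot\lambda_k/\lambda_k$ both match a finite difference, and both equal $2(X^{-1})_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, i, j = 4, 0, 2
B = rng.standard_normal((n, n))
X0 = B @ B.T + n * np.eye(n)          # symmetric positive definite, hence non-singular

V = np.zeros((n, n))
V[i, j] = V[j, i] = 1.0               # symmetric perturbation of the (i, j) and (j, i) entries

lam, vecs = np.linalg.eigh(X0)
lam_dot = np.array([vecs[:, k] @ V @ vecs[:, k] for k in range(n)])   # first-order eigenvalue rates

f = lambda M: np.log(np.linalg.det(M))
h = 1e-6
fd = (f(X0 + h * V) - f(X0 - h * V)) / (2 * h)   # d/dt ln det X(t) at t = 0

print(fd)
print(np.sum(lam_dot / lam))                     # sum_k lambda_dot_k / lambda_k
print(2 * np.linalg.inv(X0)[i, j])               # 2 (X^{-1})_{ij}
```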

Added

Let's explain the subtlety with one example that should clarify the matter. Consider the following symmetric matrix:

$$ A=\left(\begin{array}{cc} a & x\\ x & b \end{array}\right) $$

Now,

$$\log\det(A) = \log(ab-x^2)$$

and so

\begin{align} \frac{\partial \log\det(A)}{\partial a } &= \frac{b}{ab-x^2} \\ \frac{\partial \log\det(A)}{\partial x } &= - \frac{2x}{ab-x^2} \\ \frac{\partial \log\det(A)}{\partial b } &= \frac{a}{ab-x^2} \end{align}

And compare this with

$$ A^{-1} = \frac{1}{(ab-x^2)} \left(\begin{array}{cc} b & -x\\ -x & a \end{array}\right) $$

This simple calculation agrees with the formula above (cf. the factor of 2). As I said in the comments, the point is to be clear about what the independent variables are, or what variation we are using. Here I considered a variation $V$ which is symmetric, as this seems to be the problem's assumption.

Obviously if you consider

$$ A'=\left(\begin{array}{cc} a & y\\ x & b \end{array}\right) $$

you will obtain $\nabla \log\det A' \sim {A'}^{-1}$ (more precisely, $({A'}^{-1})^T$).
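A symbolic check of these $2\times 2$ partials (my own sketch in SymPy, not part of the original answer):

```python
import sympy as sp

a, b, x, y = sp.symbols('a b x y')

# Symmetric A: the value x occupies both off-diagonal slots, so d/dx picks up the factor 2
f_sym = sp.log(a*b - x**2)
print(sp.diff(f_sym, a), sp.diff(f_sym, x), sp.diff(f_sym, b))
# b/(a*b - x**2)   -2*x/(a*b - x**2)   a/(a*b - x**2)

# A' with independent off-diagonal entries y (at (1,2)) and x (at (2,1)):
# the partial w.r.t. the (1,2) entry is the (1,2) entry of (A'^{-1})^T, with no factor of 2
f_gen = sp.log(a*b - x*y)
print(sp.diff(f_gen, y))
# -x/(a*b - x*y)
```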

lcv
  • Thank you. I'm a bit puzzled though, because this means that the book of Boyd & Vandenberghe (cited >50K times) is wrong in its computation of the gradient of $\log \det X$ for symmetric $X$. I'm surprised particularly because the incorrect formula is used in many places in the book – evangelos May 10 '20 at 01:35
  • I understand. It could be somehow a problem of notation. The formula is correct for the derivative of the element at position $(i,j)$. If you write $X$ as dependent on only $n(n+1)/2$ variables, meaning that in the lower triangular part you have the same elements as in the upper triangular part, then the formula here is the correct one. – lcv May 10 '20 at 03:09
  • Seen differently we are computing a variation of a function of a matrix. If you assume that the variation (my $V$) has the same symmetry as the $X$ then you get the formula above (my $V$ is symmetric). But you can also decide to compute the variation of a symmetric matrix with a non-symmetric perturbation. In which case you get the other formula. So it depends on what you decide are the independent variables. – lcv May 10 '20 at 03:13
  • Prof. Boyd just kindly responded to my email, saying the formula $X^{-1}$ is correct, and "[t]his is easily shown. It comes down to showing that for small symmetric matrix $V$, and symmetric positive definite $X$, we have

    $$\log \det (X+V) \approx \log \det (X)+\text{tr}(X^{-1}V).$$

    [T]his formula does not hold when $X$ is not symmetric."

    Is this at odds with what you suggested, i.e. the first formula is correct with your (i.e., symmetric) $V$? I can't really tell. Full proof is at https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf (section A.4.1) in case you are interested. Thanks!

    – evangelos May 10 '20 at 12:49
  • The authors kindly provided further information that is likely to explain the discrepancy; I've updated my question based on their reply. – evangelos May 10 '20 at 17:56
  • It's good that you emailed him. I think he has not appreciated the subtlety though. I'll explain it with one example – lcv May 10 '20 at 18:35
  • The problem has nothing to do with what you wrote in the new version of the question. And the formula that you wrote in the comments simply requires $X$ to be invertible. – lcv May 10 '20 at 19:32
  • I see. The $2\times 2$ example has been very helpful; I wish I had thought of it in my first email to the authors. The point is how to interpret the independent variables and, related to that, how to apply the perturbations -- with a symmetric or non-symmetric $V$. The authors claimed that they considered perturbations with symmetric $V$ with $n(n+1)/2$ independent variables, in which case I guess the formula $X^{-1}$ is simply wrong... – evangelos May 11 '20 at 23:57
  • Yes, exactly. If that is what the authors claim, then it is simply wrong. If one considers only $n(n+1)/2$ independent variables, differentiation is equivalent to a symmetric perturbation $V$, which leads to the formula above. – lcv May 12 '20 at 00:01
  • I added a downvote because the paper by Srinivasan & Panda cited in my answer below convinced me that Boyd's solution was actually the correct one... But let me know if I am misunderstanding something! – digbyterrell Jul 22 '20 at 21:38