
First of all, let me say that the answer to this question is likely to confirm a not-so-minor error in a very popular (and excellent) textbook on optimization, as you'll see below.

Background

Suppose that we have a real-valued function $f(X)$ whose domain is the set of $n\times n$ nonsingular symmetric matrices. Clearly, $X$ does not have $n^2$ independent variables; it has only $n(n+1)/2$ independent variables, since it is symmetric. As is well known, an important use of the Taylor expansion is to find the derivative of a function by finding the best first-order approximation. That is, if one can find a matrix $D \in \mathbb{R}^{n\times n}$, itself a function of $X$, that satisfies

$$f(X+V) = f(X) + \langle D, V \rangle + \text{h.o.t.}, $$ where $\text{h.o.t.}$ stands for higher-order terms and $\langle \cdot, \cdot \rangle$ is an inner product, then the matrix $D$ is the derivative of $f$ with respect to $X$.
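As a concrete illustration of this definition in the unconstrained case, here is a minimal numpy sketch I put together (not part of the argument above); it uses $f(X)=\operatorname{trace}(X^2)$ purely as an example, together with the trace inner product $\langle D,V\rangle = \operatorname{trace}(D^TV)$:

```python
import numpy as np

# Minimal sketch of the setup above for an UNCONSTRAINED X
# (all n^2 entries independent).  Take f(X) = trace(X @ X); its
# derivative with respect to a general (non-symmetric) X is D = 2*X.T.
rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n))
V = 1e-6 * rng.standard_normal((n, n))   # small perturbation

f = lambda A: np.trace(A @ A)
D = 2 * X.T                              # candidate derivative

lhs = f(X + V) - f(X)
rhs = np.trace(D.T @ V)                  # <D, V> with the trace inner product
print(abs(lhs - rhs))                    # tiny: second order in V
```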

Question

Now my question is: what is the right inner product $\langle \cdot, \cdot \rangle$ to use here if the matrix is symmetric? I know that if the entries of $X$ were independent (i.e., $X$ not symmetric), then the $\text{trace}$ operator would be the correct inner product. But I suspect that this is not true in general for a symmetric matrix. More specifically, my guess is that even if the $\text{trace}$ operator leads to the correct expansion in the equation above, the resulting $D$ matrix will not be the correct derivative. Here is why I think this is the case.

A while ago, I asked a question about the derivative of the $\log\det X$ function, because I suspected that the formula in the book Convex Optimization by Boyd & Vandenberghe is wrong. The formula indeed seems to be wrong, as the accepted answer there makes clear. I then tried to understand what went wrong in the proof in the Convex Optimization book. The approach used in the book is precisely the approach that I outlined above in Background. The authors show that the first-order Taylor approximation of $f(X)=\log\det X$ for symmetric $X$ is $$ f(X+V) \approx f(X)+\text{trace}(X^{-1}V). $$

The authors prove this approximation using a decomposition specific to symmetric matrices (the proof is in Appendix A.4.1; the book is publicly available). Now, this approximation is correct, but $X^{-1}$ is not the correct derivative of $\log\det X$ for symmetric $X$; the correct derivative is $2X^{-1}-\text{diag}(\text{diag}(X^{-1}))$. Interestingly, the same approximation holds for nonsymmetric invertible matrices too (it can be shown via an SVD), and in that case it does give the right derivative, because the derivative of $\log\det X$ is indeed $X^{-T}$ for a matrix with $n^2$ independent entries. Therefore I suspect that $\text{trace}$ is not the right inner product $\langle \cdot, \cdot \rangle$ for symmetric matrices, as it ignores the fact that the entries of $X$ are not independent. Can anyone shed light on this question?
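For anyone who wants to check the two claims numerically, here is a small finite-difference sketch (my own illustration, with numpy, a random positive-definite $X$, and an arbitrary step size): the trace-based expansion does hold for symmetric $X$, yet differentiating over the $n(n+1)/2$ independent entries gives $2X^{-1}-\text{diag}(\text{diag}(X^{-1}))$ rather than $X^{-1}$.

```python
import numpy as np

# Numerical illustration (a sketch, not a proof). For symmetric
# positive-definite X and f = log det:
#   (1) f(X+V) - f(X) is well approximated by trace(X^{-1} V)
#       for a small SYMMETRIC perturbation V, yet
#   (2) entrywise derivatives over the n(n+1)/2 independent entries
#       give 2*X^{-1} - diag(diag(X^{-1})), not X^{-1}.
rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)              # symmetric positive definite
Xinv = np.linalg.inv(X)

f = lambda M: np.log(np.linalg.det(M))

# (1) the first-order expansion with trace
V = 1e-6 * rng.standard_normal((n, n))
V = (V + V.T) / 2                        # keep X + V symmetric
print(f(X + V) - f(X), np.trace(Xinv @ V))   # nearly equal

# (2) entrywise derivative, perturbing X_ij and X_ji together
h = 1e-6
G = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0          # one independent variable
        G[i, j] = G[j, i] = (f(X + h * E) - f(X - h * E)) / (2 * h)
print(np.allclose(G, 2 * Xinv - np.diag(np.diag(Xinv)), atol=1e-5))  # True
```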

Added: A simpler question

Based on a comment, I understand that the general answer to my question may be difficult, so let me ask a simpler question. The answer to this question may be sufficient to show what went wrong in the proof in the Convex Optimization book.

Suppose $g(X)$ is a function $g: \mathbb{R}^{n\times n} \to \mathbb R$. Is it true that the first-order Taylor approximation with trace as the inner product, i.e.,

$$g(X+V) \approx g(X) + \text{trace}\left( \nabla g (X)^T V \right), $$

implicitly assumes that the entries of $X$ are independent? In other words, is it true that this approximation may not hold if entries of $X$ are not independent (e.g., if $X$ is symmetric)?

evangelos
    If you try to fully generalize the analysis on nonsingular matrices, I am afraid you will soon see the need for the study of Lie groups :) – Miguel May 24 '20 at 15:34
  • I see! :) Well, I added a simpler question the answer to which is probably sufficient for seeing what went wrong in the proof of the book! – evangelos May 24 '20 at 16:10

2 Answers


Consider a pair of matrices with elements given by $$\eqalign{ M_{ij} &= \begin{cases} 1 &\text{if }(i=j) \\ \frac{1}{2} & \text{otherwise}\end{cases} \\ W_{ij} &= \begin{cases} 1 &\text{if }(i=j) \\ 2 & \text{otherwise}\end{cases} \\ }$$ which are Hadamard inverses of each other, i.e. $\;M\odot W={\tt1}$ (the all-ones matrix).

Suppose that you have been given a function, and by hard work you have calculated its gradient $G$ and its Taylor expansion $$f(X+dX) \approx f(X) + G:dX$$ where the colon denotes the Frobenius inner product $\;A:B={\rm Tr}(A^TB)$

Everything looks great until someone points out that your problem has a symmetry constraint $$X={\rm Sym}(X)\doteq\tfrac{1}{2}\left(X+X^T\right)$$ The constraint implies $(X,G)$ are symmetric, so you might think the constrained gradient is $$\eqalign{ H &= {\rm Sym}(G) \\ }$$ but this is not correct. Fortunately, there is a way to calculate $H$ from $G$ $$\eqalign{ H &= W\odot{\rm Sym}(G) = W\odot G \quad\implies\quad G = M\odot H \\ }$$ Substituting this into the Taylor expansion yields $$\eqalign{ f(X) + G:dX &= f(X) + (M\odot H):dX \\ &= f(X) + H:(M\odot dX) \\ &= f(X) + (\sqrt{M}\odot H):(\sqrt{M}\odot dX) \\ }$$ NB: These matrices are symmetric with only $\left(\frac{n(n+1)}{2}\right)$ independent components.

You might think of the last expansion formula as the standard inner product after each factor has been rescaled elementwise by the square root of the $M$ matrix.
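As a sanity check, here is a small numpy sketch of the $H=W\odot G$ rule (illustrative only; it uses $f(X)={\rm Tr}(X^2)$ as a stand-in function, for which the unconstrained gradient is $G=2X^T$):

```python
import numpy as np

# Sketch: numerical check of H = W ⊙ G for a symmetric argument,
# using f(X) = Tr(X @ X).  The unconstrained gradient is G = 2 X^T = 2 X,
# and finite differences over the n(n+1)/2 independent entries should
# reproduce H = W ⊙ G (diagonal kept, off-diagonal doubled).
rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n))
X = (X + X.T) / 2                          # symmetric test point
G = 2 * X                                  # unconstrained gradient
W = 2 * np.ones((n, n)) - np.eye(n)        # 1 on diagonal, 2 off diagonal

f = lambda M: np.trace(M @ M)
h = 1e-6
H = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0            # perturb one independent entry
        H[i, j] = H[j, i] = (f(X + h * E) - f(X - h * E)) / (2 * h)

print(np.allclose(H, W * G, atol=1e-6))    # True
```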

The Frobenius $\times$ Hadamard product generates a scalar triple product, i.e. $$A:B\odot C = \sum_i\sum_j A_{ij}B_{ij}C_{ij}$$ The order of the three matrices does not affect the value of this product.
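A quick numerical confirmation of this invariance (again only a sketch, with random matrices):

```python
import numpy as np

# Check the triple-product identity A : (B ⊙ C) = Σ_ij A_ij B_ij C_ij
# and its invariance under reordering of the three matrices.
rng = np.random.default_rng(3)
A, B, C = rng.standard_normal((3, 4, 4))
t1 = np.trace(A.T @ (B * C))      # A : (B ⊙ C)
t2 = np.trace(B.T @ (C * A))      # B : (C ⊙ A)
t3 = (A * B * C).sum()            # Σ_ij A_ij B_ij C_ij
print(np.allclose(t1, t2), np.allclose(t1, t3))   # True True
```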

Interestingly, if you had to enforce a skew constraint, i.e. $$X={\rm Skw}(X)\doteq\tfrac{1}{2}\left(X-X^T\right)$$ then the constrained gradient would satisfy your intuition: $$H={\rm Skw}(G)$$ with $\left(\frac{n(n-1)}{2}\right)$ independent components.

greg
  • Excellent! So let me summarize to see if I got it correctly. Suppose the domain of $f(X)$ is the set of symmetric matrices, and I somehow managed to come up with an approximation $f(X+dX) \approx f(X) + A:dX $. Now the $A$ here is not the gradient of $f$, but the gradient of $f$ can be obtained as $\nabla f = W\odot H$, using the $W$ that you provided. Is that correct? – evangelos May 25 '20 at 00:40
  • I think you meant $\nabla f=W\odot A$, but yes you've got the idea. – greg May 25 '20 at 00:47
  • Sorry, yes I meant $\nabla f = W\odot A$. Thanks a lot! This clearly works for the specific derivative that I mention in the question, because $\log\det (X+dX) \approx \log\det X + X^{-1}:dX$, but the derivative of $\log\det X$ for symmetric $X$ is $X^{-1} \odot W = 2 X^{-1} - \text{diag}(y_{11},\dots,y_{nn})$, where $y_{ii}$ is the $i$th diagonal entry of $X^{-1}$. – evangelos May 25 '20 at 00:56
  • The diag part looks better when written as a Hadamard product with the identity matrix, i.e. $2X^{-1}-I\odot X^{-1} = W\odot X^{-1}$ – greg May 25 '20 at 01:04
  • How can this argument be extended to the Hessian of the function? How do we take into account that the matrix in the argument is symmetric? Because in this case too, the quadratic form involving the Hessian, i.e. $dX: \mathbb{H} : dX$, should be cast in a way that accounts for the symmetry. – yes Jul 14 '20 at 08:40
  • @greg Moreover, shouldn't it be $H= M \odot G$? Because once we have the hand-worked gradient $G$, the independent variables are only in the upper triangular part and on the diagonal, so we have to multiply the off-diagonal part by $1/2$ to account for it being counted twice in the matrix inner product. Am I correct? – yes Jul 14 '20 at 13:00
  • @Alberto This great question contains a nice example in which $A$ is a symmetric matrix and $B$ is not. You'll note that the gradients are given by $$\eqalign{ \frac{\partial\operatorname{Tr}(A^2)}{\partial A} &= W\odot\big(2A^T\big) \\ \frac{\partial\operatorname{Tr}(B^2)}{\partial B} &= 2B^T \\ }$$ in agreement with the above result. – greg Jul 14 '20 at 14:48
  • Thank you, I see my error: I carried out the derivation already treating $X$ as symmetric. Now I am wondering: how do we compute the Hessian of the function? Is it the constrained gradient that has to be differentiated again? – yes Jul 28 '20 at 15:02

I think the key problem is that such a differential on "sets of matrices with dependent components" is not defined.

If $f:\mathbb{R}^m \rightarrow \mathbb{R}$ is differentiable, then the first-order approximation in the direction of $v$ is: $$f(x+v)\approx f(x)+\nabla f(x)\cdot v $$ with the usual dot product: $$\nabla f(x)\cdot v=\sum_i \frac{\partial f}{\partial x_i}\,v_i $$

Now, if $m=n^2$ and you fancy reshaping vectors as square matrices and writing everything in uppercase, this is the same as: $$f(X+V)\approx f(X)+tr(D(X)^\top\, V )$$ where the $ij$ component of the matrix $D(X)$ is $\frac{\partial\, f}{\partial\, X_{ij}}$, because the trace reproduces the usual dot product: $$tr(D(X)^\top\, V ) = \sum_i\sum_j D(X)_{ij}\,V_{ij}=\sum_i\sum_j\frac{\partial\, f}{\partial\, X_{ij}}\,V_{ij}$$

All of this is well known and I have only recalled it to have some notation at hand for the case where the components of $X$ are not "independent". One way to explain the problem in this case is that the domain is no longer $\mathbb{R}^m$ and you have to rewrite the function definition.

I will try to do this rewriting. For instance, let $X=\begin{pmatrix} a& b\\b & c\end{pmatrix}$ and consider your function as $f:\mathbb{R}^3\to\mathbb{R}$, so that $f(X)=f(a,b,c)$ and $\nabla f=\left(\frac{\partial f}{\partial a},\frac{\partial f}{\partial b},\frac{\partial f}{\partial c}\right)$. But now the gradient cannot be cast into a square matrix. If you simply repeat the derivative with respect to $b$ and place it twice in the matrix, then the trace does not recover the dot product but introduces an extra term.
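For concreteness, here is a small numpy sketch of that extra term (my own illustration; it uses $f(a,b,c)=\log\det\begin{pmatrix}a&b\\b&c\end{pmatrix}$, chosen only because that is the function from the question):

```python
import numpy as np

# Sketch of the 2x2 example: f(a, b, c) = log det [[a, b], [b, c]].
# The true first-order change is grad · (da, db, dc) in R^3.
# If instead ∂f/∂b is placed twice into a matrix and the trace inner
# product is used with V = [[da, db], [db, dc]], the db contribution
# is counted twice.
a, b, c = 2.0, 0.5, 3.0
da, db, dc = 1e-6, 2e-6, -1e-6
X = np.array([[a, b], [b, c]])
V = np.array([[da, db], [db, dc]])

f = lambda M: np.log(np.linalg.det(M))
det = a * c - b * b
grad3 = np.array([c / det, -2 * b / det, a / det])   # d/da, d/db, d/dc

true_change = f(X + V) - f(X)
dot3 = grad3 @ np.array([da, db, dc])                # correct (R^3 view)
P = np.array([[c / det, -2 * b / det],               # ∂f/∂b placed twice
              [-2 * b / det, a / det]])
trace_version = np.trace(P.T @ V)                    # double-counts db

print(true_change, dot3, trace_version)
# true_change ≈ dot3, while trace_version = dot3 + (-2*b/det)*db, i.e.
# it picks up one extra copy of the off-diagonal contribution.
```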

Another way to see what is happening is to note that not every perturbation $V$ is valid, since $X+V$ may not be symmetric.

To summarize, you have to introduce a novel concept of differentiation on a set that is not a linear space, because the differential as such is not defined on such weird sets. (Spoiler alert: manifolds)

You can visualize the problem with a simpler example. Consider the function $f: \mathbb{R}^2 \to \mathbb{R}$, $f(x,y)=\frac{1}{2}(x^2+y^2)$. Then the gradient is $\nabla f(x,y)=(x,y)$. But imagine that an external influence forces the points to remain on the circle: $\mathcal{S}^1=\{(x,y)\in\mathbb{R}^2:x^2+y^2=1\}$, so the components $x,y$ are not "independent". (You can think of a centripetal force in physics or a constraint in optimization). Then, it is obvious that your function is constant, so the gradient must vanish.
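A tiny numerical sketch of this (illustrative only): the ambient gradient at a point of the circle is purely normal to the constraint set, so its tangential component vanishes.

```python
import numpy as np

# On the unit circle the ambient gradient of f(x,y) = (x^2+y^2)/2 is (x, y),
# which is normal to the circle, so the component along the tangent is zero.
theta = 0.7                                   # arbitrary point on S^1
x, y = np.cos(theta), np.sin(theta)
grad = np.array([x, y])                       # ambient gradient
tangent = np.array([-y, x])                   # unit tangent to the circle
print(grad @ tangent)                         # 0 (up to rounding)
```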

And then all the differential geometry of manifolds starts...

Edit: Maybe I have not answered your question. You try to put the blame on the dot product, and it is true that you have to find a way to rewrite the dot product in matrix form. But I think the issue is more fundamental: it is the derivative itself that must be redefined. I am sure B&V know the rigorous formalism, but they tried to keep their text at a more elementary level. BTW, if your topic is optimization, maybe you can have a look at Absil's excellent book, Optimization Algorithms on Matrix Manifolds, but, again, differential geometry is required.

Miguel
  • actually, the set of invertible symmetric matrices is an open subset of a vector space (more precisely a finite-dimensional real Banach space), so all of the usual differential calculus can be applied. So at this stage, I doubt there's any need for manifolds. – peek-a-boo May 24 '20 at 21:52
  • @peek-a-boo Well, at the end of the day all manifolds can be embedded in linear spaces. But the embedding is not unique. In this case, you have to be aware that the dimension is not $n^2$. – Miguel May 24 '20 at 21:57
  • Sure, manifolds can be embedded but I don't see the relevance to my comment. My comment was mainly regarding your first sentence that "differential on sets of matrices with dependent components is not defined". If $S_n(\Bbb{R})$ is the set of symmetric matrices, then it clearly forms a vector space. If we consider the non-singular ones, i.e $U:= \det^{-1}(\Bbb{R}\setminus {0}) \cap S_n(\Bbb{R})$ then this is an open subset of $S_n(\Bbb{R})$, so all the standard differential calculus can be applied. My point is that at this stage there is no need at all to consider any manifolds. – peek-a-boo May 24 '20 at 22:02
  • @peek-a-boo The problem I see with standard calculus is that you have to choose coordinates to compute partial derivatives. Then you cannot simply put all these derivatives in a matrix because some derivatives would appear twice. So in the context of the OP you can do calculus on $\mathbb{R}^\frac{n(n+1)}{2}$ but then the gradient is not a matrix. – Miguel May 24 '20 at 22:38
  • I see, very helpful! Based on your example in $\mathbb R^3$, I wonder if the following workaround may be a legitimate solution. Suppose $x$ is a vector of size $m=n(n+1)/2$, and let $S(x)$ be a matrix-valued function that creates a symmetric matrix from the $m$ independent entries in $x$. Then, if I can find a vector $g$ that satisfies $f(S(x+dx)) \approx f(S(x)) + g^T dx$, the derivative of the original function $f(X)$ will be $\nabla f = S(g)$, because now I used an intermediate function with independent variables. Is that right? At least it seems to work for the $\log\det X$ example! – evangelos May 25 '20 at 01:22
  • I suggest a look at this new paper: https://arxiv.org/pdf/1911.06491.pdf – yes Jul 28 '20 at 14:36