
Thanks to the scientific community, things are becoming clearer regarding the question: what is the gradient of a function $f(X)$ when $X$ is a symmetric matrix?

In particular, here are some useful links that have addressed this question in the past and can serve as references for the discussion below:

  1. Understanding notation of derivatives of a matrix
  2. Taylor expansion of a function of a symmetric matrix
  3. https://arxiv.org/pdf/1911.06491.pdf

In a nutshell, we can say that, when dealing with a function of a matrix argument, we have to distinguish between two "different", but related, gradients:

  1. the unconstrained gradient $G$, computed with standard matrix calculus without assuming any dependence among the entries of $X$, and used to compute the differential of the function, i.e. $df = G:dX$
  2. the constrained gradient $S$, which takes into account only the independent entries of $X$.

These two gradients are related by the expression:

$$S=G+G^{T}-I \circ G $$

and it turns out that the first-order differential of the function $f$ at a given point $X$ after a perturbation $dX$ can be computed as:

$$ d f=\sum_{i, j} G_{i j} d X_{i j} = \sum_{i \geq j} S_{i j} d X_{i j}$$
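For concreteness, here is a small numerical sanity check of these two identities (my own NumPy sketch, not taken from the links above), using $f(X)=\log\det X$, for which the unconstrained gradient at a symmetric point is $G=X^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# symmetric positive definite test point X and a small symmetric perturbation dX
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)
E = 1e-6 * rng.standard_normal((n, n))
dX = (E + E.T) / 2

f = lambda M: np.linalg.slogdet(M)[1]     # f(X) = log det X
G = np.linalg.inv(X)                      # unconstrained gradient (symmetric here)
S = G + G.T - np.eye(n) * G               # constrained gradient S = G + G' - I∘G

df_exact = f(X + dX) - f(X)
df_G = np.sum(G * dX)                     # sum_{i,j} G_ij dX_ij
df_S = np.sum(np.tril(S * dX))            # sum_{i>=j} S_ij dX_ij
print(df_exact, df_G, df_S)               # the three values agree to first order
```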

It is important to note, however, that in an iterative algorithm that updates a variable $X^{k+1}$ (such as gradient descent) we have to use the constrained gradient $S$ and not the gradient $G$, because $X$ is symmetric while $G$ may not be.

More information can be found in the links above, which also explain this relation in terms of the $\operatorname{vec}(\cdot)$ and $\operatorname{vech}(\cdot)$ operators.

Coming to my question: I now want to find the Hessian of the function $f(X)$, which in theory is a $4$th-order tensor, and we already know what a rough road it was just to reach the gradient.

To start, is it correct to perturb the first-order differential (written with the unconstrained gradient)? If so, I arrive at a scalar quadratic form. For instance, if we take $f(X)=\log \operatorname{det} X$, we know that the second-order term of the expansion, with perturbations $U$ and $V$, is given by (see the question Second order approximation of log det X):

$$-\operatorname{tr}\left(X^{-1} U X^{-1} V\right) = - \operatorname{vec}(U^{\top})^{\top}(X^{-\top} \otimes X^{-1}) \operatorname{vec}(V)$$

We can arrive at the Hessian in matrix form $X^{-\top} \otimes X^{-1}$.
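As a quick sanity check (my own sketch, assuming NumPy and a column-stacking $\operatorname{vec}$), the trace form and the Kronecker form of this quadratic term do coincide numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)               # symmetric positive definite test point
U = rng.standard_normal((n, n))
V = rng.standard_normal((n, n))

vec = lambda M: M.flatten(order='F')      # column-stacking vec
Xi = np.linalg.inv(X)

lhs = -np.trace(Xi @ U @ Xi @ V)
rhs = -vec(U.T) @ np.kron(Xi.T, Xi) @ vec(V)
print(np.allclose(lhs, rhs))              # True
```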

My first question is: how can this be written in tensor form?

And my second question is: how do we obtain the constrained Hessian in this case?

yes

1 Answer


If a matrix is symmetric, then by definition it contains redundant information. This redundancy can be eliminated by using the half-vec representation, i.e. $$\eqalign{ x &= {\rm vech}(X) \;\iff\; X = {\rm unvech}(x) \\ }$$ It may be necessary to temporarily reconstitute the matrix in order to evaluate matrix-specific functions (like the inverse, trace or determinant), but otherwise any iterative process (like gradient descent or quasi-Newton) should be performed within the half-vec space.
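For readers who want to experiment, here is a minimal NumPy sketch of the half-vec pair (the function names `vech`/`unvech` are my own; any half-vec ordering works as long as it is used consistently):

```python
import numpy as np

def vech(X):
    """Half-vec: stack the columns of the lower triangle of X (diagonal included)."""
    n = X.shape[0]
    r, c = np.triu_indices(n)     # pairs (r, c) with r <= c, ordered by r then c
    return X[c, r]                # (c, r) runs over the lower triangle column by column

def unvech(x):
    """Inverse of vech: rebuild the symmetric matrix from its half-vec."""
    m = x.size
    n = int(round((np.sqrt(8 * m + 1) - 1) / 2))   # solve m = n(n+1)/2 for n
    X = np.zeros((n, n))
    r, c = np.triu_indices(n)
    X[c, r] = x                   # fill the lower triangle
    X[r, c] = x                   # mirror it to the upper triangle
    return X
```

With these, `unvech(vech(X))` reproduces any symmetric `X`.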

There are two advantages to the half-vec representation. The first is that $x$ is unconstrained, so there is no need to enforce a symmetry constraint between steps. The second is that $x$ is a vector, so it is not necessary to invoke 4th order tensors in order to calculate the Hessian. Such tensors are required when the independent variable is a matrix.

For the example function
$$\eqalign{ \phi &= \log\det{\rm unvech}(x) \\ &= \log\det X \\ d\phi &= X^{-1}:dX \\ &= {\rm vec}\big(X^{-1}\big):{\rm vec}(dX) \\ &= {\rm vec}\big(X^{-1}\big):D\,dx \qquad&\big({\rm Duplication\,matrix}\big)\\ &= D^T{\rm vec}\big(X^{-1}\big):dx \\ \frac{\partial\phi}{\partial x} &= D^T{\rm vec}\big(X^{-1}\big) \;\doteq\; g \qquad&\big({\rm half\,vec\,gradient}\big)\\ }$$ The Hessian matrix can be calculated as $$\eqalign{ dg &= D^T{\rm vec}\big(dX^{-1}\big) \\ &= -D^T{\rm vec}\big(X^{-1}\,dX\,X^{-1}\big) \\ &= -D^T\big(X^{-1}\otimes X^{-1}\big)\,{\rm vec}(dX) \\ &= -D^T\big(X\otimes X\big)^{-1}D\,dx \\ \frac{\partial g}{\partial x} &= -D^T\big(X\otimes X\big)^{-1}D \;\doteq\; H \qquad&\big({\rm half\,vec\,Hessian}\big)\\ }$$ In this form it is obvious that $H^T=H,\,$ as it should.

Then an iteration step would look like $$\eqalign{ x_{k+1} &= x_k - \lambda_k g_k \qquad&\big({\rm gradient\,step}\big)\\ x_{k+1} &= x_k - \lambda_k H_k^{-1} g_k \qquad&\big({\rm Newton\,step}\big) \\ }$$ Once the solution vector $x_\infty$ is calculated, it can be put in matrix form $$X_\infty = {\rm unvech}(x_\infty)$$ If you insist on carrying out iterations in the symmetric matrix space, the Newton step is hopeless but the gradient step can be salvaged by applying the ${\rm unvech}$ operation to each term $$\eqalign{ X_{k+1} &= X_k - \lambda_k\big(2X_k^{-1}-I\circ X_k^{-1}\big) \\\\ }$$ Although nothing can be done for the Newton step in the matrix space, there is an important simplification in the half-vec space $$\eqalign{ H_k^{-1} &= -D^+(X_k\otimes X_k){D^+}^{T} \\ }$$ where $D^+$ denotes the pseudoinverse of the duplication matrix, which is constant for all steps. So there is no need to calculate matrix inverses at each step, beyond the one required to calculate $g$ itself.
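Both claims are easy to verify numerically. The sketch below (mine, reusing the helpers above) checks that the half-vec gradient step, mapped back with ${\rm unvech}$, reproduces the matrix-space step, and that the inverse Hessian has the stated closed form:

```python
import numpy as np   # vech, unvech, duplication_matrix as in the previous sketches

n = 4
rng = np.random.default_rng(3)
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)
x = vech(X)
D = duplication_matrix(n)
vec = lambda M: M.flatten(order='F')

Xi = np.linalg.inv(X)
g = D.T @ vec(Xi)
H = -D.T @ np.linalg.inv(np.kron(X, X)) @ D

# 1) the half-vec gradient step, mapped back with unvech, equals the matrix-space step
lam = 0.1
print(np.allclose(unvech(x - lam * g),
                  X - lam * (2 * Xi - np.eye(n) * Xi)))              # True

# 2) closed form for the inverse Hessian via the pseudoinverse of D
Dp = np.linalg.pinv(D)
print(np.allclose(np.linalg.inv(H), -Dp @ np.kron(X, X) @ Dp.T))     # True
```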

In fact, you don't need to calculate a pseudoinverse either, since $D^+$ is equal to $D^T$ but with its rows normalized so as to make them stochastic, i.e. to sum to ${\tt1}$.
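This, too, is easy to confirm numerically (my own sketch, with `duplication_matrix` as above):

```python
import numpy as np   # duplication_matrix as in the sketch above

D = duplication_matrix(4)
Dp = D.T / D.T.sum(axis=1, keepdims=True)       # D^T with each row scaled to sum to 1
print(np.allclose(Dp, np.linalg.pinv(D)))       # True
```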


NB:   In the steps above, a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B = {\rm Tr}(A^TB) = {\rm Tr}(AB^T) }$$ Also, several steps take advantage of the fact that $X$ is symmetric.

Update

There are some functions which losslessly shuffle and/or reshape the elements of a matrix without changing their values.

Examples include:   vec, transpose, blockvec, vecpose, and their inverses.

Let $S(X)$ denote one of these shuffling functions. $\,S$ exhibits some interesting properties with respect to addition, subtraction, and Hadamard / Frobenius multiplication. $$\eqalign{ S(X_1) \pm S(X_2) &= S(X_1 \pm X_2) \\ S(X_1) \circ S(X_2) &= S(X_1 \circ X_2) \\ S(X_1):S(X_2) &= X_1:X_2 \\ }$$ In particular, the subtraction property means that $$\eqalign{ dS(X) &= S(X+dX) - S(X) \;=\; S(dX) \\ }$$ These properties were used implicitly in some of the steps of the derivation above, specifically for the vec/unvec functions.
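These properties are easy to confirm for the vec function in particular (a small sketch of my own, again assuming a column-stacking vec):

```python
import numpy as np

rng = np.random.default_rng(4)
X1, X2 = rng.standard_normal((2, 3, 3))
vec = lambda M: M.flatten(order='F')            # column-stacking vec

print(np.allclose(vec(X1) + vec(X2), vec(X1 + X2)))          # addition property
print(np.allclose(vec(X1) * vec(X2), vec(X1 * X2)))          # Hadamard property
print(np.allclose(vec(X1) @ vec(X2), np.trace(X1.T @ X2)))   # Frobenius property
```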
greg
  • Thank you @greg for your exhaustive answer. Just to grasp the concept: I see that here you took the unconstrained gradient $X^{-1}$ that appears inside the first-order differential, and not the constrained one. My first question is: why do you take the unconstrained one? And second question: are you perturbing the first-order differential or the gradient? It would be really nice if, for the sake of completeness and formalism, you could also write the variables and the perturbation on the left-hand side of each expression, so it is clear what is a constant and what is a variable. Thanks – yes Jul 30 '20 at 09:23
  • First answer: banish the concept of constrained gradient. What you call an unconstrained gradient is the only idea that is well-defined (as evidenced by all the questions on this site). For the second answer consider the following: $$df = G:dX \qquad\iff\qquad G = \left(\frac{\partial f}{\partial X}\right)$$ The information content of the LHS is identical to that of the RHS. If you know the differential, you know the gradient. The two concepts should be fused together in your head as two ways of expressing the same idea. – greg Jul 30 '20 at 13:42
  • For the second answer, yes, the information is the same. I meant whether the perturbation has to be applied to the gradient $G$ or to the differential (which is a function of the first perturbation). I think that your first line to compute the Hessian matrix should be $d g=D^{T} \operatorname{vec}\left((X +d X)^{-1}\right)$. For this reason I would appreciate some extra formalism in the steps – yes Jul 30 '20 at 13:51
  • The duplication, elimination, commutation, and identity matrices are well-known constant matrices. Anything which depends on $x$ can vary. This includes $\phi,\,g,\,H,\,G,$ and of course $X$. – greg Jul 30 '20 at 13:51
  • Well, the gradient is evaluated at a given point $X$, so the differential is only a function of the perturbation. – yes Jul 30 '20 at 13:53
  • Where did you explicitly use the shuffling functions mentioned in UPDATE 2? And another question: I don't get the step $\operatorname{vec}\left(X^{-1}\right): D\, d x = D^{T} \operatorname{vec}\left(X^{-1}\right): d x$, because $DD'$ and $D'D$ don't give $I$. While $D$ is the duplication matrix, what is $D'$ called? – yes Jul 31 '20 at 09:05
  • It follows from the cyclic property of the trace and the definition of the Frobenius product that $$A:BC = B^TA:C$$ vec is a shuffling function whose Frobenius product property was used two steps above the one you're asking about, and whose differential property was invoked two steps below. – greg Jul 31 '20 at 13:37
  • I'm working out on the equations. I will let you know soon, whether I have further questions related to what you wrote. – yes Aug 03 '20 at 08:14
  • I have a few more questions/remarks. I opened here a chat room, in order to continue possible conversations and avoid infinite comments: https://chat.stackexchange.com/rooms/111437/function-of-matrix-variable – yes Aug 05 '20 at 08:55
  • I think it should be possible to express the Newton step in a matrix form, but with a tensor-matrix product for the Hessian x gradient multiplication. Do you have a clue on how to convert the Hessian in the half-vec space into a 4th order tensor? – yes Aug 10 '20 at 08:25
  • Hi @greg, do you have some references for the formalisms you used during the computation of the differentials? I got the concept, but I would like to understand them more formally. – yes Oct 28 '20 at 11:35
  • Here is a nice primer on differentials. If you are more interested in applications to matrices check out Chapter 13 of Matrix Differential Calculus by Magnus and Neudecker. – greg Oct 28 '20 at 13:00
  • I have a similar question here, this time involving symmetric hollow matrices: https://math.stackexchange.com/questions/3909792/vectorization-of-a-symmetric-hollow-matrix . If anyone has some good references, please let me know. Thanks – yes Nov 16 '20 at 15:44
  • Do you happen to know how the operation $D^\top C D$, for a symmetric positive definite $C$ shapes its eigenvalues? I am searching but the only info I could find is only the info about the full column rank of the matrix $D$ – yes Feb 25 '21 at 17:51
  • From that brief description, the eigenvalues of $C$ and $D^TCD$ probably satisfy the law of inertia. – greg Feb 25 '21 at 19:26
  • The law of inertia is about sharing the same number of positive and negative eigenvalues, so for a positive definite matrix it just means that they are again non-negative, but this is clear already from the expression of the matrix itself. What I need is to bound the eigenvalues of $D^\top C D$ knowing that $D$ is the duplication matrix and $C$ is a PSD matrix. I was starting from $\|D^\top C D\| \leq \|D^\top\|\,\|C\|\,\|D\| \leq \|D^\top\|_F\,\|C\|_F\,\|D\|_F$. But I am hoping to find a better upper bound. – yes Feb 26 '21 at 10:01
  • I opened a new question, so that the discussion could continue there: https://math.stackexchange.com/questions/4040640/bound-eigenvalues-of-d-top-c-d-for-c-positive-semidefinite – yes Feb 26 '21 at 10:28