
To the experts in linear algebra out there: my background in linear algebra emphasized computing matrix derivatives via the Fréchet derivative. I have posted a question about this before: Making sense of matrix derivative formula for determinant of symmetric matrix as a Fréchet derivative?

And others have also observed: Understanding notation of derivatives of a matrix

that the formulae for derivatives in The Matrix Cookbook involve some implicit trickery.

So my question is: can I derive the Matrix Cookbook formulae in general by following the calculation of a Fréchet derivative, or are those formulae not reconcilable? I know that the ML, statistics, and EE communities use the Cookbook a lot, so I want to make sure I am not missing out on a valuable resource. I also invite the purists to weigh in.

me10240

2 Answers


There is a serious flaw in the formulae reported in The Matrix Cookbook for the gradient with respect to symmetric matrices (Section 2.8 of the Cookbook, on structured matrices).
The arXiv paper cited in another answer has now been published at https://doi.org/10.1007/s13226-022-00313-x (SharedIt link: https://rdcu.be/cUwmr), and reading it will give the full picture.

I am the first author, and I wanted to put up an answer that will hopefully educate others who come here with the same questions I had. In my opinion, the whole confusion arises because the idea of constrained optimization has somehow got mixed up with something I can only call a constrained derivative. I use derivative and gradient interchangeably here, but that does not change the argument.

Let's consider an example. What is the derivative of $f(x, y) = x^2 + 2y^2$ given that the domain is the set of points lying on the unit circle $C$: $x^2 + y^2 = 1$, i.e., we require the derivative of $f(\mathbf{x})|_C$? If instead of asking for the derivative I had asked a maxima/minima question, then we would talk of incorporating the unit-circle constraint through a Lagrange multiplier, etc. This is similar to all the talk of "off-diagonal elements are not independent", "a symmetric matrix satisfies a constraint", etc.

However, in the context of taking a derivative, the definition needs $f(\mathbf{x} + \mathbf{h}) - f(\mathbf{x})$ to make sense, i.e., both $\mathbf{x}$ and $\mathbf{x}+\mathbf{h}$ must belong to the domain of definition; here that is the unit circle itself. Clearly, with $\mathbf{x} = (x, y)$ and $\mathbf{h} = (h, k)$, that fails, and the derivative definition has no meaning. However, if we parametrize the unit-circle constraint as $x = \sin(\theta)$, $y = \cos(\theta)$, the argument is now $\theta$, and any perturbation $\theta + \delta \theta$ also maps to a point on the unit circle. The derivative restricted to $C$ can then be evaluated from $f(\mathbf{x})|_C = g(\theta) = \sin^2(\theta) + 2\cos^2(\theta)$. The point I am making is that special procedures are necessary only when the domain is not a subspace.
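As a quick sanity check of this parametrized-derivative idea (my own sketch, not from the thread), one can compare the analytic derivative of $g(\theta)$ against a finite difference:

```python
import numpy as np

# f(x, y) = x^2 + 2y^2 restricted to the unit circle via
# x = sin(theta), y = cos(theta), giving g(theta) = sin^2 + 2 cos^2.
def g(theta):
    x, y = np.sin(theta), np.cos(theta)
    return x**2 + 2 * y**2

# Analytic derivative: g'(theta) = 2 sin cos - 4 cos sin = -sin(2 theta).
def g_prime(theta):
    return -np.sin(2 * theta)

theta = 0.7
t = 1e-6
fd = (g(theta + t) - g(theta - t)) / (2 * t)  # central difference
print(fd, g_prime(theta))
```

The perturbation $\theta \pm t$ never leaves the circle, which is exactly why the ordinary difference quotient is meaningful here.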

Now consider matrix functionals $\phi(A)$ for $A$ symmetric, diagonal, upper triangular, etc., and the derivative of $\phi$. The analogy with the unit-circle constraint should now be clear. The Matrix Cookbook and so-called matrix calculus attempt special procedures for all these cases, even though no special procedures are necessary: in all these cases, both $A$ and $A+H$ belong to the domain of definition.
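To illustrate (a sketch with a functional of my choosing, $\phi(A) = \operatorname{tr}(A^2)$, not one from the paper): since symmetric matrices form a subspace, $A + tH$ stays symmetric for symmetric $H$, and the plain Fréchet derivative $D\phi(A)[H] = 2\operatorname{tr}(AH)$ checks out numerically with no special procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sym(M):
    """Symmetrize a matrix."""
    return (M + M.T) / 2

def phi(A):
    return np.trace(A @ A)

A = sym(rng.standard_normal((4, 4)))
H = sym(rng.standard_normal((4, 4)))

# Symmetric matrices form a subspace: A + tH is symmetric for all t,
# so the ordinary Frechet derivative applies directly.
t = 1e-6
fd = (phi(A + t * H) - phi(A - t * H)) / (2 * t)
frechet = 2 * np.trace(A @ H)  # D phi(A)[H] = 2 tr(AH)
print(fd, frechet)
```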

The exception is when $A$ is orthogonal. In that case $A+H$ need not be orthogonal, and the set of orthogonal matrices needs to be parametrized in some way to construct the restriction of $\phi$ to it, after which the derivative can be evaluated.
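As a sketch of such a parametrization (my own illustration; the functional $\phi(X) = \operatorname{tr}(XB)$ and the use of the Cayley transform are assumptions for the demo, not the paper's construction): one can move along the orthogonal group via $t \mapsto A\,Q(t)$, where $Q(t)$ is the Cayley transform of a skew-symmetric direction $S$, so that $Q(t)$ is orthogonal and $Q'(0) = S$; the derivative along the group is then $\operatorname{tr}(ASB)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
I = np.eye(n)

A, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a point on the orthogonal group
S = rng.standard_normal((n, n))
S = (S - S.T) / 2                                 # skew-symmetric tangent direction
B = rng.standard_normal((n, n))                   # phi(X) = tr(XB), a sample functional

def cayley(t):
    # Q(t) = (I - tS/2)^{-1} (I + tS/2) is orthogonal for skew S, with Q'(0) = S.
    return np.linalg.solve(I - t * S / 2, I + t * S / 2)

def phi(X):
    return np.trace(X @ B)

# Derivative along the group: d/dt phi(A Q(t)) at t = 0 equals tr(A S B).
t = 1e-6
fd = (phi(A @ cayley(t)) - phi(A @ cayley(-t))) / (2 * t)
print(fd, np.trace(A @ S @ B))
```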

In the paper linked above we show that the end result obtained after the special procedures is incorrect for symmetric matrices.
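A numerical sketch of the discrepancy (my own demo; I am assuming the Cookbook's symmetric-matrix formula has the commonly quoted form $\det(A)\,(2A^{-1} - A^{-1} \circ I)$): pairing that "gradient" with a symmetric direction $H$ via the Frobenius inner product does not reproduce the actual directional derivative of $\det$, which is simply $\det(A)\operatorname{tr}(A^{-1}H)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4

def sym(M):
    return (M + M.T) / 2

A = sym(rng.standard_normal((n, n))) + n * np.eye(n)  # symmetric, well-conditioned
H = sym(rng.standard_normal((n, n)))                  # symmetric perturbation
Ainv = np.linalg.inv(A)
detA = np.linalg.det(A)

# Plain directional (Frechet) derivative of det at symmetric A along symmetric H:
t = 1e-6
fd = (np.linalg.det(A + t * H) - np.linalg.det(A - t * H)) / (2 * t)
frechet = detA * np.trace(Ainv @ H)   # matches the finite difference

# The "symmetric" formula det(A) (2 A^{-1} - A^{-1} o I), paired with H
# through the Frobenius inner product, gives a different number:
G = detA * (2 * Ainv - Ainv * np.eye(n))
cookbook = np.sum(G * H)
print(fd, frechet, cookbook)
```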

me10240

my background in linear algebra emphasized doing matrix derivatives by computing the Frechet derivative.

can I derive matrix cookbook formulae in general by following the calculation of a Frechet derivative, or are those formulae not reconcilable?

Given a map between two Banach spaces, one calculates its Fréchet derivative, a linear map, at a given point. Then, in the special case of a functional on an inner-product space, one can introduce the gradient vector. An ongoing confusion, on this site too, arises from mixing up the derivative and the gradient. The book you mention seems to suffer from it as well. For example, when they write $\frac{\partial a^T x}{\partial x} = a$, they no longer mean the derivative $D(a^T x) = a^T$ but rather the gradient vector coming from the dot product. I would guess that converting your honest derivative to the gradient should lead to the reconciliation.
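The derivative/gradient distinction for this very example can be checked numerically (a sketch of my own, using $f(x) = a^T x$ from the answer): the Fréchet derivative is the linear map $h \mapsto a^T h$, while the gradient is the vector $a$ that represents this map through the dot product.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
a = rng.standard_normal(n)
x = rng.standard_normal(n)
h = rng.standard_normal(n)

# f(x) = a^T x. The Frechet derivative at any x is the linear functional
# h -> a^T h; the gradient is the vector a representing it via the dot
# product: Df(x)[h] = <grad f(x), h>.
f = lambda v: a @ v
Df_h = a @ h            # derivative applied to the direction h
grad = a                # gradient vector

t = 1e-6
fd = (f(x + t * h) - f(x - t * h)) / (2 * t)
print(fd, Df_h, grad @ h)
```

Since $f$ is linear, the finite difference agrees with $a^T h$ up to roundoff, for any base point $x$.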

rych