
I wish to compute the derivative and the Hessian of the function ${\bf f}({\bf X})$, where

$$ {\bf f}({\bf X}) = {\bf X} \, {\bf a}, $$

${\bf X}$ is an $(m \times n)$ matrix and ${\bf a}$ is a vector of constants of size $n$. From "Matrix Differential Calculus" by Magnus and Neudecker, it is relatively straightforward to obtain the derivative and the Hessian of ${\bf f}({\bf X})$ w.r.t. ${\bf X}$. For example, the first differential is

\begin{align} \partial {\bf f}({\bf X}) &= (\partial {\bf X}) {\bf a} = {\rm vec} \big( (\partial {\bf X}) {\bf a} \big) = ({\bf a}^\prime \otimes {\bf I}_m) \,\partial {\rm vec} {\bf X} \end{align}
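As a quick numerical sanity check of this identity, here is a minimal NumPy sketch. The sizes, the random seed, and the helper `vec` (column-major stacking, which is the convention that matches ${\bf a}^\prime \otimes {\bf I}_m$) are illustrative assumptions, not part of the original problem.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                          # assumed sizes, for illustration only
a = rng.standard_normal(n)           # the constant vector a
dX = rng.standard_normal((m, n))     # an arbitrary perturbation of X


def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")


# Jacobian of f(X) = X a with respect to vec(X): (a' kron I_m), shape (m, m*n)
J = np.kron(a.reshape(1, -1), np.eye(m))

lhs = dX @ a          # (dX) a
rhs = J @ vec(dX)     # (a' kron I_m) vec(dX)
print(np.allclose(lhs, rhs))
```

Note that `order="F"` matters here: NumPy's default row-major flattening would correspond to a different Kronecker ordering.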

In my case, however, the matrix ${\bf X}$ has a special structure. Specifically, I know that ${\bf X}^\prime {\bf X}$ is a diagonal matrix, say, ${\bf X}^\prime {\bf X} = {\rm diag}(d_1, \ldots, d_n)$, which is not necessarily equal to the identity matrix. In other words, I know that the columns of ${\bf X}$ are mutually orthogonal, but each column vector can be of any length. How do I compute the derivative and the Hessian of ${\bf f}({\bf X})$ w.r.t. ${\bf X}$ to take this structure into account?

Any references on how to proceed would be greatly appreciated.

I could not find a similar question posted before. The closest I found was "Hessian of $f(X)$ when $X$ is a symmetric matrix", but I was not able to apply it to my problem.

emakalic

1 Answer


$\def\p{\partial}$ Use a semi-orthogonal matrix $Q\in{\mathbb R}^{m\times n}$ (which requires $m\ge n$) and a vector $y\in{\mathbb R}^{n}$ to construct $X$ with the desired structure $$\eqalign{ I &= Q^TQ \\ Y &= {\rm Diag}(y) \\ X &= QY &\implies X^TX = YQ^TQY = Y^2 \\ }$$ and define the diagonal matrix $\;A = {\rm Diag}(a)$.
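The construction can be checked numerically with a minimal NumPy sketch. Drawing $Q$ from the reduced QR factorization of a random matrix is just one convenient way to get orthonormal columns; the sizes and seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3                          # assumed sizes with m >= n

# Reduced QR of a random m x n matrix yields Q (m x n) with orthonormal columns.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
y = rng.standard_normal(n)           # arbitrary column scalings
X = Q @ np.diag(y)                   # X = Q Y

gram = X.T @ X                       # should equal Y^2 = Diag(y**2)
print(np.allclose(Q.T @ Q, np.eye(n)))
print(np.allclose(gram, np.diag(y**2)))
```

The entries of `y` may be negative; only their squares appear on the diagonal of $X^TX$.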

Write the function in terms of these new variables and calculate its gradient, using the fact that $Ya = Ay$ (both are the elementwise product of $y$ and $a$) $$\eqalign{ f &= Xa = QYa = QAy\\ \p f &= QA\,\p y \\ \frac{\p f}{\p y} &= QA \\ }$$ This can be interpreted as a gradient wrt ${\rm vec}(X)$, for fixed $Q$, by using $$\eqalign{ \p{\rm vec}(X) &= (I\otimes Q)\,{\rm vec}\Big({\rm Diag}(\p y)\Big) \\ }$$ However, since $X$ is structurally constrained, its gradient is tricky to interpret and has rather limited utility (e.g. it will destroy the diagonal structure of $X^TX$ if it is used in a gradient descent algorithm).
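Both relations above can be verified with a short NumPy sketch. Because $f$ is linear in $y$, the claimed Jacobian $QA$ should reproduce finite increments exactly (up to rounding). The helper names `f` and `vec`, the sizes, and the seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3                                   # assumed sizes with m >= n
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
a = rng.standard_normal(n)
y = rng.standard_normal(n)
A = np.diag(a)


def f(y_vec):
    """f = X a with X = Q Diag(y)."""
    return (Q @ np.diag(y_vec)) @ a


def vec(M):
    """Column-major vectorization."""
    return M.reshape(-1, order="F")


J = Q @ A                                     # claimed Jacobian df/dy, shape (m, n)

# Linearity in y: f(y + dy) - f(y) should equal J @ dy.
dy = rng.standard_normal(n)
jacobian_ok = np.allclose(f(y + dy) - f(y), J @ dy)

# vec(Q Diag(dy)) should equal (I_n kron Q) vec(Diag(dy)).
dY = np.diag(dy)
vec_ok = np.allclose(vec(Q @ dY), np.kron(np.eye(n), Q) @ vec(dY))

print(jacobian_ok, vec_ok)
```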

You might also consider the gradient wrt $Q$, but this matrix is constrained via orthogonality, so its utility is similarly limited. However, you could introduce an unconstrained matrix $U\in{\mathbb R}^{m\times n}$ (with full column rank, so that $U^TU$ is invertible) and use it to construct $$Q = U\left(U^TU\right)^{-1/2} \;\implies\; Q^TQ=I$$ from which the gradient wrt $U$ can be obtained.
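A minimal NumPy sketch of this last construction follows. Computing the inverse square root through a symmetric eigendecomposition is one possible choice, not the only one, and it assumes $U$ has full column rank. The function name `orthonormalize` is hypothetical and introduced only for this example.

```python
import numpy as np


def orthonormalize(U):
    """Return Q = U (U^T U)^{-1/2}, assuming U has full column rank."""
    # U^T U is symmetric positive definite, so eigh applies and w > 0.
    w, V = np.linalg.eigh(U.T @ U)
    # V Diag(w^{-1/2}) V^T: broadcasting scales column j of V by w[j]**-0.5.
    inv_sqrt = (V * w**-0.5) @ V.T
    return U @ inv_sqrt


rng = np.random.default_rng(3)
m, n = 6, 3                              # assumed sizes with m >= n
U = rng.standard_normal((m, n))          # unconstrained parameter matrix
Q = orthonormalize(U)

y = rng.standard_normal(n)
X = Q @ np.diag(y)                       # X inherits the diagonal Gram structure
print(np.allclose(Q.T @ Q, np.eye(n)))
print(np.allclose(X.T @ X, np.diag(y**2)))
```

With this parameterization, $U$ and $y$ can be updated freely, and $X = U\left(U^TU\right)^{-1/2}{\rm Diag}(y)$ keeps $X^TX$ diagonal by construction.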

greg