Deriving the hessian of a composition of vector-valued functions

Question

I am running in circles trying to derive the formula for the hessian of a composition of (real-valued) vector functions, and can't figure out where I'm going wrong.

Suppose we have (smooth-enough) functions $f:\mathbb{R}^n\longrightarrow\mathbb{R}^m$, $g:\mathbb{R}^m\longrightarrow\mathbb{R}^p$, and their composition $h=g\circ f:\mathbb{R}^n\longrightarrow\mathbb{R}^p$.

Here is what I'm working with:

I am very confident that the Jacobians $\nabla_x f(x)$ is an $(m\times n)$ matrix and $\nabla_u g(f(x))$ is a $(p\times m)$ matrix, given these are basically by definition the linearization of the resp. functions at the given points $x, f(x)=u$.
I am fairly confident that the Hessian $\nabla^2 h$ of a given vector-valued function $h:\mathbb{R}^a\rightarrow\mathbb{R}^b$ is a $(b\times a\times a)$ 3-tensor (I'm not quite 100% confident on the actual dimension ordering but from what I've found online and what I can convince myself of, this makes the most sense)
I am very confident in the vector-valued chain rule: $\nabla_x(g\circ f)=\nabla_u g(f(x))\cdot \nabla_x f(x)$, where here $\cdot$ in the RHS is standard matrix multiplication giving $[p\times m]\cdot[m\times n]\mapsto[p\times n]$
I am confident that the second derivative of a composition (of real-valued, single-variable functions, for simplicity) is $\frac{d^2}{dx^2}g(f(x))=g''(f(x))f'(x)^2+g'(f(x))f''(x)$ since you have to apply the product rule the second time through and the chain rule again applies to the first term of the product rule.

Now, if I try to (albeit naively) apply all these things together (and this is where it gets dicey), I would expect the result to look something like $$ \nabla^2_x h(x)=\nabla_x^2[g(f(x))] = \nabla^2_ug(f(x))\cdot\nabla_xf(x)^2+\nabla_ug(f(x))\cdot\nabla^2_xf(x). \ \ \ (1) $$

This doesn't directly translate though because $\nabla_xf(x)$ isn't square, so it can't be squared (pun intended). I am guessing there is some transposition magic happening during the product rule?

At any rate, if I try to work backward however, knowing that $\nabla_xh$ needs to be a $(p\times n\times n)$ and writing out what the individual sizes need to be, I get

$$ [p\times m\times m]([m\times n]?[m\times n]?) + [p\times m][m\times n\times n] $$

The second term works out nicely just by summing over the innermost $m-$dim resulting in a $(p\times n\times n)$ 3-tensor as expected, but I cannot see how any transposition of the first set of terms can result in a $(p\times n\times n)$ tensor. Unless of course we multiply out each of the $\nabla_xf$ factors against one of the $m$ dimensions in the $\nabla_u^2g$ tensor. Which I guess kind of makes sense since $\nabla_u^2g$ should be basically $p$ "layers" of bilinear forms so that its like $\nabla_xf^T[\nabla_u^2g]_i\nabla_xf$, but that all seems very hand-wavy to me, esp since we don't have to apply the same logic to the second term (possibly this comes back to the product rule?)

Anyway... tldr: what is the correct (with correct justification "something something product rule..."?) method of evaluating this tensor product (the first term in the rhs of "equation" $(1)$)? Is my description of multiplying out the second term accurate? What would be good/sensible/correct notation to use for the first term since the $\nabla^2_ug(f(x))\cdot\nabla_xf(x)^2$ shown seems like a bad abuse of notation?

Forget matrices and all these adhoc layouts. Working straight with linear and bilinear maps makes everything clear; see the first part of my answer here for the second derivative of a composition, i.e your formula (1) is essentially right (once you somehow make sense of the various layout issues… which won’t even arise if you stick to the clean and efficient multilinear algebra formulation of things). — peek-a-boo, Jan 04 '24 at 06:22
@peek-a-boo That is a very elegant and helpful explanation, thank you! Alas, my actual use case is writing code to evaluate the hessian of an arbitrary number of composed functions (deep nn, specifically) so I do need to live in tensor layout land, but your explanation very much helps me wrap my head around how/supposed to perform the tensor mults, so thank you again! — charelstoncrabb, Jan 04 '24 at 21:14

Deriving the hessian of a composition of vector-valued functions

0 Answers0