I am running in circles trying to derive the formula for the hessian of a composition of (real-valued) vector functions, and can't figure out where I'm going wrong.
Suppose we have (smooth-enough) functions $f:\mathbb{R}^n\longrightarrow\mathbb{R}^m$, $g:\mathbb{R}^m\longrightarrow\mathbb{R}^p$, and their composition $h=g\circ f:\mathbb{R}^n\longrightarrow\mathbb{R}^p$.
Here is what I'm working with:
- I am very confident that the Jacobians $\nabla_x f(x)$ is an $(m\times n)$ matrix and $\nabla_u g(f(x))$ is a $(p\times m)$ matrix, given these are basically by definition the linearization of the resp. functions at the given points $x, f(x)=u$.
- I am fairly confident that the Hessian $\nabla^2 h$ of a given vector-valued function $h:\mathbb{R}^a\rightarrow\mathbb{R}^b$ is a $(b\times a\times a)$ 3-tensor (I'm not quite 100% confident on the actual dimension ordering but from what I've found online and what I can convince myself of, this makes the most sense)
- I am very confident in the vector-valued chain rule: $\nabla_x(g\circ f)=\nabla_u g(f(x))\cdot \nabla_x f(x)$, where here $\cdot$ in the RHS is standard matrix multiplication giving $[p\times m]\cdot[m\times n]\mapsto[p\times n]$
- I am confident that the second derivative of a composition (of real-valued, single-variable functions, for simplicity) is $\frac{d^2}{dx^2}g(f(x))=g''(f(x))f'(x)^2+g'(f(x))f''(x)$ since you have to apply the product rule the second time through and the chain rule again applies to the first term of the product rule.
Now, if I try to (albeit naively) apply all these things together (and this is where it gets dicey), I would expect the result to look something like $$ \nabla^2_x h(x)=\nabla_x^2[g(f(x))] = \nabla^2_ug(f(x))\cdot\nabla_xf(x)^2+\nabla_ug(f(x))\cdot\nabla^2_xf(x). \ \ \ (1) $$
This doesn't directly translate though because $\nabla_xf(x)$ isn't square, so it can't be squared (pun intended). I am guessing there is some transposition magic happening during the product rule?
At any rate, if I try to work backward however, knowing that $\nabla_xh$ needs to be a $(p\times n\times n)$ and writing out what the individual sizes need to be, I get
$$ [p\times m\times m]([m\times n]?[m\times n]?) + [p\times m][m\times n\times n] $$
The second term works out nicely just by summing over the innermost $m-$dim resulting in a $(p\times n\times n)$ 3-tensor as expected, but I cannot see how any transposition of the first set of terms can result in a $(p\times n\times n)$ tensor. Unless of course we multiply out each of the $\nabla_xf$ factors against one of the $m$ dimensions in the $\nabla_u^2g$ tensor. Which I guess kind of makes sense since $\nabla_u^2g$ should be basically $p$ "layers" of bilinear forms so that its like $\nabla_xf^T[\nabla_u^2g]_i\nabla_xf$, but that all seems very hand-wavy to me, esp since we don't have to apply the same logic to the second term (possibly this comes back to the product rule?)
Anyway... tldr: what is the correct (with correct justification "something something product rule..."?) method of evaluating this tensor product (the first term in the rhs of "equation" $(1)$)? Is my description of multiplying out the second term accurate? What would be good/sensible/correct notation to use for the first term since the $\nabla^2_ug(f(x))\cdot\nabla_xf(x)^2$ shown seems like a bad abuse of notation?