Consider a function space $\mathcal{F}$, parameters $\theta \in \mathbb{R}^P$, and functions $F : \mathbb{R}^P \to \mathcal{F}$ and $C : \mathcal{F} \to \mathbb{R}$. The derivative of $C \circ F$ w.r.t. $\theta$ can then be written down by means of the chain rule (using Fréchet derivatives/differentials):
$$\mathrm{d}(C \circ F)(\theta) = \mathrm{d}C(f_\theta) \circ \mathrm{d}F(\theta)$$
where $f_\theta = F(\theta)$.
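To make this abstract setup concrete for myself (this toy example is my own, not from the paper), I keep in mind the finite-dimensional case $\mathcal{F} = \mathbb{R}^N$, where $F$ collects the outputs of a model $h_\theta$ on $N$ fixed inputs and $C$ is a squared-error cost against fixed targets $y$:
$$F(\theta) = \bigl(h_\theta(x_1), \dots, h_\theta(x_N)\bigr), \qquad C(f) = \tfrac{1}{2}\lVert f - y \rVert^2.$$
Here $\mathrm{d}F(\theta)$ is multiplication by the Jacobian of $F$ at $\theta$, $\mathrm{d}C(f)$ is the linear map $v \mapsto \langle f - y, v \rangle$, and the chain rule above reads
$$\mathrm{d}(C \circ F)(\theta)(u) = \mathrm{d}C\bigl(F(\theta)\bigr)\bigl(\mathrm{d}F(\theta)(u)\bigr) = \bigl\langle F(\theta) - y,\ \mathrm{d}F(\theta)(u) \bigr\rangle.$$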
Now, we want to optimise the parameters $\theta$ to minimise the function $C \circ F$. Using gradient descent for optimisation, the following relation should hold:
$$\mathrm{d}\theta_p(t)(1) = -\mathrm{d}(C \circ F)(\theta)(e_p),$$
with $\mathrm{d}\theta_p$ the differential of the $p$-th component of the parameter trajectory $t \mapsto \theta(t) \in \mathbb{R}^P$, and thus the change to the function $f_\theta$ is given by
$$\mathrm{d}f_\theta(t) = \mathrm{d}F(\theta_t) \circ \mathrm{d}\theta(t).$$
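In coordinates, the way I read these two displays is (and part of my question is whether this reading is even correct)
$$\dot\theta_p(t) = -\,\frac{\partial (C \circ F)}{\partial \theta_p}(\theta_t), \qquad \dot f_{\theta_t} = \mathrm{d}F(\theta_t)\bigl(\dot\theta(t)\bigr) = \sum_{p=1}^{P} \dot\theta_p(t)\,\mathrm{d}F(\theta_t)(e_p),$$
i.e. ordinary gradient flow on the parameters, pushed forward to function space by the linear map $\mathrm{d}F(\theta_t)$.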
This piece of mathematics comes from a paper that I roughly understand, and the general idea behind it is clear to me. However, there are some technicalities that cause me trouble:
- Is there any difference between differentials and derivatives, and if so, is there a way to express the differences and similarities? In this post it is stated that the derivative and the differential are different things, and Wikipedia says something along the same lines. However, the Fréchet derivative looks like a differential to my untrained eye, and on the page for the total derivative, derivative and differential are used interchangeably.
- Is there a way to express $\mathrm{d}\theta(t)$ directly, given the expressions for the partials $\mathrm{d}\theta_p(t)$? In an attempt to do it myself, I wrote something like $$\underbrace{\mathrm{d}\theta(t)}_{\mathbb{R} \to \mathbb{R}^P} = -\underbrace{\mathrm{d}(C \circ F) (\theta)}_{\mathbb{R}^P \to \mathbb{R}},$$ which obviously cannot work. I realised that I might need a gradient rather than a derivative (I had somehow always thought that gradients and derivatives were synonyms), but when using Wikipedia as a reference again, a gradient is only defined for functions $f : \mathbb{R}^N \to \mathbb{R}$. Although $C \circ F$ does have a gradient under this definition, it is unclear to me how the gradient would behave under the chain rule. Again, I would be inclined to write something like $$\underbrace{\mathop{\nabla}(C \circ F)(\theta)}_{\mathbb{R} \to \mathbb{R}^P} = \underbrace{\mathop{\nabla}C(f_\theta)}_{\mathbb{R} \to \mathcal{F}} \circ \underbrace{\mathop{\nabla}F(\theta)}_{\mathcal{F} \to \mathbb{R}^P},$$ but that probably makes no sense mathematically either; a small concrete example of what I am confused about follows below this list.
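To make the first bullet point concrete, here is the simplest example I can think of: for $f : \mathbb{R}^2 \to \mathbb{R}$, $f(x) = \lVert x \rVert^2$, I would write
$$\mathrm{d}f(x)(v) = 2\langle x, v \rangle \quad \text{(a linear map } \mathbb{R}^2 \to \mathbb{R}\text{)}, \qquad \nabla f(x) = 2x \quad \text{(a vector in } \mathbb{R}^2\text{)}.$$
Is this exactly the distinction the linked post is making, i.e. is the gradient just the vector that represents the differential via the inner product? If so, I do not see which object would play the role of $\nabla C(f_\theta)$ when $\mathcal{F}$ is an arbitrary function space.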
Any help or insights would be greatly appreciated.