On an application of the chain rule

Question

Define a sequence $(\mathbf{y}_k)_{k=0}^N$ in $\mathbb{R}^n$ such that: $$\mathbf{y}_{k+1} = \mathbf{y}_{k} + \lambda \nabla_\mathbf{y} E(\mathbf{y}_k,\mathbf{w}), \quad k=0,1,\ldots,N-1,$$ where $\lambda$ is a constant, $\mathbf{w}\in\mathbb{R}^m$, and $E:\mathbb{R}^{n+m}\to \mathbb{R}$ is some differentiable function.

Let $Q:\mathbb{R}^{n}\to \mathbb{R}$ be a differentiable function and $L=Q(\mathbf{y}_N)$.

Applying the chain rule we have: $$\frac{dL}{d\mathbf{w}} = \sum_{k=1}^N\frac{\partial \mathbf{y}_k^\top}{\partial \mathbf{w}} \frac{dQ}{d\mathbf{y}_k}\qquad (1)$$ and $$\frac{dQ}{d\mathbf{y}_k} = \frac{\partial \mathbf{y}_{k+1}^\top}{\partial \mathbf{y}_{k}} \frac{dQ}{d\mathbf{y}_{k+1}}.\qquad (2)$$

(Source: this paper, equation (12))

My questions:

1. How is $(1)$ obtained? Shouldn't it be $$\frac{dL}{d\mathbf{w}} = \sum_{k=1}^N \frac{d\mathbf{y}_k^\top}{d\mathbf{w}} \frac{\partial Q}{\partial\mathbf{y}_k}?$$ The operators $\partial$ and $d$ seem to be reversed.
2. I'm likewise confused by the use of $\partial$ and $d$ in $(2)$. Could you please explain it?

Thank you very much in advance for your help!

score 0 · Answer 1 · answered Jul 09 '17 at 05:59

Personally, I think they should all be $\partial$, not $d$.

For example, matrix calculus and tensor calculus usually just use $\partial$, I think.

At least in the two equations you show, Domke uses $d$ for derivatives of scalar-valued functions with respect to a vector and $\partial$ otherwise. This sort of makes sense, since you can treat such a function like a function of a single variable (one that happens to be a vector). However, this pattern is broken in other sections of the paper.

On the other hand, the form you suggest is more like the classic chain rule (e.g. here). However, you can also write the chain rule as all $\partial$ (e.g. here).
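
Written out for the recursion in the question, the chain rule looks as follows. This is only a sketch: the symbols $f$, $A_k$, $B_k$ and $\mathbf{g}_k$ are my own shorthand rather than the paper's, and I am assuming that $\mathbf{y}_0$ does not depend on $\mathbf{w}$ and that $E$ is twice differentiable so that the Jacobians exist. Let $f(\mathbf{y},\mathbf{w}) = \mathbf{y} + \lambda\nabla_\mathbf{y}E(\mathbf{y},\mathbf{w})$, so that $\mathbf{y}_{k+1} = f(\mathbf{y}_k,\mathbf{w})$, and let $A_k = \partial f/\partial\mathbf{y}$ and $B_k = \partial f/\partial\mathbf{w}$, both evaluated at $(\mathbf{y}_k,\mathbf{w})$. Then

$$\begin{aligned} \frac{d\mathbf{y}_{k+1}}{d\mathbf{w}} &= B_k + A_k\frac{d\mathbf{y}_k}{d\mathbf{w}}, \qquad \frac{d\mathbf{y}_0}{d\mathbf{w}} = 0,\\ \frac{d\mathbf{y}_N}{d\mathbf{w}} &= \sum_{k=1}^N A_{N-1}A_{N-2}\cdots A_k\,B_{k-1},\\ \frac{dL}{d\mathbf{w}} &= \left(\frac{d\mathbf{y}_N}{d\mathbf{w}}\right)^{\!\top}\nabla Q(\mathbf{y}_N) = \sum_{k=1}^N B_{k-1}^\top\,\mathbf{g}_k, \qquad \mathbf{g}_N = \nabla Q(\mathbf{y}_N),\quad \mathbf{g}_k = A_k^\top\mathbf{g}_{k+1}. \end{aligned}$$

(The product $A_{N-1}\cdots A_k$ is the identity for $k=N$.) This has the shape of $(1)$ and $(2)$ if $\partial\mathbf{y}_k/\partial\mathbf{w}$ is read as $B_{k-1}$, $\partial\mathbf{y}_{k+1}/\partial\mathbf{y}_k$ as $A_k$, and $dQ/d\mathbf{y}_k$ as $\mathbf{g}_k$.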

But to be honest, they are interchangeable here I think. It only really matters when you need to differentiate between many different types of derivatives (e.g. partial, total, covariant, Lie, exterior, ...) [1,2,3]. There is a notion of total vs partial derivative (e.g. here or here), but it doesn't seem to be what's meant here.
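
If you want to convince yourself numerically, here is a minimal scalar sketch ($n=m=1$) that assembles $dL/d\mathbf{w}$ from $(1)$ and $(2)$ and compares it with a forward-propagated derivative and a central finite difference. The choices $E(y,w)=w\sin y+\tfrac12 y^2$ and $Q(y)=y^2$, the constants, and all function names are mine, purely for illustration; $y_0$ is assumed not to depend on $w$. In the code, the $\partial$ terms are taken with the previous iterate held fixed.

```python
import math

# Hypothetical scalar example (n = m = 1); E and Q are illustrative choices,
# not taken from the paper: E(y, w) = w*sin(y) + 0.5*y**2, Q(y) = y**2.
LAM = 0.1   # step size lambda
N = 5       # number of iterations
Y0 = 0.3    # initial point, assumed independent of w


def grad_E(y, w):
    """dE/dy for E(y, w) = w*sin(y) + 0.5*y**2."""
    return w * math.cos(y) + y


def forward(w):
    """Return [y_0, ..., y_N] for y_{k+1} = y_k + LAM * dE/dy(y_k, w)."""
    ys = [Y0]
    for _ in range(N):
        ys.append(ys[-1] + LAM * grad_E(ys[-1], w))
    return ys


def loss(w):
    """L = Q(y_N) with Q(y) = y**2."""
    return forward(w)[-1] ** 2


def grad_via_1_and_2(w):
    """dL/dw assembled from equations (1) and (2)."""
    ys = forward(w)
    g = 2.0 * ys[N]  # dQ/dy_N = Q'(y_N)
    total = 0.0
    for k in range(N, 0, -1):
        # Term k of (1): (dy_k/dw with y_{k-1} held fixed) * dQ/dy_k
        total += LAM * math.cos(ys[k - 1]) * g
        # (2): dQ/dy_{k-1} = (dy_k/dy_{k-1}) * dQ/dy_k
        g *= 1.0 + LAM * (1.0 - w * math.sin(ys[k - 1]))
    return total


def grad_forward_mode(w):
    """dL/dw = (dy_N/dw) * Q'(y_N), propagating dy_k/dw forward in k."""
    y, dy = Y0, 0.0
    for _ in range(N):
        # dy_{k+1}/dw = (dy_{k+1}/dy_k) * dy_k/dw + (direct dependence on w)
        dy = dy * (1.0 + LAM * (1.0 - w * math.sin(y))) + LAM * math.cos(y)
        y = y + LAM * grad_E(y, w)
    return dy * 2.0 * y


def grad_finite_diff(w, h=1e-6):
    """Central finite-difference approximation of dL/dw."""
    return (loss(w + h) - loss(w - h)) / (2.0 * h)


if __name__ == "__main__":
    w = 0.7
    print(grad_via_1_and_2(w), grad_forward_mode(w), grad_finite_diff(w))
```

The three numbers should agree up to finite-difference error. Note that `grad_forward_mode` is the form suggested in the question with only the $k=N$ term, since $Q$ depends directly on $\mathbf{y}_N$ alone.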
