
I am trying to understand the chain rule applied to a series of transformations in the context of the backpropagation algorithm for deep learning. Let $x \in \mathbb{R}^K$ and let $A, B$ be real-valued matrices of size $K \times K$. Then consider a network defined as $$y = Ax$$ $$u = \sigma (y)$$ $$v = Bx$$ $$z = A (u * v)$$ $$w = Az$$ $$L = \|w\|^2$$

Here $L$ is regarded as a function of $x$, $A$, and $B$; $u * v$ denotes the element-wise product; and $\sigma(y)$ is the element-wise application of the sigmoid function to $y$. I want to be able to calculate $\frac{\partial L}{\partial A}$ and $\frac{\partial L}{\partial B}$.
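
For concreteness, here is a minimal NumPy sketch of the forward pass (the dimension $K=4$ and the random inputs are just illustrative):

```python
import numpy as np

def sigmoid(t):
    # element-wise sigmoid
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, A, B):
    # forward pass of the network described above
    y = A @ x          # y = Ax
    u = sigmoid(y)     # u = sigma(y)
    v = B @ x          # v = Bx
    z = A @ (u * v)    # z = A (u * v), element-wise product
    w = A @ z          # w = Az
    return w @ w       # L = ||w||^2

# illustrative example with hypothetical values
K = 4
rng = np.random.default_rng(0)
x = rng.normal(size=K)
A = rng.normal(size=(K, K))
B = rng.normal(size=(K, K))
print(forward(x, A, B))
```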

From what I understand, $\frac{\partial L}{\partial A} = \frac{\partial L}{\partial w} \frac{\partial w}{\partial A}$.

I'm not sure how to express $\frac{\partial w}{\partial A}$, since $z$ is itself a function of $A$. My guess would be something like $\frac{\partial w}{\partial A} = \frac{d}{dA} (Az) + A \frac{d}{dA} (z)$, but I am not sure whether this step should be an application of the product rule or the chain rule.

I'm also not sure how to express $\frac{\partial z}{\partial A}$. Any insights are appreciated.

3 Answers


The first thing to do is to draw the underlying computation graph correctly, and then apply the chain rule according to that graph.

The following is the chain rule that you should remember:

The derivative of the output with respect to a node can be computed from the derivatives of all its children as follows: $\newcommand{\dv}[1]{\operatorname{d}\!{#1}}$ \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\partial x_j}{\partial x_i} \frac{\dv{L}}{\dv{x_j}}. \end{equation}

Therefore, the chain rule applied to node $A$ gives $$\frac{\dv{L}}{\dv{A}} = \frac{\dv{L}}{\dv{w}}\frac{\partial w}{\partial A} + \frac{\dv{L}}{\dv{z}}\frac{\partial z}{\partial A} + \frac{\dv{L}}{\dv{y}}\frac{\partial y}{\partial A}.$$

The only unknown quantities in the above are $\frac{\dv{L}}{\dv{z}}$ and $\frac{\dv{L}}{\dv{y}}$, which can be computed using the above chain rule again applied to the nodes $z$ and $y$, respectively. This is precisely how backpropagation works.

Check my answer here for a more detailed explanation: https://math.stackexchange.com/a/3865685/31498. You should be able to fully understand backpropagation after reading that.
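
To make the "sum over children" concrete, here is a small numerical sketch (splitting $A$ into three independent copies and the finite-difference helper are purely illustrative devices, not part of the original network): if each occurrence of $A$ is given its own argument, the sum of the three per-occurrence gradients evaluated at $A_1 = A_2 = A_3 = A$ reproduces the full derivative of $L$ with respect to $A$, i.e. exactly the three-term sum above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_split(A1, A2, A3, B, x):
    # same network, but each occurrence of A gets its own argument:
    # y = A1 x,  z = A2 (u*v),  w = A3 z
    y = A1 @ x
    u = sigmoid(y)
    v = B @ x
    z = A2 @ (u * v)
    w = A3 @ z
    return w @ w

def fd_grad(f, M, eps=1e-6):
    # central finite-difference gradient of the scalar f with respect to the matrix M
    G = np.zeros_like(M)
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            E = np.zeros_like(M)
            E[i, j] = eps
            G[i, j] = (f(M + E) - f(M - E)) / (2 * eps)
    return G

K = 4
rng = np.random.default_rng(1)
x = rng.normal(size=K)
A = rng.normal(size=(K, K))
B = rng.normal(size=(K, K))

# gradient of L when all three occurrences of A vary together ...
total = fd_grad(lambda M: loss_split(M, M, M, B, x), A)
# ... equals the sum of the per-occurrence (per-child) gradients
per_child = (fd_grad(lambda M: loss_split(M, A, A, B, x), A)
             + fd_grad(lambda M: loss_split(A, M, A, B, x), A)
             + fd_grad(lambda M: loss_split(A, A, M, B, x), A))
print(np.allclose(total, per_child, atol=1e-5))  # True
```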

f10w
  • Thanks for your help, but how do I compute $\frac {\partial w} {\partial A}$ ? – IntegrateThis Sep 29 '21 at 16:49
  • @IntegrateThis Obviously $\frac{\partial w}{\partial A} = z$ because $w=Az$. – f10w Sep 29 '21 at 18:20
  • $A$ is a matrix, so $\frac{\partial w} {\partial A}$ has $K*K^2$ entries, but $z$ is a vector of shape $(1, K)$ ? – IntegrateThis Sep 29 '21 at 18:23
  • @IntegrateThis Sorry I was in a hurry. Yes you are right, the derivative should be a tensor. But you get the idea. It's easy to compute it, just do it element by element. (I'm cooking, will get back later to help if you still need it...) – f10w Sep 29 '21 at 18:31
  • I will write up a full answer on how to compute the desired partials later today. This post has been very helpful, though; I think I understand what I was confused about earlier. – IntegrateThis Sep 29 '21 at 18:38
  • @IntegrateThis Great. Looks like you have understood it. +1 for your effort. – f10w Sep 30 '21 at 08:19

$$\frac{dL}{dA} = \frac{dL}{dw}\frac{\partial w}{\partial A} + \frac{dL}{dz}\frac{\partial z}{\partial A} + \frac{dL}{dy}\frac{\partial y}{\partial A}.$$

And $$\frac{dL}{dz} = \frac{dL}{dw} \frac{\partial w}{ \partial z}$$ $$ \frac{dL}{dy} = \frac{dL}{du} \frac{\partial u}{ \partial y} $$ $$ \frac{dL}{du} = \frac{dL}{dz} \frac{\partial z}{ \partial u} $$

Further, $$\frac{dL}{dB} = \frac{dL}{dv} \frac {\partial v} {\partial B}$$ and $$\frac{dL} {dv} = \frac{dL}{dz} \frac {\partial z} {\partial v}$$
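
A sketch of how these formulas translate into code (assuming the products involving $\frac{\partial w}{\partial A}$, $\frac{\partial z}{\partial A}$, $\frac{\partial y}{\partial A}$ and $\frac{\partial v}{\partial B}$ are contracted into outer products with $z$, $u*v$, $x$ and $x$ respectively; the directional finite-difference check at the end is only illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grads(x, A, B):
    # forward pass
    y = A @ x
    u = sigmoid(y)
    v = B @ x
    z = A @ (u * v)
    w = A @ z

    # backward pass following the chain rules above
    dL_dw = 2 * w                  # from L = ||w||^2
    dL_dz = A.T @ dL_dw            # dL/dz = dL/dw * dw/dz,  w = Az
    dL_duv = A.T @ dL_dz           # gradient w.r.t. the product u*v,  z = A(u*v)
    dL_du = v * dL_duv             # dL/du = dL/dz * dz/du
    dL_dv = u * dL_duv             # dL/dv = dL/dz * dz/dv
    dL_dy = u * (1 - u) * dL_du    # dL/dy = dL/du * du/dy,  sigmoid'(y) = u(1-u)

    # A appears in y = Ax, z = A(u*v) and w = Az, so its gradient sums three terms
    dL_dA = np.outer(dL_dw, z) + np.outer(dL_dz, u * v) + np.outer(dL_dy, x)
    dL_dB = np.outer(dL_dv, x)     # dL/dB = dL/dv * dv/dB
    return dL_dA, dL_dB

# illustrative check against a directional finite difference
K = 4
rng = np.random.default_rng(2)
x = rng.normal(size=K)
A = rng.normal(size=(K, K))
B = rng.normal(size=(K, K))

def loss(A_, B_):
    u_ = sigmoid(A_ @ x)
    w_ = A_ @ (A_ @ (u_ * (B_ @ x)))
    return w_ @ w_

dA, dB = grads(x, A, B)
dirA = rng.normal(size=(K, K))
eps = 1e-6
numeric = (loss(A + eps * dirA, B) - loss(A - eps * dirA, B)) / (2 * eps)
print(np.isclose(numeric, np.sum(dA * dirA), rtol=1e-4, atol=1e-6))  # True
```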


$ \def\s{\sigma} \def\qiq{\quad\implies\quad} \def\LR#1{\left(#1\right)} \def\c#1{\color{red}{#1}} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\cgrad#1#2{\c{\grad{#1}{#2}}} $Calculate the differential of each of the variables in your list
$$\eqalign{ y &= Ax &\qiq dy = dA\:x + A\:dx \\ u &= \s(y) \\ U &= {\rm Diag}(u) &\qiq du = (U - U^2)\:dy \\ v &= Bx &\qiq dv = dB\:x + B\:dx \\ z &= A (u\odot v) \\ V &= {\rm Diag}(v) &\qiq dz = dA\:Vu + AV\:du + AU\:dv \\ w &= Az &\qiq dw = dA\:z + A\:dz \\ L &= \|w\|^2 &\qiq dL = 2w^Tdw \\ }$$

Then start at the last differential and back-substitute to the first
$$\eqalign{ dL &= 2w : dw \\ \tfrac 12\:dL &= w : dw \\ &= w : \LR{dA\:z + A\:dz} \\ &= wz^T:dA + \LR{A^Tw}:dz \\ &= wz^T:dA + \LR{A^Tw}:\LR{dA\:Vu + AV\:du + AU\:dv} \\ &= wz^T:dA + {A^Twu^TV}:dA + VA^TA^Tw:du + UA^TA^Tw:dv \\ \\ }$$

This is getting absurdly long, so define a few variables before continuing
$$\eqalign{ P &= wz^T+A^Twu^TV, \qquad q = VA^TA^Tw, \qquad r = UA^TA^Tw \\ \\ \tfrac 12\:dL &= P:dA + r:dv + q:du \\ &= P:dA + r:\LR{dB\:x + B\:dx} + q:(U - U^2)\:dy \\ &= P:dA + \LR{rx^T:dB + B^Tr:dx} + (U-U^2)q:dy \\ &= P:dA + rx^T:dB + B^Tr:dx + (U-U^2)q:\LR{dA\:x + A\:dx} \\ &= \LR{P+(U - U^2)qx^T}\c{:dA} + rx^T\c{:dB} + \LR{B^Tr+A^T(U-U^2)q}\c{:dx} \\ }$$

Now the desired gradients can be easily identified
$$\eqalign{ \cgrad{L}{A} &= {2P+2(U - U^2)qx^T}, \quad \cgrad{L}{B} &= 2rx^T, \quad \cgrad{L}{x} &= {2B^Tr + 2A^T(U-U^2)q} \\ \\ }$$


The Frobenius product $(:)$ is extraordinarily useful in Matrix Calculus $$\eqalign{ \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \\ }$$ When applied to vectors $(n=\tt1)$ it reduces to the standard dot product.

The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many fruitful ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A &= \LR{A^TC}:B \\ }$$
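
As a numerical sanity check (a sketch with hypothetical random inputs, replacing the diagonal matrices $U$ and $V$ by element-wise products), the closed-form gradients can be evaluated directly and compared against finite differences:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

K = 4
rng = np.random.default_rng(3)
x = rng.normal(size=K)
A = rng.normal(size=(K, K))
B = rng.normal(size=(K, K))

# forward pass
y = A @ x
u = sigmoid(y)
v = B @ x
z = A @ (u * v)
w = A @ z

# closed-form gradients from the derivation above:
# P = w z^T + A^T w u^T V,  q = V A^T A^T w,  r = U A^T A^T w
P = np.outer(w, z) + np.outer(A.T @ w, u * v)     # u^T V = (V u)^T = (u*v)^T
q = v * (A.T @ (A.T @ w))
r = u * (A.T @ (A.T @ w))
dL_dA = 2 * P + 2 * np.outer(u * (1 - u) * q, x)  # (U - U^2) q = u*(1-u)*q
dL_dB = 2 * np.outer(r, x)

# finite-difference check of dL/dB (dL/dA checks the same way)
def loss(B_):
    v_ = B_ @ x
    w_ = A @ (A @ (u * v_))   # u does not depend on B
    return w_ @ w_

eps = 1e-6
fd = np.zeros_like(B)
for i in range(K):
    for j in range(K):
        E = np.zeros_like(B)
        E[i, j] = eps
        fd[i, j] = (loss(B + E) - loss(B - E)) / (2 * eps)
print(np.allclose(fd, dL_dB, atol=1e-6))  # True
```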

greg