
According to Wikipedia, forward-mode differentiation is preferred when $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ with $m \gg n$. I cannot see any computational benefit. Let us take a simple example: $f(x,y) = \sin(xy)$. We can visualize it as a graph with four nodes and three edges: the top node is $\sin(xy)$, the node one level below is $xy$, and the two initial nodes are $x$ and $y$. The local derivatives on the edges are $\cos(xy)$, $x$, and $y$. For both reverse- and forward-mode differentiation we have to compute these derivatives. How is reverse-mode differentiation computationally superior here?
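For concreteness, here is the example written out as that two-step graph, with the chain-rule result checked against a finite difference (plain Python; the numeric values are arbitrary):

```python
import math

# The example as a two-step graph: a = x*y, out = sin(a).
# Local (edge) derivatives: d(out)/da = cos(a), da/dx = y, da/dy = x.
x, y = 1.5, 2.0
a = x * y
dout_da, da_dx, da_dy = math.cos(a), y, x

# Chain rule along the graph, checked against a finite difference in x.
dfdx = dout_da * da_dx
eps = 1e-6
print(dfdx, (math.sin((x + eps) * y) - math.sin(x * y)) / eps)
```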

2 Answers


The chain rule states that to compute the Jacobian of an operation we should multiply the Jacobians of all sub-operations together. The difference between forward- and reverse-mode auto-differentiation is the order in which we multiply those Jacobians.

In your case you only have two sub-operations, $xy$ and $\sin(\cdot)$, leading to only one matrix multiplication, so it isn't really instructive. However, let's consider an operation with three sub-operations. Take the function $$ \mathbf{y} = f(\mathbf{x}) = r(q(p(\mathbf{x}))), $$ where $\mathbf{x}$ and $\mathbf{y}$ are vectors of different lengths. We can break this down into $$ \mathbf{a} = p(\mathbf{x}),~~~~ \mathbf{b} = q(\mathbf{a}),~~~~ \mathbf{y} = r(\mathbf{b}). $$ This gives us the Jacobian $$ \underbrace{\frac{\partial \mathbf{y}}{\partial \mathbf{x}}}_{|\mathbf{y}|\times|\mathbf{x}|} = \underbrace{\frac{\partial r(\mathbf{b})}{\partial \mathbf{b}}}_{|\mathbf{y}|\times|\mathbf{b}|} \underbrace{\frac{\partial q(\mathbf{a})}{\partial \mathbf{a}}}_{|\mathbf{b}|\times|\mathbf{a}|} \underbrace{\frac{\partial p(\mathbf{x})}{\partial \mathbf{x}}}_{|\mathbf{a}|\times|\mathbf{x}|}, $$ with the size of each matrix noted below it.

The time taken to compute each of those intermediate Jacobians is fixed, but the order in which we multiply them together changes the number of operations required. Forward-mode auto-differentiation would compute $$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial r(\mathbf{b})}{\partial \mathbf{b}}\left(\frac{\partial q(\mathbf{a})}{\partial \mathbf{a}}\frac{\partial p(\mathbf{x})}{\partial \mathbf{x}}\right), $$ which involves $|\mathbf{x}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{x}|\cdot|\mathbf{b}|\cdot|\mathbf{y}|$ multiplications*. In contrast, reverse-mode auto-differentiation would compute $$ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left(\frac{\partial r(\mathbf{b})}{\partial \mathbf{b}}\frac{\partial q(\mathbf{a})}{\partial \mathbf{a}}\right)\frac{\partial p(\mathbf{x})}{\partial \mathbf{x}}, $$ which involves $|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{x}|$ multiplications.

Assuming for simplicity that $|\mathbf{a}|=|\mathbf{b}|$, in the case that $|\mathbf{y}|\gt|\mathbf{x}|$, we can see that forward-mode auto-differentiation results in fewer operations. Similarly, if $|\mathbf{y}|\lt|\mathbf{x}|$ then reverse-mode auto-differentiation results in fewer operations.

This means that reverse-mode auto-differentiation (a.k.a. backpropagation) will usually be faster when $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ with $m \ll n$, i.e. the function's output is low-dimensional (e.g. a scalar loss) but its input is high-dimensional (e.g. the millions of parameters of a neural network).
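To put numbers on this, here is a small sketch (plain Python; the concrete sizes are made up for illustration) that plugs shapes into the two formulas above, using the cost model from the footnote below, in the scalar-loss, many-parameters setting:

```python
# Scalar-multiplication counts for the two orderings of the Jacobian chain
# dy/dx = (dr/db)(dq/da)(dp/dx).  Dimension names follow the answer above.

def matmul_cost(a, b, c):
    # An (a x b) @ (b x c) product needs a*b*c scalar multiplications.
    return a * b * c

def forward_mode_cost(nx, na, nb, ny):
    # First (dq/da @ dp/dx): (nb x na)(na x nx), then dr/db times the result.
    return matmul_cost(nb, na, nx) + matmul_cost(ny, nb, nx)

def reverse_mode_cost(nx, na, nb, ny):
    # First (dr/db @ dq/da): (ny x nb)(nb x na), then the result times dp/dx.
    return matmul_cost(ny, nb, na) + matmul_cost(ny, na, nx)

# Scalar output (ny = 1), a million inputs, modest intermediate sizes.
nx, na, nb, ny = 1_000_000, 128, 128, 1
print(f"forward: {forward_mode_cost(nx, na, nb, ny):,}")   # 16,512,000,000
print(f"reverse: {reverse_mode_cost(nx, na, nb, ny):,}")   # 128,016,384
```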

This reasoning about whether forward or reverse mode is preferable extends to longer chains of Jacobians. However, exceptions can occur, e.g. when the lowest-dimensional variable occurs at neither the input nor the output of the function, but somewhere in between. In such cases, the optimal ordering of matrix multiplications won't be fully forward- or reverse-mode, but a hybrid scheme.
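For a plain chain of Jacobians (no branching), finding that optimal ordering is exactly the classic matrix-chain-multiplication problem, which can be solved by dynamic programming over the shapes alone. Here is a minimal sketch (plain Python; the shapes in the example are made up), in which a purely forward or purely reverse sweep is just one of the candidate orderings:

```python
from functools import lru_cache

def cheapest_order(dims):
    # dims has one more entry than the number of matrices: matrix i has
    # shape dims[i] x dims[i+1], so the whole chain is dims[0] x dims[-1].
    @lru_cache(None)
    def cost(i, j):  # cheapest way to multiply matrices i..j into one
        if i == j:
            return 0
        return min(
            cost(i, k) + cost(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )
    return cost(0, len(dims) - 2)

# Three Jacobians for a map R^1000 -> R^1: shapes 1x50, 50x50, 50x1000.
print(cheapest_order((1, 50, 50, 1000)))   # 52,500 (the reverse sweep wins here)
```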

I've discussed theoretical/idealized considerations, but there are practical considerations too. For example, reverse-mode auto-differentiation requires a forward pass through the code to compute values, then a reverse pass to compute the derivatives. A trace of the values needs to be stored during the forward pass, in order to compute the reverse pass. This increases the complexity of implementing and running reverse-mode auto-differentiation.
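As a concrete (if toy) illustration of the forward pass, the stored trace, and the reverse pass, here is a hand-written sketch of reverse mode for the question's $f(x,y)=\sin(xy)$ in plain Python; it is not any particular library's API, just the bookkeeping spelled out:

```python
import math

# Reverse mode on f(x, y) = sin(x*y), written out by hand.  The forward
# pass computes values and keeps the intermediates the reverse pass will
# need (here just x, y and a = x*y; a real system records these on a
# "tape").  The reverse pass then walks the graph backwards, multiplying
# each local derivative onto the incoming adjoint.

def f_and_grad(x, y):
    # ---- forward pass ----
    a = x * y            # needed later for d(sin)/da = cos(a)
    out = math.sin(a)

    # ---- reverse pass ----
    adj_out = 1.0                    # d(out)/d(out)
    adj_a = adj_out * math.cos(a)    # d(out)/da
    adj_x = adj_a * y                # da/dx = y
    adj_y = adj_a * x                # da/dy = x
    return out, adj_x, adj_y

value, dfdx, dfdy = f_and_grad(1.5, 2.0)
print(dfdx, 2.0 * math.cos(3.0))   # both should equal y*cos(x*y)
```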


*The number of scalar multiplications required to multiply two matrices of sizes $a\times b$ and $b\times c$ is $a\cdot b\cdot c$.
  • What does $|\mathbf{x}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{x}|\cdot|\mathbf{b}|\cdot|\mathbf{y}|$ mean? (I didn't understand the notation.) Moreover, is the order of computation first the math inside the parentheses, and then the part outside? – Guilherme Parreira Aug 13 '20 at 18:07
  • $|\mathbf{x}|$ is the length of $\mathbf{x}$, i.e. the number of elements; $\cdot$ just means multiply. Regarding the order: yes, things inside parentheses are computed first. – user664303 Aug 14 '20 at 19:32
  • Thanks. But I still don't understand how you obtained $|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{b}|+|\mathbf{y}|\cdot|\mathbf{a}|\cdot|\mathbf{x}|$ from the chain rule ($\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \left(\frac{\partial r(\mathbf{b})}{\partial \mathbf{b}}\frac{\partial q(\mathbf{a})}{\partial \mathbf{a}}\right)\frac{\partial p(\mathbf{x})}{\partial \mathbf{x}}$). – Guilherme Parreira Aug 14 '20 at 21:22
  • @GuilhermeParreira I have updated the answer to explain this. – user664303 Aug 28 '20 at 20:05
  • Now I understand! Thank you!! – Guilherme Parreira Feb 23 '21 at 21:16

An analogy might help. Let $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ be matrices with dimensions such that $\mathbf{A}\mathbf{B}\mathbf{C}$ is well defined. There are two obvious ways to compute this product, represented by $(\mathbf{A}\mathbf{B})\mathbf{C}$ and $\mathbf{A}(\mathbf{B}\mathbf{C})$. Which of those will require fewer multiplications and additions depends on the dimensions of the matrices. For example, if $\mathbf{C}$ has width 1 then the second form will be faster, or at least no slower. It's efficient to multiply by thin or short matrices early.
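A quick numerical sanity check of that claim, with made-up shapes (numpy): both parenthesisations give the same matrix, but when $\mathbf{C}$ is a single column the second one needs far fewer scalar multiplications.

```python
import numpy as np

# Illustrative shapes only: C has width 1, as in the analogy above.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 400))
B = rng.standard_normal((400, 300))
C = rng.standard_normal((300, 1))

left_first  = (A @ B) @ C   # 500*400*300 + 500*300*1 = 60,150,000 multiplications
right_first = A @ (B @ C)   # 400*300*1   + 500*400*1 =    320,000 multiplications

print(np.allclose(left_first, right_first))   # True: same result, different cost
```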

If $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ correspond to the Jacobians of successive steps in the computation graph, with $\mathbf{C}$ the Jacobian of the first step (the one applied directly to the input), then $\mathbf{C}$ having width $1$ corresponds to the case $n=1$, where the second form, $\mathbf{A}(\mathbf{B}\mathbf{C})$, i.e. forward mode, is cheaper. Conversely, $\mathbf{A}$ having height $1$ corresponds to $m=1$, where the first form, $(\mathbf{A}\mathbf{B})\mathbf{C}$, i.e. reverse mode, is cheaper.