No, I wouldn't consider backprop a training algorithm. Backpropagation is just a way to compute the derivative of the loss function with respect to the network's parameters by using the chain rule. Computing a derivative doesn't train anything.
What you do with this derivative in order to minimize the loss function is the training part.
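To make the distinction concrete, here's a minimal sketch using PyTorch autograd (the single weight and the learning rate are arbitrary, just for illustration): the backward call only fills in the derivative; the parameter update that actually reduces the loss is a separate step.

import torch

# Toy setup: one weight w, loss L(w) = (w * x - y)^2
w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(1.0)

loss = (w * x - y) ** 2
loss.backward()            # backprop: only computes dL/dw and stores it in w.grad

with torch.no_grad():      # the actual "training" part: use the derivative
    w -= 0.01 * w.grad     # one gradient-descent update
    w.grad.zero_()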
EDIT:
I think it will depend on who you ask. Take, for example, this PyTorch tutorial. They say that "Backward propagation: In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent."
I.e. the two steps loss.backward() and optim.step() together are what they call backpropagation. This is what I'd call the more engineering viewpoint, and I believe it is a semantic shift away from what I'd argue (see comments!) is actually backprop, which is just the loss.backward() step.
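Spelled out in a typical loop, the two calls look something like this (the model, optimizer, and data here are placeholders, not taken from the tutorial); the comments mark which call is backprop in the narrow sense and which is the optimization:

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optim = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

loss = nn.functional.mse_loss(model(inputs), targets)

optim.zero_grad()
loss.backward()   # backprop in the narrow sense: compute gradients via the chain rule
optim.step()      # optimization: adjust the parameters using those gradients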
The semantic drift of backprop meaning calculating the derivatives together with optimization makes sense in this context. Why would you call loss.backward() and then not call optim.step()? But, originally (and technically, the best kind of correct), backprop refers to just the computation of the derivatives, and I think you'll find that terminology more in math/theory contexts than in programming/engineering contexts.