While going through the second course of Andrew Ng's deep learning specialization, I came across a sentence that said:
With a well-tuned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
But how is that possible? Can mini-batch gradient descent really give us a better set of weights and biases even though it's not updating them based on the whole dataset? The only explanation I can think of is that it may be less prone to overfitting and therefore gives better results.
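To make it concrete, here is a minimal sketch of what I mean by mini-batch updates (a toy linear regression in NumPy; the function names, learning rate, and batch sizes are my own choices, not from the course). Setting `batch_size` to the full dataset size recovers batch gradient descent, `batch_size=1` recovers stochastic gradient descent, and anything in between gives many more weight updates per pass over the data:

```python
import numpy as np

# Toy linear-regression data (illustrative only, not from the course).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # 1000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, X_batch, y_batch):
    """Gradient of mean squared error w.r.t. w on the given batch."""
    error = X_batch @ w - y_batch
    return 2 * X_batch.T @ error / len(y_batch)

def minibatch_gd(X, y, batch_size=64, lr=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)             # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * grad(w, X[batch], y[batch])  # one update per mini-batch
    return w

w_mb = minibatch_gd(X, y, batch_size=64)       # mini-batch
w_full = minibatch_gd(X, y, batch_size=len(y)) # full-batch gradient descent
print("mini-batch loss:", np.mean((X @ w_mb - y) ** 2))
print("full-batch loss:", np.mean((X @ w_full - y) ** 2))
```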