While going through the second course of Andrew Ng's deep learning specialization, I came across a sentence that said:
With a well-tuned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).
But how is that possible? Can mini-batch gradient descent really give us a better set of weights and biases even though it's not updating them based on the whole dataset? The only explanation I can think of is that it may be less prone to overfitting and therefore gives better results.
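To make it concrete, here is a minimal sketch of what I mean by mini-batch updates (a toy linear regression in NumPy; the function names, learning rate, and batch sizes are my own choices, not from the course). Setting `batch_size` to the full dataset size recovers batch gradient descent, `batch_size=1` recovers stochastic gradient descent, and anything in between gives many more weight updates per pass over the data:

```python
import numpy as np

# Toy linear-regression data (illustrative only, not from the course).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # 1000 examples, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, X_batch, y_batch):
    """Gradient of mean squared error w.r.t. w on the given batch."""
    error = X_batch @ w - y_batch
    return 2 * X_batch.T @ error / len(y_batch)

def minibatch_gd(X, y, batch_size=64, lr=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)             # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * grad(w, X[batch], y[batch])  # one update per mini-batch
    return w

w_mb = minibatch_gd(X, y, batch_size=64)       # mini-batch
w_full = minibatch_gd(X, y, batch_size=len(y)) # full-batch gradient descent
print("mini-batch loss:", np.mean((X @ w_mb - y) ** 2))
print("full-batch loss:", np.mean((X @ w_full - y) ** 2))
```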