Could you please tell me why we use a learning rate to move in the direction of the derivative to find the minimum? Why is it not good to simply compute where it is 0?
What do you mean by "Why is it not good to simply compute where it is 0?"? – stmax Apr 18 '16 at 07:16
3 Answers
The learning rate controls how large a step is taken along the gradient during gradient descent. Setting it too high makes your path unstable; setting it too low makes convergence slow. Setting it to zero means your model isn't learning anything from the gradients.
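As a rough illustration (not part of the original answer), here is a toy NumPy sketch of gradient descent on the cost $J(w) = w^2$ with different learning rates; the specific rates are arbitrary choices made for demonstration:

```python
import numpy as np

def gradient_descent(lr, steps=20, w0=5.0):
    """Minimize J(w) = w^2 (gradient 2w) starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # step against the gradient, scaled by the learning rate
    return w

print(gradient_descent(lr=1.05))  # too high: |w| grows each step, the path is unstable
print(gradient_descent(lr=0.01))  # too low: still far from the minimum after 20 steps
print(gradient_descent(lr=0.0))   # zero: w never moves, nothing is learned
print(gradient_descent(lr=0.3))   # reasonable: converges close to the minimum at 0
```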

@user3435407 ??? Setting the derivative to zero means your model is not moving. – SmallChess Apr 18 '16 at 06:47
I think what user3435407 means is setting the derivative of the loss function to zero and solving for the coefficients. That actually works for linear regression and gives the closed-form solution $\beta = (X^T X)^{-1} X^T y$. It only works for linear regression though - it does not work for logistic regression and most other generalized linear models. – stmax Apr 18 '16 at 07:26
@stmax Thanks. If that's what he really means, I'll change my answer. – SmallChess Apr 18 '16 at 07:28
Yes stmax, that's what I meant. Sorry Student, for expressing myself so clumsily – user3435407 Apr 18 '16 at 19:51
Setting the derivative to zero and solving is called the Newton-Raphson method, and it is one of the most popular methods in maximum likelihood modeling. – Apr 18 '16 at 22:47
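To make stmax's comment concrete, here is a small sketch, assuming NumPy and made-up toy data, of the closed-form (normal equation) solution for linear regression. No such closed form exists for logistic regression, which is why iterative methods with a learning rate (or Newton-Raphson) are used there:

```python
import numpy as np

# Toy data: y = 2*x1 + 3*x2 plus a little noise (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, 3.0]) + 0.1 * rng.normal(size=100)

# Closed-form solution: beta = (X^T X)^{-1} X^T y, without any learning rate
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # close to [2, 3]
```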
The use of the learning rate can be understood using the image below.
Usually the cost functions used in gradient descent are convex, as shown in the image above. The same holds for data with multiple features, because for such data we can reason in a similar way, one feature at a time.
Let's say we are at point A during training, at which point the gradient is G; this means the cost is increasing fastest in the direction of G. So we want to move in the opposite direction of G with some step size, which is the learning rate.
As in the figure above, the cost is increasing in the direction of the negative $w_2$ axis, so we want to move in the direction of the positive $w_2$ axis. But if we move too far in that direction, i.e. to the point $w_2 = 6$, then the cost value has actually increased. If we always move at that same rate, we'll never reach the minimum point.
So we need a learning rate that is suitable for this cost function: large enough that the descent is fast, but low enough that the step doesn't shoot over to the other side of the curve.
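A minimal sketch of the single step described above (point A, gradient G, move opposite to G by the learning rate), assuming a toy convex cost over two weights; the cost function and numbers are made up for illustration and are not the ones in the figure:

```python
import numpy as np

def cost(w):
    # Illustrative convex cost over two weights w1, w2, with minimum at (1, 2)
    return (w[0] - 1.0) ** 2 + (w[1] - 2.0) ** 2

def grad(w):
    # Gradient G: the direction in which the cost increases fastest
    return np.array([2 * (w[0] - 1.0), 2 * (w[1] - 2.0)])

w = np.array([4.0, -1.0])        # point A
G = grad(w)                      # gradient at A
lr = 0.1                         # learning rate (step size)
w_next = w - lr * G              # move opposite to G
print(cost(w), cost(w_next))     # the cost decreases after the step

# With too large a step, the update overshoots past the minimum and the cost rises again
w_big = w - 1.5 * grad(w)
print(cost(w_big) > cost(w))     # True: shot over to the other side of the curve
```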

The direction is governed by the derivative that we use in the gradient descent algorithm. Alpha basically tells how aggressive a step the algorithm makes. If you set alpha = 0.10, it will take larger steps in each iteration of GD than in the case of alpha = 0.01. In other words, alpha determines how large the changes made to the parameters are per iteration.
Gradient Descent Algorithm: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
Why is it not good to simply compute where it is 0?
Setting alpha to zero will make the algorithm learn nothing from the examples.
How to choose alpha?
It is a trial-and-error process. You try different values of alpha and plot the graph of the cost (objective) function against the number of iterations performed.
Based on the above graph, alpha = 0.3 causes the GD algorithm to converge in fewer iterations.
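A small sketch of that trial-and-error process, assuming NumPy and a toy cost $J(w) = w^2$ (the candidate alphas are arbitrary). The recorded cost per iteration is what you would plot to compare the values; here it is simply printed to stay self-contained:

```python
import numpy as np

def run_gd(alpha, iterations=50, w0=5.0):
    """Gradient descent on J(w) = w^2; returns the cost at every iteration."""
    w, costs = w0, []
    for _ in range(iterations):
        w -= alpha * 2 * w          # theta := theta - alpha * dJ/dtheta
        costs.append(w ** 2)
    return costs

for alpha in (0.01, 0.1, 0.3):      # candidate learning rates to try
    costs = run_gd(alpha)
    print(alpha, costs[-1])          # a smaller final cost means faster convergence here
```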
