Questions tagged [gradient-descent]

Gradient Descent is an algorithm for finding the minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and takes steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Gradient descent is a first-order iterative optimization algorithm used to find the values of the parameters (coefficients) of a function f that minimize a cost function.

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

Gradient descent is also known as steepest descent, or the method of steepest descent.

https://en.wikipedia.org/wiki/Gradient_descent
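For concreteness, here is a minimal sketch of the idea in Python (all names are illustrative; this is not from any particular library): gradient descent on a one-dimensional quadratic cost, stepping against the derivative until it converges.

```python
# Minimal gradient descent: minimize C(w) = (w - 3)^2.
# The derivative is C'(w) = 2 * (w - 3); each step moves against it.

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0          # initial guess
alpha = 0.1      # learning rate (step size)
for _ in range(100):
    w -= alpha * gradient(w)

print(w)  # converges toward the minimizer w = 3
```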

450 questions
7 votes • 2 answers

Duplicated features for gradient descent

Suppose that our data matrix X has a duplicated column, i.e., there is a duplicated feature and the matrix is not full column rank. What happens? I guess that we cannot find a unique solution, because that's the case for the closed form in linear…
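The situation is easy to reproduce numerically. A minimal sketch (illustrative, not from the thread) showing that the normal equations break down with a duplicated column, while gradient descent still converges, just to one of infinitely many least-squares solutions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x])        # duplicated feature: rank 1, not full column rank
y = 4.0 * x[:, 0]            # true signal uses the feature once

# The closed form fails: X^T X is singular, so it has no inverse.
print(np.linalg.matrix_rank(X.T @ X))  # 1 (a 2x2 matrix of rank 1)

# Gradient descent on least squares still runs; it just picks one solution.
w = np.zeros(2)
alpha = 0.01
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)
    w -= alpha * grad
print(w, w.sum())  # w[0] + w[1] ≈ 4; the split between the two is arbitrary
```

Starting from zeros, symmetry keeps the two weights equal; a different initialization converges to a different but equally valid split.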
7 votes • 1 answer

Is the magnitude of the gradient a weakness of Gradient Descent?

The formula for Gradient Descent is as follows: $$ \mathbf{w} := \mathbf{w} - \alpha\, \nabla C $$ The gradient itself points in the direction of steepest ascent, therefore it is logical to go in the opposite direction by subtracting it. But…
oezguensi • 602
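A common way to address exactly this concern (a sketch, assuming the worry is that the step size scales with $\|\nabla C\|$; none of this is from the thread) is to decouple direction from magnitude by clipping or normalizing the gradient:

```python
import numpy as np

def clipped_step(w, grad, alpha, max_norm=1.0):
    # Plain gradient descent scales the step by ||grad||, which can
    # overshoot in steep regions; clipping caps the step length at
    # alpha * max_norm while preserving the direction.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return w - alpha * grad

w = np.array([10.0, -10.0])
grad = np.array([200.0, -50.0])          # very steep region of the cost
print(clipped_step(w, grad, alpha=0.1))  # moves at most 0.1 in any direction
```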
6 votes • 4 answers

How to fit a math formula to data?

I have a math formula and some data, and I need to fit the data to this model. The formula is $y(x) = ax^k + b$, and I need to estimate $a$ and $b$. I have tried gradient descent to estimate these params, but it seems that it is somewhat time…
Mahdi Amrollahi • 263
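For a two-parameter model like this, a library least-squares routine is usually faster and more robust than hand-rolled gradient descent. A sketch using SciPy's curve_fit, under the assumption that the exponent $k$ is known (here $k = 2$, and the true values 1.5 and -0.7 are made up for the demo):

```python
import numpy as np
from scipy.optimize import curve_fit

K = 2.0  # assumed known; to fit k as well, make it a third model parameter

def model(x, a, b):
    return a * x**K + b

xdata = np.linspace(0.1, 5.0, 50)
ydata = model(xdata, 1.5, -0.7) + np.random.normal(0.0, 0.05, xdata.size)

(a_hat, b_hat), _ = curve_fit(model, xdata, ydata, p0=(1.0, 0.0))
print(a_hat, b_hat)  # should land near 1.5 and -0.7
```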
5 votes • 2 answers

Gradient Checking LSTM - how to get change in Cost across timesteps?

I am performing a gradient check for my LSTM, which has 4 timesteps. The LSTM looks as follows:

      01       01       01       01
       ^        ^        ^        ^
      LSTM --> LSTM --> LSTM --> LSTM
       ^        ^        ^        ^
      11       11       11        …
Kari • 2,726
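For reference, the standard recipe (independent of any particular LSTM implementation) is a central-difference check: perturb one parameter at a time, recompute the total cost summed over all timesteps, and compare against the analytic gradient. A minimal sketch, assuming the cost takes a flat parameter vector:

```python
import numpy as np

def numerical_grad(cost, theta, eps=1e-5):
    # Central differences: nudge each parameter in isolation and measure
    # the change in the *total* cost (summed over all timesteps).
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        c_plus = cost(theta)
        theta[i] -= 2 * eps
        c_minus = cost(theta)
        theta[i] += eps                  # restore the original value
        grad[i] = (c_plus - c_minus) / (2 * eps)
    return grad

# Stand-in for the summed LSTM cost, just to show the comparison.
cost = lambda t: np.sum(t ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(numerical_grad(cost, theta))       # ≈ analytic gradient 2 * theta
```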
2 votes • 2 answers

How do local minima occur in the equation of loss function?

In gradient descent, I know that local minima occur when the derivative of a function is zero, but when the loss function is used, the derivative is equal to zero only when the output and the predicted output are the same (according to the equation…
AI_new2 • 85
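The resolution (an illustration, not from the thread) is that with a nonlinear model the gradient of the loss can vanish even when predictions don't match the targets; zero gradient implies a perfect fit only in convex cases like linear least squares. A one-parameter sketch:

```python
import numpy as np

# Data generated by y = sin(2x); the model is y_hat = sin(w * x).
x = np.linspace(0.5, 4.0, 40)
y = np.sin(2.0 * x)

def loss(w):
    return np.mean((np.sin(w * x) - y) ** 2)

# Scan the loss over w: several valleys appear, not just w = 2.
ws = np.linspace(0.0, 8.0, 801)
ls = np.array([loss(w) for w in ws])
minima = [round(ws[i], 2) for i in range(1, 800)
          if ls[i] < ls[i - 1] and ls[i] < ls[i + 1]]
print(minima)  # multiple local minima; only w ≈ 2 makes y_hat equal y
```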
2 votes • 1 answer

Why L2 norm in AdaGrad update equation not L1?

The update equation of AdaGrad is as follows: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t$$ where $G_t$ accumulates the squares of past gradients coordinate-wise. I understand that sparse features have small updates and this is a problem. I understand that the idea of AdaGrad is to make the update speed (learning rate) of a parameter inversely proportional to…
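For concreteness, a minimal sketch of the update as usually written (the standard formulation, not the asker's original image):

```python
import numpy as np

def adagrad_step(theta, g, G, eta=0.1, eps=1e-8):
    # G accumulates squared gradients per coordinate, so the denominator
    # is an L2-style accumulation (sum of g^2), not a sum of |g|.
    G += g * g
    theta -= eta * g / (np.sqrt(G) + eps)
    return theta, G

theta, G = np.array([1.0, 1.0]), np.zeros(2)
for t in range(100):
    # Coordinate 0 gets a gradient every step; coordinate 1 only rarely.
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, G = adagrad_step(theta, g, G)
print(G)  # the frequent coordinate accumulated more, so its rate shrank more
```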
2 votes • 1 answer

How to determine the convergence of Stochastic Gradient Descent?

While coding batch gradient descent, it is easy to code the convergence check: after each iteration the cost moves towards the minimum, and when the change in cost falls below a pre-defined number, we stop the iterations and conclude our gradient…
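Since per-example SGD losses are noisy, single-iteration comparisons don't work; a common fix (an illustrative sketch, not from an answer here) is to compare running averages of the loss over windows of updates:

```python
from collections import deque

def sgd_converged(loss_history, window=100, tol=1e-4):
    # Compare the mean loss of the last two windows of updates; the
    # per-step SGD loss is far too noisy to compare step by step.
    if len(loss_history) < 2 * window:
        return False
    hist = list(loss_history)
    recent = sum(hist[-window:]) / window
    previous = sum(hist[-2 * window:-window]) / window
    return abs(previous - recent) < tol

losses = deque(maxlen=1000)
# Inside the training loop:
#     losses.append(current_example_loss)
#     if sgd_converged(losses):
#         break
```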
2 votes • 1 answer

Calculating the average of gradient descent

I am currently studying the backpropagation process and the gradient descent algorithm from the book Neural Networks and Deep Learning by Michael Nielsen and from the 3Blue1Brown channel on YouTube. My question is about calculating the gradient in…
Morti • 23
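In both of those sources, the update averages the per-example gradients over the batch before stepping; a minimal sketch of that averaging (names illustrative, squared error on a linear model):

```python
import numpy as np

def batch_update(w, X, y, alpha):
    # One gradient per training example, then the 1/n average from
    # Nielsen's notation, then a single descent step.
    grads = [(w @ x - t) * x for x, t in zip(X, y)]
    avg_grad = np.mean(grads, axis=0)
    return w - alpha * avg_grad

w = np.zeros(2)
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]])
y = np.array([5.0, 4.0, 3.0])
for _ in range(500):
    w = batch_update(w, X, y, alpha=0.1)
print(w)  # ≈ [1, 2], the least-squares solution for this toy data
```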
2 votes • 2 answers

Calculating derivative of error at point x with respect to weight w_j

I don't know how the equation below goes from line 2 to line 3 after the derivative term is moved inside the brackets. Specifically, how is it calculating the derivative of $\log(\hat{y})$? Also, if anyone can point to a good textbook or website to learn…
mLstudent33 • 594
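If this is the usual logistic-regression setting with $\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x})$ (an assumption; the excerpt doesn't show the model), the step in question is just the chain rule together with $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

```latex
\frac{\partial}{\partial w_j} \log \hat{y}
  = \frac{1}{\hat{y}} \, \frac{\partial \hat{y}}{\partial w_j}
  = \frac{1}{\hat{y}} \, \hat{y}\,(1 - \hat{y})\, x_j
  = (1 - \hat{y})\, x_j
```

using $\partial z / \partial w_j = x_j$ for $z = \mathbf{w} \cdot \mathbf{x}$.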
2 votes • 1 answer

Intuition behind using the inverse of a Hessian matrix for automatically estimating the learning rate (aggression parameter) in gradient descent.

I am reviewing some course material where the lecturer suggests that instead of guessing the learning rate parameter in gradient descent implementation, one could use the inverse of the Hessian multiplied by the negative of the Jacobian, to…
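What the lecturer is describing is a Newton step. On a quadratic cost it reaches the minimum in a single update, which is the intuition: the inverse Hessian supplies a perfectly scaled, direction-dependent learning rate. An illustrative sketch:

```python
import numpy as np

# Quadratic cost C(w) = 0.5 * w^T A w - b^T w, so the gradient is
# A w - b and the Hessian is the constant matrix A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])

w = np.zeros(2)
grad = A @ w - b
w = w - np.linalg.solve(A, grad)   # Newton step: w - H^{-1} * gradient

print(w, A @ w - b)  # the gradient is now ~0: one step hit the minimum
```

On non-quadratic costs the Hessian changes from point to point, so the step is recomputed each iteration and is only locally optimal.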
2 votes • 1 answer

Stochastic gradient descent in matrix factorization, sensitive to label's scale?

I'm trying to figure out a strange phenomenon when I use matrix factorization (the Netflix Prize solution) for a rating matrix: $R = P^\top Q + B_u + B_i$, with ratings ranging from 1 to 10. Then I evaluate the model by each label's absolute mean…
zihaolucky • 141
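For context, the per-rating SGD update in this kind of biased factorization (a standard sketch, not the asker's code) makes the scale sensitivity visible: every step is proportional to the prediction error, which grows with the rating range:

```python
import numpy as np

def sgd_update(p_u, q_i, b_u, b_i, r, lr=0.005, reg=0.02):
    # One observed rating r for user u and item i. The error e, and
    # therefore every step below, scales with the magnitude of r.
    e = r - (p_u @ q_i + b_u + b_i)
    p_u_new = p_u + lr * (e * q_i - reg * p_u)
    q_i_new = q_i + lr * (e * p_u - reg * q_i)
    b_u_new = b_u + lr * (e - reg * b_u)
    b_i_new = b_i + lr * (e - reg * b_i)
    return p_u_new, q_i_new, b_u_new, b_i_new

p_u, q_i = np.full(8, 0.1), np.full(8, 0.1)
p_u, q_i, b_u, b_i = sgd_update(p_u, q_i, 0.0, 0.0, r=9.0)
print(b_u)  # a 1-10 scale produces ~10x the step of a 0-1 scale
```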
2 votes • 1 answer

How to apply Gradient Descent to a sum of functions

My goal is to find the center of a circle that approximates a set of points. I want to find the minimum of the function: $$\sum_{i=0}^N \left(\sqrt{(x_i - a)^2 + (y_i - b)^2} - R\right)^2$$ This function represents the error of my approximation of a set of points on a…
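The gradient of a sum is simply the sum of the per-point gradients, so the descent loop just accumulates them. A sketch for the stated loss, under the assumption that the radius $R$ is known (the excerpt doesn't say whether $R$ is also optimized):

```python
import numpy as np

def center_grad(a, b, R, pts):
    # d_i = distance from (a, b) to point i; loss = sum_i (d_i - R)^2.
    dx, dy = pts[:, 0] - a, pts[:, 1] - b
    d = np.sqrt(dx**2 + dy**2)
    ga = np.sum(2.0 * (d - R) * (-dx / d))   # d(loss)/da, summed over points
    gb = np.sum(2.0 * (d - R) * (-dy / d))   # d(loss)/db
    return ga, gb

# Noisy points on a circle of radius 2 centered at (1, -1).
t = np.linspace(0.0, 2.0 * np.pi, 100)
pts = np.column_stack([1 + 2 * np.cos(t), -1 + 2 * np.sin(t)])
pts += np.random.normal(0.0, 0.05, pts.shape)

a, b, R, alpha = 0.0, 0.0, 2.0, 0.005
for _ in range(2000):
    ga, gb = center_grad(a, b, R, pts)
    a, b = a - alpha * ga, b - alpha * gb
print(a, b)  # ≈ (1, -1)
```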
2 votes • 1 answer

Why does Ensemble Averaging actually improve results?

Why does ensemble averaging work for neural networks? This is the main idea behind things like dropout. Consider an example of a hypersurface defined by the following image (white means lowest Cost). We have two networks: yellow and red, each…
Kari • 2,726
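Part of the standard explanation (a toy illustration, not taken from the answers here) is variance reduction: averaging de-correlated, individually noisy predictors shrinks the spread of the error:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = 0.0
# Ten independent "networks", each an unbiased but noisy estimator.
preds = rng.normal(truth, 1.0, size=(10_000, 10))

single_mse = np.mean((preds[:, 0] - truth) ** 2)           # ≈ 1.0
ensemble_mse = np.mean((preds.mean(axis=1) - truth) ** 2)  # ≈ 0.1
print(single_mse, ensemble_mse)  # ~10x lower error with 10 averaged members
```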
2 votes • 1 answer

Should I consider feature scaling for all gradient descent based algorithms?

In the Coursera machine learning course, in the section on Multivariate Linear Regression, Andrew Ng provides the following tips on gradient descent: use feature scaling to converge quicker; get each feature into an approximate $-1 < x < 1$ range; Mean…
Chris Snow • 215
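Those tips boil down to a single preprocessing step applied before any gradient-descent-based fit; a minimal sketch (illustrative names):

```python
import numpy as np

def scale_features(X):
    # Mean normalization plus scaling: each column ends up with zero
    # mean and unit spread, so all weights see similarly sized gradients.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma for new data

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_scaled, mu, sigma = scale_features(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ≈ 0 and 1 per column
```

The same mu and sigma must be reused when scaling test or prediction inputs.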
2 votes • 1 answer

Why does a horizontal cross section of a squared error surface yield ellipses?

Also, can someone please explain why the descent happens in a direction perpendicular to the contour lines?
user1274878 • 171
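Both facts follow from writing the squared error as a quadratic form (a standard derivation, not quoted from an answer): near the minimum $\mathbf{w}^*$, $E(\mathbf{w}) \approx \frac{1}{2}(\mathbf{w}-\mathbf{w}^*)^\top H\,(\mathbf{w}-\mathbf{w}^*)$ with $H$ positive definite, so:

```latex
% A horizontal cross section is a level set E(w) = c:
(\mathbf{w}-\mathbf{w}^*)^\top H \,(\mathbf{w}-\mathbf{w}^*) = 2c
% For positive-definite H this is an ellipse whose axes lie along the
% eigenvectors of H, with lengths set by the eigenvalues.

% Perpendicularity: along any path w(t) that stays on one contour,
% E(w(t)) is constant, so the chain rule gives
\frac{d}{dt} E(\mathbf{w}(t)) = \nabla E \cdot \mathbf{w}'(t) = 0
% The gradient is orthogonal to every tangent of the contour, and
% descent moves along -\nabla E, i.e. perpendicular to the contour lines.
```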