Questions tagged [gradient-descent]

For questions about gradient descent, a method for finding optimal parameters of a parameterized function by minimizing another function, often called the loss or error function. It iteratively descends the loss surface toward a minimum by adjusting the parameters in the direction opposite the gradient (the vector of partial derivatives), scaled by a learning rate.

The loss function is sometimes called an error function. Its negation, which is maximized rather than minimized, is sometimes called a fitness or value function.

Each iteration aims to decrease the value of the loss function. It does so by computing the gradient of the loss with respect to the parameters and using it to determine the incremental change in the parameters that is likely to reduce the loss. Gradient descent is often used in conjunction with backpropagation, which distributes the corrective signal over a sequence of layers, each of which is parameterized.

To avoid overshooting the optimum, which can lead to oscillation or divergence, the corrective signal is attenuated by a factor called the learning rate. Too low a learning rate, however, compromises the speed of convergence.
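As a point of reference, below is a minimal sketch of the basic update rule $\theta \leftarrow \theta - \alpha \nabla L(\theta)$ written in plain Python, fitting a one-parameter least-squares model. The toy data, the learning rate of 0.1, and the fixed step count are illustrative assumptions, not part of any particular question on this page.

```python
# Minimal gradient descent sketch (illustrative only): fit y = theta * x by
# minimizing mean squared error. Data, learning rate, and step count are
# arbitrary assumptions chosen so the loop converges.

def loss(theta, xs, ys):
    """Mean squared error of the one-parameter model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(theta, xs, ys):
    """Derivative of the loss with respect to theta (the gradient in 1D)."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, theta=0.0, learning_rate=0.1, steps=100):
    for _ in range(steps):
        theta -= learning_rate * grad(theta, xs, ys)  # step against the gradient
    return theta

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]              # roughly y = 2x
theta = gradient_descent(xs, ys)
print(theta, loss(theta, xs, ys))  # theta ends up close to 2
```

Raising the learning rate well above 0.1 in this toy example makes the iterates oscillate and then diverge, which is the overshooting behavior described above.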

Several strategies exist for choosing the loss function, for tuning the hyperparameters that shape the back-propagated corrective signal (for example, learning rate schedules), and for combining gradient descent with other search strategies, in order to improve reliability, speed, or accuracy.

204 questions
1
vote
2 answers

Why use learning rate schedules if weight updates automatically decrease when approaching a local optimum?

Andrew Ng said in his slide that: However, there are numerous types of 'learning rate schedules' in TensorFlow that change the learning rate profile as training progresses. If it's true that these adjustments are redundant then why do we need…
Wong
  • 11
  • 2
1
vote
2 answers

What would 1D gradient descent look like?

We have always known that gradient descent is a function of two or more variables. But how can we geometrically represent gradient descent if it is a function of only one variable?
Parul S
  • 121
  • 1
  • 2
0
votes
0 answers

Relation between the number of parameters and the features in the gradient descent algorithm

My book describes this as an equation for minimizing the $\theta$ value, but I have a few questions regarding the intuition behind this equation: The book describes $j$ as the number of features. If we have to compute the $\theta$ value for every…
0
votes
1 answer

Numerical problems with gradient descent

I'm trying to implement a simple neural network for classification (multi-class) as an exercise (written in C). During gradient descent, the weights and biases quickly get out of control and the gradient becomes infinite. I haven't been able to find…
martinkunev
  • 255
  • 1
  • 7