Fixed Step Methods

“不积跬步，无以至千里” (“Without accumulating small steps, one cannot reach a thousand li.”) – Xunzi (荀子)

These methods require only gradient (first-order derivative) information: at each iteration they apply a ‘fixed’ step size along the (negative) gradient direction.
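As a concrete illustration, here is a minimal NumPy sketch of fixed-step gradient descent on a small quadratic; the matrix, vector, step size, and iteration count are arbitrary choices for demonstration, not part of any particular method.

```python
import numpy as np

# Toy objective f(x) = 0.5 * x^T A x - b^T x (A and b are made-up example values)
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive-definite matrix
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b                     # gradient of the quadratic objective

x = np.zeros(2)
lr = 0.1                                 # the 'fixed' step size
for _ in range(200):
    x -= lr * grad(x)                    # step along the negative gradient

print(x, np.linalg.solve(A, b))          # iterate approaches the exact minimizer A^{-1} b
```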

Stochastic Gradient Descent: uses the gradient of a randomly sampled data point (or mini-batch) from the dataset to update the parameters.
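A minimal SGD sketch on a synthetic least-squares problem; the data, noise level, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)   # synthetic regression data

w = np.zeros(3)
lr = 0.05
for _ in range(5000):
    i = rng.integers(len(X))                   # sample one data point at random
    g = (X[i] @ w - y[i]) * X[i]               # gradient of 0.5 * (x_i^T w - y_i)^2
    w -= lr * g                                # update the parameters with this noisy gradient

print(w)                                       # fluctuates near w_true
```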

Momentum: adds a 1st moment (a moving average of past gradients) to the update step, which smooths the trajectory and helps avoid getting stuck in local optima.
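A sketch of the common momentum update v ← βv + g, x ← x − ηv, shown on an ill-conditioned toy quadratic; β, the step size, and the objective are illustrative choices.

```python
import numpy as np

def grad(x):
    # Gradient of f(x) = 5*x0^2 + 0.5*x1^2 (an ill-conditioned toy quadratic)
    return np.array([10.0 * x[0], 1.0 * x[1]])

x = np.array([1.0, 1.0])
v = np.zeros(2)                 # 1st moment: moving average of past gradients
lr, beta = 0.05, 0.9
for _ in range(200):
    v = beta * v + grad(x)      # accumulate past gradients
    x -= lr * v                 # step along the smoothed direction

print(x)                        # approaches the minimizer at the origin
```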

Adagrad: adapts the learning rate per parameter, effectively giving larger steps to sparse (infrequently updated) parameters. This improves convergence in settings where the data is sparse and the sparse parameters are more informative.
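A sketch of Adagrad on a toy regression problem with two artificially sparse features; the data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 2:] *= rng.random((200, 2)) < 0.05     # make the last two features sparse (mostly zero)
w_true = np.array([1.0, -1.0, 3.0, -3.0])
y = X @ w_true

w = np.zeros(4)
G = np.zeros(4)                             # accumulated squared gradients, per parameter
lr, eps = 0.5, 1e-8
for _ in range(500):
    g = X.T @ (X @ w - y) / len(X)          # full-batch gradient of the mean squared error / 2
    G += g ** 2                             # grows slowly for rarely-active (sparse) features
    w -= lr * g / (np.sqrt(G) + eps)        # larger effective step for sparse parameters

print(w, w_true)                            # w moves toward w_true
```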

RMSProp: uses a 2nd moment (a moving average of squared gradients) to give each parameter an adaptive learning rate when updating.
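A sketch of the RMSProp update on a badly scaled quadratic; the decay rate, step size, and epsilon are commonly used values, chosen here purely for illustration.

```python
import numpy as np

def grad(x):
    # Gradient of f(x) = 50*x0^2 + 0.5*x1^2 (curvatures differ by a factor of 100)
    return np.array([100.0 * x[0], 1.0 * x[1]])

x = np.array([1.0, 1.0])
s = np.zeros(2)                          # 2nd moment: moving average of squared gradients
lr, rho, eps = 0.01, 0.9, 1e-8
for _ in range(1000):
    g = grad(x)
    s = rho * s + (1 - rho) * g ** 2     # track squared gradient magnitude per parameter
    x -= lr * g / (np.sqrt(s) + eps)     # per-parameter adaptive step

print(x)                                 # both coordinates end up near 0 despite very different curvatures
```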

Adam: combines the 1st moment (momentum) with a 2nd-moment-based adaptive learning rate, and applies bias correction to both moving averages.
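A sketch of the Adam update on the same badly scaled quadratic; β₁, β₂, and ε follow the commonly cited defaults and are used here only for illustration.

```python
import numpy as np

def grad(x):
    return np.array([100.0 * x[0], 1.0 * x[1]])    # same badly scaled quadratic as above

x = np.array([1.0, 1.0])
m = np.zeros(2)                                    # 1st moment (momentum term)
v = np.zeros(2)                                    # 2nd moment (squared gradients)
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(x)                                           # ends up near the minimizer at the origin
```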

References:

https://emiliendupont.github.io/2018/01/24/optimization-visualization/

https://en.wikipedia.org/wiki/Stochastic_gradient_descent