Fixed Step Methods
“Without accumulating small steps, one cannot reach a thousand li.” – Xunzi (荀子)
These methods require only gradient (first-order derivative) information: at each iteration they take a step of fixed size along the negative gradient direction, as in the sketch below. The common variants listed afterwards differ in how the gradient is estimated and how each step is scaled.
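As a concrete illustration, here is a minimal sketch of the basic fixed-step update in Python. The quadratic objective, the step size `lr`, and the iteration count are illustrative choices, not values from the text above.

```python
import numpy as np

def gradient_descent(grad_fn, x0, lr=0.1, num_steps=100):
    """Fixed-step gradient descent: repeatedly move against the gradient by a constant step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - lr * grad_fn(x)  # fixed stepsize along the negative gradient direction
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))  # approaches [0., 0.]
```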
Stochastic Gradient Descent (SGD): updates the parameters using the gradient computed on a randomly sampled example (or mini-batch) from the dataset rather than on the full dataset.
Momentum: adds a 1st moment (a moving average of past gradients) to the update step, which smooths the trajectory and helps the iterate move past shallow local optima.
Adagrad: adapts the learning rate per parameter, giving larger effective steps to rarely updated (sparse) parameters. This improves convergence in settings where the data is sparse and the sparse parameters are more informative.
RMSProp: uses an adaptive per-parameter learning rate based on a 2nd moment (a moving average of squared gradients) to scale each update.
Adam: combines an adaptive learning rate with both a 1st moment and a 2nd moment estimate (plus bias correction). Update rules for all five methods are sketched in the code after this list.
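To make the differences concrete, here is a minimal side-by-side sketch of the update rules in Python. The hyperparameter values (`lr`, `beta`, `eps`, ...) are common defaults chosen for illustration rather than values given above; `grad` stands for the (mini-batch) gradient at the current parameters `w`, and the state variables (`v`, `g2_sum`, `s`, `m`, and the timestep `t`) would be initialized to zeros (with `t` starting at 1) and carried through the training loop.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: fixed step against the gradient of a random sample / mini-batch.
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # 1st moment: moving average of past gradients (heavy-ball formulation).
    v = beta * v + grad
    return w - lr * v, v

def adagrad_step(w, g2_sum, grad, lr=0.01, eps=1e-8):
    # Accumulated squared gradients: rarely updated (sparse) parameters
    # keep a larger effective learning rate.
    g2_sum = g2_sum + grad ** 2
    return w - lr * grad / (np.sqrt(g2_sum) + eps), g2_sum

def rmsprop_step(w, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    # 2nd moment as an exponential moving average, so the denominator
    # does not grow without bound as in Adagrad.
    s = beta * s + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: bias-corrected 1st and 2nd moment estimates combined with an
    # adaptive per-parameter learning rate.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```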