Gradient Descent (GD)
$$
W:=W-\eta\frac{\partial L}{\partial W}
$$
- basic optimizer algorithm
- updates W by going down the gradient step by step
- full batch: the gradient is computed over the entire training set for each update (see the sketch after this list)
- Reason to use GD
- why take gradient steps at all when you could just solve for the point where the gradient of cost(W) is zero?
- because a closed-form solution does not exist for most non-linear models
- even when a closed-form solution exists, the number of parameters is too large, so GD is more efficient
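Below is a minimal NumPy sketch of full-batch gradient descent on a toy least-squares problem; the data, model, learning rate, and iteration count are illustrative assumptions, not part of these notes.

```python
import numpy as np

# Toy least-squares setup (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

W = np.zeros(3)                              # parameters to learn
eta = 0.1                                    # learning rate

for _ in range(200):
    pred = X @ W                                  # prediction on the FULL batch
    grad = 2.0 / len(X) * X.T @ (pred - y)        # dL/dW of the mean squared error
    W = W - eta * grad                            # W := W - eta * dL/dW
```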
Stochastic Gradient Descent (SGD)
- mini batch: the gradient is computed on a small random subset of the training data
- adjusts the weights after each mini-batch
- the update rule itself is the same as GD (see the sketch after this list)
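A minimal sketch of the mini-batch version, reusing the same toy least-squares setup as above; the batch size and epoch count are my own illustrative choices.

```python
import numpy as np

# Same toy least-squares setup as the GD sketch (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

W = np.zeros(3)
eta, batch_size = 0.1, 16

for epoch in range(50):
    idx = rng.permutation(len(X))                    # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ W - yb)  # gradient on the mini-batch only
        W = W - eta * grad                           # same update rule as GD
```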
Momentum
$$
v=\alpha v-\eta\frac{\partial L}{\partial W}\\W=W+v
$$
- can be applied to SGD
- incorporates the results of previous batches' updates into the current update
- adds a velocity term $v$ (an acceleration effect) that accumulates past gradients, scaled by the momentum coefficient $\alpha$ (see the sketch after this list)
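A minimal sketch of a Momentum optimizer; the class name, the `update(params, grads)` dictionary interface, and the default hyperparameters are assumptions for illustration, not from these notes.

```python
import numpy as np

class Momentum:
    """Momentum update: v = alpha*v - eta*dL/dW, then W = W + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr                 # eta, the learning rate
        self.momentum = momentum     # alpha, the momentum coefficient
        self.v = None                # velocity, one array per parameter

    def update(self, params, grads):
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            # v = alpha * v - eta * dL/dW  (accumulates past gradients)
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            # W = W + v
            params[key] += self.v[key]
```

Usage would be one `optimizer.update(params, grads)` call per mini-batch, with `params` and `grads` as dicts of NumPy arrays keyed by parameter name.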
AdaGrad
$$
h=h+\frac{\partial L}{\partial W}\odot\frac{\partial L}{\partial W}\\W=W-\eta\frac{1}{\sqrt{h}}\frac{\partial L}{\partial W}
$$
- reduces the effective learning rate for weights whose gradients have been large
- keeps the effective learning rate relatively high for weights whose gradients have been small or zero
- $h$ is the new variable added on top of the SGD update
- $h$ accumulates the element-wise squares of the gradients (see the sketch after this list)
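A minimal sketch of AdaGrad in the same assumed `update(params, grads)` style; the small epsilon (1e-7) added to the square root to avoid division by zero is a common implementation detail, not part of the formula above.

```python
import numpy as np

class AdaGrad:
    """AdaGrad update: h += (dL/dW)**2, then W -= eta * dL/dW / sqrt(h)."""

    def __init__(self, lr=0.01):
        self.lr = lr     # eta, the learning rate
        self.h = None    # running sum of squared gradients, per parameter

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            # h = h + (dL/dW) * (dL/dW)  (element-wise square, accumulated)
            self.h[key] += grads[key] * grads[key]
            # W = W - eta * (1 / sqrt(h)) * dL/dW
            # 1e-7 (assumption) guards against division by zero on the first steps
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```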