Gradient Descent (GD)
$$
W:=W-\eta\frac{\partial L}{\partial W}
$$
- basic optimizer algorithm
- updates W by going down the gradient step by step
- full batch: the gradient is computed over the entire training set for each update (see the sketch after this list)
- Reason to use GD
- why take gradient steps at all when you could just solve for the point where the gradient of cost(W) is zero?
- because a closed-form solution does not exist for most non-linear models
- even when a closed-form solution exists, the number of parameters is too large, so GD is more efficient
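Below is a minimal NumPy sketch of full-batch gradient descent on a toy least-squares problem; the data, model, learning rate, and iteration count are illustrative assumptions, not part of these notes.

```python
import numpy as np

# Toy least-squares setup (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

W = np.zeros(3)                              # parameters to learn
eta = 0.1                                    # learning rate

for _ in range(200):
    pred = X @ W                                  # prediction on the FULL batch
    grad = 2.0 / len(X) * X.T @ (pred - y)        # dL/dW of the mean squared error
    W = W - eta * grad                            # W := W - eta * dL/dW
```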
Stochastic Gradient Descent (SGD)
- mini batch: the gradient is computed on a small random subset of the training data
- adjusts the weights after each mini-batch
- the update rule itself is the same as GD (see the sketch after this list)
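A minimal sketch of the mini-batch version, reusing the same toy least-squares setup as above; the batch size and epoch count are my own illustrative choices.

```python
import numpy as np

# Same toy least-squares setup as the GD sketch (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

W = np.zeros(3)
eta, batch_size = 0.1, 16

for epoch in range(50):
    idx = rng.permutation(len(X))                    # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ W - yb)  # gradient on the mini-batch only
        W = W - eta * grad                           # same update rule as GD
```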
Momentum
$$
v=\alpha v-\eta\frac{\partial L}{\partial W}\\W=W+v
$$
- can be applied to SGD
- incorporates the results of previous batches' updates into the current update
- adds a velocity term $v$ (an acceleration effect) that accumulates past gradients, scaled by the momentum coefficient $\alpha$ (see the sketch after this list)
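A minimal sketch of a Momentum optimizer; the class name, the `update(params, grads)` dictionary interface, and the default hyperparameters are assumptions for illustration, not from these notes.

```python
import numpy as np

class Momentum:
    """Momentum update: v = alpha*v - eta*dL/dW, then W = W + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr                 # eta, the learning rate
        self.momentum = momentum     # alpha, the momentum coefficient
        self.v = None                # velocity, one array per parameter

    def update(self, params, grads):
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            # v = alpha * v - eta * dL/dW  (accumulates past gradients)
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            # W = W + v
            params[key] += self.v[key]
```

Usage would be one `optimizer.update(params, grads)` call per mini-batch, with `params` and `grads` as dicts of NumPy arrays keyed by parameter name.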
AdaGrad
$$
h=h+\frac{\partial L}{\partial W}\odot\frac{\partial L}{\partial W}\\W=W-\eta\frac{1}{\sqrt{h}}\frac{\partial L}{\partial W}
$$
- reduces the effective learning rate for weights whose gradients have been large
- keeps the effective learning rate relatively high for weights whose gradients have been small or zero
- $h$ is the new variable added on top of the SGD update
- $h$ accumulates the element-wise squares of the gradients (see the sketch after this list)
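A minimal sketch of AdaGrad in the same assumed `update(params, grads)` style; the small epsilon (1e-7) added to the square root to avoid division by zero is a common implementation detail, not part of the formula above.

```python
import numpy as np

class AdaGrad:
    """AdaGrad update: h += (dL/dW)**2, then W -= eta * dL/dW / sqrt(h)."""

    def __init__(self, lr=0.01):
        self.lr = lr     # eta, the learning rate
        self.h = None    # running sum of squared gradients, per parameter

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            # h = h + (dL/dW) * (dL/dW)  (element-wise square, accumulated)
            self.h[key] += grads[key] * grads[key]
            # W = W - eta * (1 / sqrt(h)) * dL/dW
            # 1e-7 (assumption) guards against division by zero on the first steps
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```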