Optimization
Problems with SGD
- When the loss function is much more sensitive to some directions than others
- i.e. the loss has a bad (high) condition number
- slow progress along the shallow dimension while jittering along the steep direction
- much more common in high dimensions
- Local minimum problem
- SGD has a hard time getting over certain local minima
- similar at saddle points
- gradient is small or zero there
- saddle points are more of a problem in real, high-dimensional problems
- it is not so common for the loss to increase in every dimension (which a local minimum requires)
- Gradient is only computed on a small minibatch
- that is the "stochastic" part
- too expensive to compute the gradient over the entire dataset
- minibatch gradients can be noisy
- so the optimization path can be inefficient (see the sketch after this list)
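A minimal sketch of the conditioning problem on a hypothetical badly conditioned quadratic loss; the loss, learning rate, and starting point are illustrative assumptions (and there is no minibatch noise here, only the zigzag/slow-progress behaviour):

```python
import numpy as np

# Hypothetical badly conditioned loss: f(x, y) = 0.5 * (x**2 + 20 * y**2)
# -> steep along y, shallow along x (condition number 20)
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([-10.0, 1.0])   # start far out along the shallow axis
lr = 0.09                    # near the stability limit of the steep axis

for step in range(20):
    w = w - lr * grad(w)     # vanilla SGD step: straight along the gradient

print(w)  # x is still far from the optimum at 0, while y has zigzagged close to 0
```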
Momentum
- maintain a velocity over time
- add gradient estimates to the velocity: $v_{t+1}=\rho v_t-\alpha\nabla f(x_t)$
- step in the direction of the velocity instead of the gradient: $x_{t+1}=x_t+v_{t+1}$
- $\rho$: friction
- decays the current velocity by $\rho$ at every step
- $\rho=0.9$ is a common choice
- simple idea, but it addresses pretty much all of the problems above (see the sketch after this list)
- the velocity can carry us past local minima and saddle points
- the zigzagging gradient components cancel each other out
- reducing the step size along the sensitive (steep) direction
- minibatch noise gets averaged out over time
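A minimal SGD+Momentum sketch on the same hypothetical quadratic loss as above; the learning rate and friction values are illustrative assumptions:

```python
import numpy as np

# Same hypothetical ill-conditioned loss: f(x, y) = 0.5 * (x**2 + 20 * y**2)
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([-10.0, 1.0])
v = np.zeros_like(w)          # velocity, initialised to zero
lr, rho = 0.05, 0.9           # rho is the friction on the velocity

for step in range(100):
    v = rho * v - lr * grad(w)   # decay the old velocity, add the new gradient estimate
    w = w + v                    # step along the velocity, not the raw gradient

print(w)  # both coordinates head toward the minimum at (0, 0)
```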
Nesterov Momentum
$v_{t+1}=\rho v_t-\alpha\nabla f(x_t+\rho v_t)$
$x_{t+1}=x_t+v_{t+1}$
- first step in the direction of the velocity
- evaluate the gradient at that look-ahead point
- then go back and take a step that mixes the two
- can correct a wrong velocity direction a bit sooner
- has nice theoretical guarantees for convex optimization (see the sketch below)
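A minimal Nesterov momentum sketch on the same hypothetical quadratic loss; the only change from plain momentum is where the gradient is evaluated (hyperparameters are illustrative assumptions):

```python
import numpy as np

# Same hypothetical ill-conditioned loss: f(x, y) = 0.5 * (x**2 + 20 * y**2)
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([-10.0, 1.0])
v = np.zeros_like(w)
lr, rho = 0.05, 0.9

for step in range(100):
    lookahead = w + rho * v              # where the current velocity would take us
    v = rho * v - lr * grad(lookahead)   # v_{t+1} = rho*v_t - alpha*grad f(x_t + rho*v_t)
    w = w + v                            # x_{t+1} = x_t + v_{t+1}

print(w)  # converges toward the minimum at (0, 0)
```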