Optimization
Problems with SGD
- When the loss function is much more sensitive to some directions than others
- i.e. the loss has a bad (high) condition number
- slow progress along the shallow dimension while jittering along the steep direction
- much more common in high dimensions
- Local minimum problem
- SGD has a hard time getting over certain local minima
- similar at saddle points
- gradient is small or zero there
- saddle points are more of a problem in real, high-dimensional problems
- it is not so common for the loss to increase in every dimension (which a local minimum requires)
- Gradient is only computed on a small minibatch
- that is the "stochastic" part
- too expensive to compute the gradient over the entire dataset
- minibatch gradients can be noisy
- so the optimization path can be inefficient (see the sketch after this list)
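A minimal sketch of the conditioning problem on a hypothetical badly conditioned quadratic loss; the loss, learning rate, and starting point are illustrative assumptions (and there is no minibatch noise here, only the zigzag/slow-progress behaviour):

```python
import numpy as np

# Hypothetical badly conditioned loss: f(x, y) = 0.5 * (x**2 + 20 * y**2)
# -> steep along y, shallow along x (condition number 20)
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([-10.0, 1.0])   # start far out along the shallow axis
lr = 0.09                    # near the stability limit of the steep axis

for step in range(20):
    w = w - lr * grad(w)     # vanilla SGD step: straight along the gradient

print(w)  # x is still far from the optimum at 0, while y has zigzagged close to 0
```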
Momentum
- maintain a velocity over time
- add gradient estimates to the velocity: $v_{t+1}=\rho v_t-\alpha\nabla f(x_t)$
- step in the direction of the velocity instead of the gradient: $x_{t+1}=x_t+v_{t+1}$
- $\rho$: friction
- decays the current velocity by $\rho$ at every step
- $\rho=0.9$ is a common choice
- simple idea, but it addresses pretty much all of the problems above (see the sketch after this list)
- the velocity can carry us past local minima and saddle points
- the zigzagging gradient components cancel each other out
- reducing the step size along the sensitive (steep) direction
- minibatch noise gets averaged out over time
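A minimal SGD+Momentum sketch on the same hypothetical quadratic loss as above; the learning rate and friction values are illustrative assumptions:

```python
import numpy as np

# Same hypothetical ill-conditioned loss: f(x, y) = 0.5 * (x**2 + 20 * y**2)
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([-10.0, 1.0])
v = np.zeros_like(w)          # velocity, initialised to zero
lr, rho = 0.05, 0.9           # rho is the friction on the velocity

for step in range(100):
    v = rho * v - lr * grad(w)   # decay the old velocity, add the new gradient estimate
    w = w + v                    # step along the velocity, not the raw gradient

print(w)  # both coordinates head toward the minimum at (0, 0)
```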
Nesterov Momentum
$v_{t+1}=\rho v_t-\alpha\nabla f(x_t+\rho v_t)$
$x_{t+1}=x_t+v_{t+1}$
- first step in the direction of the velocity
- evaluate the gradient at that look-ahead point
- then go back and take a step that mixes the two
- can correct a wrong velocity direction a bit sooner
- has nice theoretical guarantees for convex optimization (see the sketch below)
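A minimal Nesterov momentum sketch on the same hypothetical quadratic loss; the only change from plain momentum is where the gradient is evaluated (hyperparameters are illustrative assumptions):

```python
import numpy as np

# Same hypothetical ill-conditioned loss: f(x, y) = 0.5 * (x**2 + 20 * y**2)
def grad(w):
    x, y = w
    return np.array([x, 20.0 * y])

w = np.array([-10.0, 1.0])
v = np.zeros_like(w)
lr, rho = 0.05, 0.9

for step in range(100):
    lookahead = w + rho * v              # where the current velocity would take us
    v = rho * v - lr * grad(lookahead)   # v_{t+1} = rho*v_t - alpha*grad f(x_t + rho*v_t)
    w = w + v                            # x_{t+1} = x_t + v_{t+1}

print(w)  # converges toward the minimum at (0, 0)
```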