- Weight initialization is a big issue in deep learning training
- a poor initialization can trap training in bad local minima or stall learning entirely
Zero Initialization
- all weights are initialized to zero
- every neuron in a layer then computes exactly the same output
- back-propagation updates all of them with the same gradient, so they never become different
- this symmetry defeats the purpose of having multiple neurons and multiple layers (see the sketch below)
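A minimal NumPy sketch of the symmetry problem, assuming a single tanh hidden layer with 5 units, a made-up input, and a dummy loss; all sizes here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features
W1 = np.zeros((3, 5))                # zero-initialized hidden layer
b1 = np.zeros(5)

h = np.tanh(x @ W1 + b1)             # every hidden unit produces the same output
print(np.allclose(h, h[:, :1]))      # True: all 5 hidden columns are identical

# Gradient of a dummy loss L = h.sum(): identical for every hidden unit as well,
# so the units stay identical after every update and symmetry is never broken.
grad_z = np.ones_like(h) * (1 - h**2)
grad_W1 = x.T @ grad_z
print(np.allclose(grad_W1, grad_W1[:, :1]))  # True: all gradient columns match
```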
Normal Distribution
- with the sigmoid activation function, a careless initialization causes the gradient vanishing problem
- suppose the weights are initialized from a normal distribution with mean = 0, std = 1
- because the std is too large, the pre-activations are large and the sigmoid outputs saturate near 0 or 1
- in those saturated regions the sigmoid's gradient is nearly 0, so gradients vanish as they propagate backward
- this can be mitigated by choosing a different activation function (e.g., ReLU)
- it can also be addressed by initializing the weights correctly
- suppose we instead decrease the std to 0.01
- as the network gets deeper, the activations collapse toward 0.5 (the sigmoid of a near-zero input)
- the problem is still not solved; the simulation below shows both cases
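A rough NumPy simulation of both cases, assuming 5 fully connected sigmoid layers of 100 units each and standard-normal input data (all of these sizes are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run(std, n_layers=5, n_units=100, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, n_units))
    for i in range(n_layers):
        W = rng.normal(0.0, std, size=(n_units, n_units))
        a = sigmoid(a @ W)
        print(f"std={std}, layer {i+1}: mean={a.mean():.3f}, std={a.std():.3f}")

run(std=1.0)    # activations pile up near 0 and 1 -> sigmoid saturates, gradients vanish
run(std=0.01)   # activations collapse toward 0.5 and lose variance -> still problematic
```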
LeCun Initialization
$$
W\sim N(0,Var(W))\\Var(W)=\frac{1}{n_{in}}\\LeCun\ Normal\ Initialization
$$
$$
W\sim U(-\sqrt{\frac{3}{n_{in}}},+\sqrt{\frac{3}{n_{in}}})\\LeCun\ Uniform\ Initialization
$$
- LeCun is the inventor of LeNet
- LeCun initialization was originally proposed in the "Efficient BackProp" paper
- it is not the most preferred method nowadays (a NumPy sketch follows below)
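A minimal NumPy sketch of the two LeCun initializers above; the layer sizes (256 in, 128 out) are arbitrary:

```python
import numpy as np

def lecun_normal(n_in, n_out, seed=None):
    # Var(W) = 1 / n_in, so the standard deviation is sqrt(1 / n_in)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def lecun_uniform(n_in, n_out, seed=None):
    # limit = sqrt(3 / n_in) gives the same variance, 1 / n_in, under a uniform distribution
    rng = np.random.default_rng(seed)
    limit = np.sqrt(3.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = lecun_normal(256, 128, seed=0)
print(W.std())   # roughly sqrt(1/256) = 0.0625
```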
Xavier Initialization (Glorot Initialization)
$$
W\sim N(0,Var(W))\\Var(W)=\frac{2}{n_{in}+n_{out}}\\Xavier\ Normal\ Initialization
$$
$$
W\sim U(-\sqrt{\frac{6}{n_{in}+n_{out}}},+\sqrt{\frac{6}{n_{in}+n_{out}}})\\Xavier\ Uniform\ Initialization
$$
- same structure as LeCun's initialization
- additionally takes the number of nodes in the next layer ($n_{out}$) into account
- the paper claims this constant keeps the variance of activations and gradients balanced across layers (sketch below)
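A matching NumPy sketch of the Xavier (Glorot) initializers, again with arbitrary layer sizes; in PyTorch the equivalent built-ins are torch.nn.init.xavier_normal_ and torch.nn.init.xavier_uniform_:

```python
import numpy as np

def xavier_normal(n_in, n_out, seed=None):
    # Var(W) = 2 / (n_in + n_out)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

def xavier_uniform(n_in, n_out, seed=None):
    # limit = sqrt(6 / (n_in + n_out)) yields the same variance under a uniform distribution
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128, seed=0)
print(W.min(), W.max())   # bounded by ±sqrt(6/384) ≈ ±0.125
```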