- Weight initialization is a big issue in deep learning training
- a poor initialization can trap training in bad local minima or stall learning entirely
Zero Initialization
- all weights are initialized to zero
- every neuron in a layer then computes exactly the same output
- back-propagation updates all of them with the same gradient, so they never become different
- this symmetry defeats the purpose of having multiple neurons and multiple layers (see the sketch below)
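A minimal NumPy sketch of the symmetry problem, assuming a single tanh hidden layer with 5 units, a made-up input, and a dummy loss; all sizes here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features
W1 = np.zeros((3, 5))                # zero-initialized hidden layer
b1 = np.zeros(5)

h = np.tanh(x @ W1 + b1)             # every hidden unit produces the same output
print(np.allclose(h, h[:, :1]))      # True: all 5 hidden columns are identical

# Gradient of a dummy loss L = h.sum(): identical for every hidden unit as well,
# so the units stay identical after every update and symmetry is never broken.
grad_z = np.ones_like(h) * (1 - h**2)
grad_W1 = x.T @ grad_z
print(np.allclose(grad_W1, grad_W1[:, :1]))  # True: all gradient columns match
```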
Normal Distribution
- with the sigmoid activation function, a careless initialization causes the gradient vanishing problem
- suppose the weights are initialized from a normal distribution with mean = 0, std = 1
- because the std is too large, the pre-activations are large and the sigmoid outputs saturate near 0 or 1
- in those saturated regions the sigmoid's gradient is nearly 0, so gradients vanish as they propagate backward
- this can be mitigated by choosing a different activation function (e.g., ReLU)
- it can also be addressed by initializing the weights correctly
- suppose we instead decrease the std to 0.01
- as the network gets deeper, the activations collapse toward 0.5 (the sigmoid of a near-zero input)
- the problem is still not solved; the simulation below shows both cases
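A rough NumPy simulation of both cases, assuming 5 fully connected sigmoid layers of 100 units each and standard-normal input data (all of these sizes are arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run(std, n_layers=5, n_units=100, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, n_units))
    for i in range(n_layers):
        W = rng.normal(0.0, std, size=(n_units, n_units))
        a = sigmoid(a @ W)
        print(f"std={std}, layer {i+1}: mean={a.mean():.3f}, std={a.std():.3f}")

run(std=1.0)    # activations pile up near 0 and 1 -> sigmoid saturates, gradients vanish
run(std=0.01)   # activations collapse toward 0.5 and lose variance -> still problematic
```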
LeCun Initialization
$$
W\sim N(0,Var(W))\\Var(W)=\frac{1}{n_{in}}\\LeCun\ Normal\ Initialization
$$
$$
W\sim U(-\sqrt{\frac{3}{n_{in}}},+\sqrt{\frac{3}{n_{in}}})\\LeCun\ Uniform\ Initialization
$$
- LeCun is the inventor of LeNet
- LeCun initialization was originally proposed in the "Efficient BackProp" paper
- it is not the most preferred method nowadays (a NumPy sketch follows below)
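A minimal NumPy sketch of the two LeCun initializers above; the layer sizes (256 in, 128 out) are arbitrary:

```python
import numpy as np

def lecun_normal(n_in, n_out, seed=None):
    # Var(W) = 1 / n_in, so the standard deviation is sqrt(1 / n_in)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

def lecun_uniform(n_in, n_out, seed=None):
    # limit = sqrt(3 / n_in) gives the same variance, 1 / n_in, under a uniform distribution
    rng = np.random.default_rng(seed)
    limit = np.sqrt(3.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = lecun_normal(256, 128, seed=0)
print(W.std())   # roughly sqrt(1/256) = 0.0625
```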
Xavier Initialization (Glorot Initialization)
$$
W\sim N(0,Var(W))\\Var(W)=\frac{2}{n_{in}+n_{out}}\\Xavier\ Normal\ Initialization
$$
$$
W\sim U(-\sqrt{\frac{6}{n_{in}+n_{out}}},+\sqrt{\frac{6}{n_{in}+n_{out}}})\\Xavier\ Uniform\ Initialization
$$
- same structure as LeCun's initialization
- additionally takes the number of nodes in the next layer ($n_{out}$) into account
- the paper claims this constant keeps the variance of activations and gradients balanced across layers (sketch below)
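A matching NumPy sketch of the Xavier (Glorot) initializers, again with arbitrary layer sizes; in PyTorch the equivalent built-ins are torch.nn.init.xavier_normal_ and torch.nn.init.xavier_uniform_:

```python
import numpy as np

def xavier_normal(n_in, n_out, seed=None):
    # Var(W) = 2 / (n_in + n_out)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

def xavier_uniform(n_in, n_out, seed=None):
    # limit = sqrt(6 / (n_in + n_out)) yields the same variance under a uniform distribution
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(256, 128, seed=0)
print(W.min(), W.max())   # bounded by ±sqrt(6/384) ≈ ±0.125
```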