Why not initialize neural network weights to zero?

Blurb published in September, 2025

Initialization is one of those details that becomes important when training networks from scratch (though since BERT, most researchers fine-tune pretrained models rather than train from scratch). Setting all weights to zero seems intuitive at first: for a classifier, it makes every class equiprobable at the start, which sounds reasonable. But the problem runs deeper than the initial predictions. Xavier initialization and He initialization are the common fixes, and this blurb explains how they address the core problem.

Why not zero (or a constant)?

Consider a simple two-layer network. The input is \( x \), which passes through a weight matrix \( W_1 \) to produce hidden activations \( h = W_1 x \). These hidden activations then pass through another weight matrix \( W_2 \) to produce the output \( y = W_2 h \).

If \( W_1 \) and \( W_2 \) are initialized to zero, then \( h \) is 0 regardless of the input, and the output becomes \( y = 0 \). During backpropagation, the gradient for \( W_2 \) is proportional to \( h \) and the gradient for \( W_1 \) flows through \( W_2 \), so both are exactly zero and nothing learns. The same failure appears with any constant initialization: all hidden units compute the same value, receive identical gradient signals, and update by the same amount, so they remain identical after each gradient step.
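
To make the synchronization concrete, here is a minimal NumPy sketch of this two-layer linear network with a squared-error loss (the widths, constant value, and learning rate are arbitrary choices for illustration). After one gradient step from a constant initialization, every row of \( W_1 \) is still identical to every other.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 4, 3, 2
x = rng.normal(size=(n_in, 1))        # one input example
t = rng.normal(size=(n_out, 1))       # target

# Constant initialization: every weight gets the same value.
c = 0.5
W1 = np.full((n_hidden, n_in), c)
W2 = np.full((n_out, n_hidden), c)

# Forward pass (linear network, squared-error loss).
h = W1 @ x                 # all hidden units are identical
y = W2 @ h
loss = 0.5 * np.sum((y - t) ** 2)

# Backward pass.
dy = y - t                 # dL/dy
dW2 = dy @ h.T             # every column of dW2 is identical
dh = W2.T @ dy             # every hidden unit gets the same gradient
dW1 = dh @ x.T             # every row of dW1 is identical

# One gradient step: the rows of W1 stay identical to each other.
W1 -= 0.1 * dW1
print(np.allclose(W1, W1[0]))   # True: hidden units never differentiate

# With exact zeros (c = 0), h = 0 and dh = 0, so neither W1 nor W2 moves at all.
```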

This example shows a general pattern: when weights are initialized to the same value, they stay synchronized throughout training. In fact, symmetry is a broader problem that goes beyond initialization. Even with proper initialization, deep networks can end up with fewer effective parameters than their total weight count, because some weights become functionally equivalent during training.

Variance-controlled random initialization

Random initialization breaks the symmetry. However, poor initialization can cause the variance of activations to shrink or grow as signals pass through layers. This can compound during backpropagation, causing gradients to vanish or explode.

Consider what happens when activations pass through a layer. For a single output unit, we compute \( z = \sum_{j=1}^{n_{\text{in}}} w_j a_j \), where \( w_j \) are the weights and \( a_j \) are the input activations. Assume the inputs are normalized to zero mean, the weights are initialized with zero mean, and the two are independent. Under these assumptions, the variance of each product simplifies to \( \text{Var}(w_j a_j) = \text{Var}(w_j) \, \text{Var}(a_j) \).
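
This identity is one line of algebra once independence and the zero means are in place:

$$ \text{Var}(w_j a_j) = \mathbb{E}[w_j^2 a_j^2] - \mathbb{E}[w_j a_j]^2 = \mathbb{E}[w_j^2]\,\mathbb{E}[a_j^2] - \big(\mathbb{E}[w_j]\,\mathbb{E}[a_j]\big)^2 = \text{Var}(w_j)\,\text{Var}(a_j) $$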

The variance of the output becomes: $$ \text{Var}(z) = \sum_{j=1}^{n_{\text{in}}} \text{Var}(w_j) \, \text{Var}(a_j) = n_{\text{in}} \cdot \text{Var}(W) \cdot \text{Var}(a) $$ where \( \text{Var}(W) \) and \( \text{Var}(a) \) are the shared weight and activation variances. If the per-layer factor \( n_{\text{in}} \cdot \text{Var}(W) \) is less than 1, variance shrinks at each layer; if it is greater than 1, variance grows exponentially with depth.
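
A quick NumPy experiment (with an arbitrary width of 256 and a depth of 30 purely linear layers) shows both failure modes, plus the balanced case where \( n_{\text{in}} \cdot \text{Var}(W) = 1 \):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 30                          # layer width and depth (arbitrary choices)

def propagate(weight_std):
    """Push unit-variance inputs through `depth` linear layers; return the final variance."""
    a = rng.normal(size=(n, 1000))          # 1000 input vectors with unit variance
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        a = W @ a
    return a.var()

print(propagate(0.5 / np.sqrt(n)))   # n * Var(W) = 0.25 -> variance collapses toward 0
print(propagate(1.0 / np.sqrt(n)))   # n * Var(W) = 1.00 -> variance stays on the order of 1
print(propagate(2.0 / np.sqrt(n)))   # n * Var(W) = 4.00 -> variance explodes
```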

Xavier initialization sets the weight variance to balance this. For the forward pass to maintain constant variance, the weight variance should be \( \frac{1}{n_{\text{in}}} \); for the backward pass (gradients), it should be \( \frac{1}{n_{\text{out}}} \). Xavier compromises by averaging the two fan sizes in the denominator.

Xavier initialization samples weights from a distribution with variance $$ \text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} $$ where \( n_{\text{in}} \) and \( n_{\text{out}} \) are the number of neurons in the previous and current layer. This works well for tanh and sigmoid activations.
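
As a minimal sketch, the Gaussian variant of Xavier initialization just scales a standard normal by \( \sqrt{2 / (n_{\text{in}} + n_{\text{out}})} \); the function name and shapes below are illustrative, not a reference implementation.

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng()):
    """Gaussian Xavier/Glorot init: zero mean, variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

W = xavier_init(512, 256)
print(W.var())                 # close to the target 2 / (512 + 256) ≈ 0.0026
```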

He initialization uses $$ \text{Var}(W) = \frac{2}{n_{\text{in}}} $$ which compensates for ReLU activations zeroing out roughly half the values.
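
The same depth experiment with ReLU activations (again with arbitrary width and depth) shows why the extra factor of 2 matters: He scaling keeps the signal magnitude roughly constant, while plain \( 1/n_{\text{in}} \) scaling loses about half of it at every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 30                      # layer width and depth (arbitrary choices)

def relu_stack(weight_std):
    """Push unit-variance inputs through `depth` ReLU layers; return the mean squared activation."""
    a = rng.normal(size=(n, 1000))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        a = np.maximum(W @ a, 0.0)      # ReLU zeroes out roughly half the values
    return (a ** 2).mean()

print(relu_stack(np.sqrt(2.0 / n)))     # He scaling: signal magnitude stays around 1
print(relu_stack(np.sqrt(1.0 / n)))     # 1/n_in scaling: magnitude halves at every layer
```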