Batch Norm and L2 Regularization


2019, Sep 03    


  • Batch Norm and L2 are both regularization methods that help prevent overfitting, so you might think it’s a good idea to use them together.
  • However, the effect of batch norm cancels out the penalty that L2 is supposed to impose on the weights.
  • It’s okay to use both, and sometimes the combination does give better results, but they do not work as regularizers together.


What is Batch Norm

As the name suggests, Batch Normalization normalizes a layer’s inputs using the mean and variance computed over each batch of training data. It is applied to the pre-activation input of a layer, before the nonlinearity (activation).

BN(x) = γ (x − µ) / σ + β

where the mean µ and standard deviation σ are computed over a batch X of training data. The extra learnable parameters γ and β are needed to still be able to represent all possible ranges of inputs to the activation.

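To make the transform concrete, here is a minimal NumPy sketch of a training-mode batch norm layer (no running statistics and no backward pass; the function and variable names are my own):

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale by gamma and shift by beta."""
    mu = Z.mean(axis=0)                # per-feature batch mean
    sigma = Z.std(axis=0)              # per-feature batch standard deviation
    Z_hat = (Z - mu) / (sigma + eps)   # zero mean, unit variance per feature
    return gamma * Z_hat + beta        # learnable scale and shift

# Pre-activation values for a batch of 4 examples with 3 features (toy numbers)
Z = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 2.0],
              [3.0, 1.0, 0.0]])
print(batch_norm(Z, gamma=np.ones(3), beta=np.zeros(3)))
```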

What is L2 regularization

L2 regularization is a technique that adds a penalty term to the loss function, so the penalty is minimized together with the original training loss.

Normally there would be a derivation of the gradient of this loss function here, but I am not going to bore you with the math :) I think the intuition behind L2 is pretty straightforward: the sum of the squares of the weights is added to the loss function as a penalty term to be minimized. So in order to minimize the loss, the scale of the weights has to stay small, and on every update the weights decay proportionally towards zero, by a small factor proportional to the regularization strength λ. This is why this technique is also known as “weight decay”.
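If you do want to see the decay written out, here is a tiny NumPy sketch (plain gradient descent, with toy numbers and names of my own) showing that one gradient step on the L2-penalized loss is the same as first shrinking the weights by a factor of (1 − ηλ) and then taking the usual step:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)            # current weights
grad_data = rng.normal(size=5)    # stand-in for the gradient of the unregularized loss
lr, lam = 0.1, 0.01               # learning rate eta and L2 strength lambda

# Gradient step on loss + (lam/2) * ||w||^2: the penalty contributes lam * w to the gradient.
w_l2 = w - lr * (grad_data + lam * w)

# The same step written as explicit "weight decay": shrink w, then take the usual step.
w_decay = (1 - lr * lam) * w - lr * grad_data

print(np.allclose(w_l2, w_decay))  # True
```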


What happens if they are used at the same time?

What happens when we try to use an L2 penalty term in the objective with batch normalization present? To first order, the weight decay from the L2 penalty no longer has an influence on the output of the neural net. With a little thought, this should not be surprising: since batch norm makes the output invariant to the scale of the previous activations, and the scale of the previous activations is linearly related to the scale of the model weights, the output is now invariant to weight decay’s scaling of those weights.
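A quick way to see this invariance numerically is the toy NumPy sketch below (γ = 1, β = 0, and the small ε term dropped so the comparison is exact; the setup is my own): shrink the weights of a linear layer by any positive factor, as weight decay would, and the batch-normalized output does not move.

```python
import numpy as np

def batch_norm(Z):
    # gamma = 1, beta = 0, no eps: just standardize each feature over the batch
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))    # a batch of 32 inputs with 10 features
W = rng.normal(size=(10, 4))     # weights of a linear layer

out = batch_norm(X @ W)
out_shrunk = batch_norm(X @ (0.1 * W))   # the same weights after heavy "decay"

print(np.allclose(out, out_shrunk))      # True: batch norm hides the scale of W
```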

Normalization, whether Batch Normalization, Layer Normalization, or Weight Normalization, makes the learned function invariant to scaling of the weights w. This scaling is strongly affected by regularization. We know of no first-order gradient method that can fully eliminate this effect. However, a direct solution of forcing ‖w‖ = 1 solves the problem. By doing this we also remove one hyperparameter from the training procedure.
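One simple way to realize the ‖w‖ = 1 constraint is to re-project each unit’s weight vector back onto the unit sphere after every optimizer step. The sketch below is my own minimal illustration of that idea, not a procedure taken from any particular paper.

```python
import numpy as np

def project_to_unit_norm(W):
    """Rescale each column (one unit's incoming weights) to norm 1."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 4))

# ... after every gradient / weight-decay update:
W = project_to_unit_norm(W)
print(np.linalg.norm(W, axis=0))   # [1. 1. 1. 1.]
```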

As noted by Salimans & Kingma (2016), the effect of weight and batch normalization on the effective learning rate is not necessarily bad. If no regularization is used, then the norm of the weights tends to increase over time, and so the effective learning rate decreases. Often that is a desirable thing, and many training methods lower the learning rate explicitly. However, the decrease of the effective learning rate can be hard to control, and can depend a lot on the initial steps of training, which makes it harder to reproduce results.
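The mechanism is easy to check numerically: for a loss that sees the weights only through batch norm, the gradient shrinks in proportion to the weight norm, so a fixed learning rate effectively takes smaller steps as ‖w‖ grows. The finite-difference sketch below is entirely my own toy setup and just illustrates that scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))    # a batch of inputs
y = rng.normal(size=32)          # toy regression targets

def bn(z):
    return (z - z.mean()) / z.std()

def loss(w):
    # Squared error on a batch-normalized linear unit; invariant to the scale of w.
    return np.mean((bn(X @ w) - y) ** 2)

def num_grad(w, h=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (loss(w + e) - loss(w - e)) / (2 * h)
    return g

w = rng.normal(size=10)
print(np.linalg.norm(num_grad(w)))       # gradient norm at w
print(np.linalg.norm(num_grad(3 * w)))   # roughly 3x smaller at 3w: bigger weights, smaller steps
```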

References

Salimans, T., & Kingma, D. P. (2016). Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NeurIPS 2016.

van Laarhoven, T. (2017). L2 Regularization versus Batch and Weight Normalization. arXiv preprint.