short for Root Mean Square Layer Normalization
RMSNorm is a simplification of the original layer normalization
(LayerNorm).
LayerNorm is a regularization technique that might handle the internal covariate shift
issue so as to
stabilize the layer activations and improve model convergence.
It has been proved quite successful in NLP-based model. In some cases,
LayerNorm has become an essential component to enable model optimization, such as in the SOTA NMT model Transformer
.
One application of LayerNorm is on recurrent neural networks. Nonetheless, we observe that LayerNorm raises computational overhead per running step, which diminishes the net efficiency gain from faster and more stable training, as shown in the Figure below.
Training procedure of a GRU-based RNNSearch for the first 10k training steps. Baseline means the original model without any normalization. When the Baseline training loss arrives at 7.0, the loss of LayerNorm reaches 5.4 after the same number of training steps (left figure), but only 5.9 after the same training time (right figure).
RMSNorm simplifies LayerNorm by removing the mean-centering operation, or normalizing layer activations with RMS statistic:
$$ \begin{align} \begin{split} & \bar{a}i = \frac{a_i}{\text{RMS}(\mathbf{a})} g_i, \quad \text{where}~~ \text{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n} \sum{i=1}^{n} a_i^2}. \end{split}\nonumber \end{align} $$
When the mean of the inputs is exactly 0, then LayerNorm equals to RMSNorm. We also observe that the RMS statistic can be estimated from partial inputs, based on the iid assumption. Below shows the comparision of LayerNorm and RMSNorm in different properties.
Weight matrix re-scaling | Weight matrix re-centering | Weight Vector re-scaling | Dataset re-scaling | Dataset re-centering | Single training case re-scaling | |
---|---|---|---|---|---|---|
BatchNorm | ☑ | ☒ | ☑ | ☑ | ☑ | ☒ |
WeightNorm | ☑ | ☒ | ☑ | ☒ | ☒ | ☒ |
LayerNorm | ☑ | ☑ | ☒ | ☑ | ☒ | ☑ |
RMSNorm | ☑ | ☒ | ☒ | ☑ | ☒ | ☑ |
pRMSNorm | ☑ | ☒ | ☒ | ☑ | ☒ | ☑ |
As RMSNorm does not consider the mean of the inputs, it's not re-centering invariant. This is the main difference compared to LayerNorm.
But, does it matter abandoning the re-centering invariant property? or does the re-centering invariant property help improve the robustness of LayerNorm? We did an experiment on RNNSearch with Nematus, where we initialize the weights with a center of about 0.2. The figure below suggests that removing re-centering operation in RMSNorm does not hurt its stability.
SacreBLEU score curve of LayerNorm and RMSNorm on newstest2013 (devset) when the initialization center is 0.2.
The codes rely on the following packages:
- Python2.7
- Numpy
- Tensorflow
- Theano
- Scipy
- lasagne
We provide separate code:
layers.py
: This is used for Theano-related experiments.rmsnorm_tensorflow/torch.py
: an implementation for Tensorflow or Pytorch use.
We did experiments on four different tasks, including different neural models (RNN/CNN/Transformer), different non-linear activations (linear/sigmoid/tanh/relu), different weight initializations(normal/uniform/orthogonal), and different deep learning frameworks (Theano/Pytorch/Tensorflow). Our experiments involve NLP-related and Image-related tasks. Most of the settings follows those in LayerNorm paper. But from our view, we put more focus on machine translation.
The machine translation experiments are based on Nematus(v0.3). To run experiments with RMSNorm:
- clone the github repository and checkout the specific version.
git clone https://github.com/EdinburghNLP/nematus.git cd nematus git checkout tags/v0.3
- change the implementation of
LayerNormLayer
inlayers.py
as follows:class LayerNormLayer(object): def __init__(self, layer_size, eps=1e-5): # TODO: If nematus_compat is true, then eps must be 1e-5! self.new_std = tf.get_variable('new_std', [layer_size], initializer=tf.constant_initializer(1)) self.eps = eps self.layer_size = layer_size def forward(self, x): # m, v = tf.nn.moments(x, axes=[-1], keep_dims=True) # std = tf.sqrt(v + self.eps) # norm_x = (x-m)/std # new_x = norm_x*self.new_std + self.new_mean # return new_x ms = tf.reduce_sum(tf.square(x), axis=-1, keep_dims=True) * 1./self.layer_size norm_inputs = x * tf.rsqrt(ms + self.eps) return norm_inputs * self.new_std
- train your model following the instructions in Nematus, such as wmt17_systems.
To ease the training of RNNSearch, we also provide the used/preprocessed dataset & training script & pretrained model.
- About the Theano-experiments: You can download the Theano-version Nematus
You need chang the
layer_norm
function inlayers.py
as follows:def layer_norm(x, b, s): _eps = numpy_floatX(1e-5) norm_x = tensor.mean(x * x, axis=-1, keepdims=True) output = x / tensor.sqrt(norm_x + _eps) if x.ndim == 3: output = s[None, None, :] * output + b[None, None,:] else: output = s[None, :] * output + b[None,:] return output
b
here is deletable. The training follows a similar way as above.
- The RMSNorm for Transformer is implemented in zero.
We experiment with the bidirectional attentive reader model proposed by Hermann et al. We use the attentive reader model from the repository given by Tim Coojimans et al..
Please follow the steps below:
- Clone the above repository and obtain the data:
wget http://www.cs.toronto.edu/~rkiros/top4.zip
- In codes/att_reader/pkl_data_iterator.py set vdir to be the directory you unzipped the data
- Add the RMSNorm function to
layers.py
incodes/att_reader/layers.py
- Add the rlnlstm_layer and param_init_rlnlstm functions to
layers.py
incodes/att_reader/layers.py
- Add 'rlnlstm': ('param_init_rlnlstm', 'rlnlstm_layer'), to
layers
- Follow the instructions for training a new model
We train the model using the following command
GPUARRAY_FORCE_CUDA_DRIVER_LOAD=True THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,gpuarray.preallocate=0.8 python -u train_attentive_reader.py \
--use_dq_sims 1 --use_desc_skip_c_g 0 --dim 240 --learn_h0 1 --lr 8e-5 --truncate -1 --model "lstm_s1.npz" --batch_size 64 --optimizer "adam" --validFreq 1000 --model_dir $MDIR --use_desc_skip_c_g 1 --unit_type rlnlstm --use_bidir 1
Below are the log files from the model trained using RMSNorm:
wget http://data.statmt.org/bzhang/neurips19_rmsnorm/attentive_reader/stats_lstm_s1.npz.pkl
We experiment with order-embedding model proposed by Vendro et al. The code used is available here.
Please follow the steps below:
- Clone the above repository
- Add the RMSNorm function to
layers.py
in the order-embeddings repo - Add the rlngru_layer and param_init_rlngru functions to
layers.py
in the order-embeddings repo - Add 'rlngru': ('param_init_rlngru', 'rlngru_layer'), to
layers
- In
driver.py
, replace 'encoder': 'gru' with 'encoder': 'rlngru' - Follow the instructons on the main page to train a model
Available below is a download to the model used to report results in the paper:
wget http://data.statmt.org/bzhang/neurips19_rmsnorm/oe/order.npz
wget http://data.statmt.org/bzhang/neurips19_rmsnorm/oe/order.pkl
Once downloaded, follow the instructions on the main page for evaluating models. Notice that
please change the prefix for rlngru
model to lngru
to use the saved models.
We experiment with the ConvPool-CNN-C architecture proposed by Krizhevsky and Hinton, and follow the settings in WeightNorm. We use the implementation here.
Please follow the steps below:
-
Clone the above repository
-
Add RMSNorm function to
nn.py
as follows:class RMSNormLayer(lasagne.layers.Layer): def __init__(self, incoming, b=lasagne.init.Constant(0.), g=lasagne.init.Constant(1.), W=lasagne.init.Normal(0.05), nonlinearity=relu, **kwargs): super(RMSNormLayer, self).__init__(incoming, **kwargs) self.nonlinearity = nonlinearity k = self.input_shape[1] if b is not None: self.b = self.add_param(b, (k,), name="b", regularizable=False) if g is not None: self.g = self.add_param(g, (k,), name="g") if len(self.input_shape)==4: self.axes_to_sum = (2,3) self.dimshuffle_args = ['x',0,'x','x'] else: self.axes_to_sum = 1 self.dimshuffle_args = ['x',0] def get_output_for(self, input, **kwargs): meanS = T.mean(input ** 2,axis=self.axes_to_sum,keepdims=True) norm_input = input / T.sqrt(meanS + 1e-6) if hasattr(self, 'g'): activation = norm_input*self.g.dimshuffle(*self.dimshuffle_args) else: activation = norm_input if hasattr(self, 'b'): activation += self.b.dimshuffle(*self.dimshuffle_args) return self.nonlinearity(activation) def rms_norm(layer, b=lasagne.init.Constant(0.), g=lasagne.init.Constant(1.), **kwargs): nonlinearity = getattr(layer, 'nonlinearity', None) if nonlinearity is not None: layer.nonlinearity = lasagne.nonlinearities.identity if hasattr(layer, 'b'): del layer.params[layer.b] layer.b = None return RMSNormLayer(layer, b, g, nonlinearity=nonlinearity, **kwargs)
-
Add the option into
train.py
:elif args.norm_type=='rms_norm': normalizer = lambda l: nn.rms_norm(l)
-
Download the ciFar dataset and set the train/test_data direction in
train.py
to your ciFar directorywget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
We use the following command to train the model:
GPUARRAY_FORCE_CUDA_DRIVER_LOAD=True THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,gpuarray.preallocate=0.4 python train.py --norm_type rms_norm --learning_rate 0.003
Running los of our model can be downloaded as below:
wget http://data.statmt.org/bzhang/neurips19_rmsnorm/cifar/results.csv
If you find the codes useful, please consider cite the following paper:
Biao Zhang; Rico Sennrich (2019). Root Mean Square Layer Normalization. In Advances in Neural Information Processing Systems 32. Vancouver, Canada.
@inproceedings{zhang-sennrich-neurips19,
address = "Vancouver, Canada",
author = "Zhang, Biao and Sennrich, Rico",
booktitle = "Advances in Neural Information Processing Systems 32",
url = "https://openreview.net/references/pdf?id=S1qBAf6rr",
title = "{Root Mean Square Layer Normalization}",
year = "2019"
}
Please feel free to contact me for any questions about our paper.