For a model with adjacent weight tying, as in section 2.2.1 of the paper, the gradients go to NaN after a while.
The model is trained on bAbI (the 1k dataset). I tried lowering the learning rate from 1e-2 to 1e-5, but that didn't help.
The parameters are initialized according to section 4.2 of the paper: the weights A, C, T_A (temporal encoding), and T_C are initialized from a Gaussian with mean 0 and std 0.1. The number of hops is set to 3, the maximum gradient norm is 40, the batch size is 32, and the embedding dimension is 40.
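For concreteness, here is a minimal sketch of that initialization. PyTorch is assumed (the actual implementation is not shown in this issue), and the vocabulary and memory sizes below are placeholders, not values from the report:

```python
import torch
import torch.nn as nn

# Placeholder sizes; the real values depend on the bAbI task being trained.
vocab_size = 200   # assumption: task vocabulary size
mem_size = 50      # assumption: number of memory slots (sentences)
embed_dim = 40     # embedding dimension from the setup above
hops = 3           # number of hops from the setup above

# Embedding matrices A, C and temporal encodings T_A, T_C, all drawn from
# a Gaussian with mean 0 and std 0.1, as described in section 4.2.
# With adjacent weight tying, A and T_A are only used directly in hop 1;
# later hops reuse the previous hop's C / T_C.
A   = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.1)
C   = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.1)
T_A = nn.Parameter(torch.randn(mem_size, embed_dim) * 0.1)
T_C = nn.Parameter(torch.randn(mem_size, embed_dim) * 0.1)
```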
During training, the gradients of A and T_A become NaN after about 10 epochs; this doesn't happen for C and T_C. The learning rate is annealed by a factor of 0.5 every 15 epochs.
What can I try to address the NaN gradients of A and T_A? These weights are used only during the first hop.
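Not part of the original report, but one thing worth trying is logging per-parameter gradient norms each step, to see whether the gradients of A and T_A grow gradually or jump straight to NaN. A rough sketch, assuming the PyTorch setup above and a hypothetical `model` / `step` from your training loop:

```python
import torch

def report_gradients(named_params, step):
    """Log each parameter's gradient norm and flag NaN/Inf values."""
    for name, p in named_params:
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            print(f"step {step}: gradient of {name} contains NaN/Inf")
        else:
            norm = p.grad.norm().item()
            if norm > 40.0:  # the maximum gradient norm from the setup above
                print(f"step {step}: gradient of {name} has large norm {norm:.1f}")

# Typical use, after loss.backward() and before optimizer.step():
#   report_gradients(model.named_parameters(), step)
```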
Section 4.2 of the paper also says: "On some tasks, we observed a large variance in the performance of our model (i.e. sometimes failing badly, other times not, depending on the initialization). To remedy this, we repeated each training 10 times with different random initializations, and picked the one with the lowest training error."
What were the other initializations that worked for you?
Is this happening in your implementation? If so, I can't really help. Getting NaN is pretty common, and usually it means there is a bug somewhere. Have you tried clipping your gradients? That usually helps. As for the second question, we are talking about variance in performance, not a numerical instability like NaN. I don't think we got NaN in our implementation.
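For reference, the suggested gradient clipping is a single call in PyTorch (assumed here, since the thread doesn't name the framework), reusing the maximum gradient norm of 40 from the original setup. The `nn.Linear` stand-in below is only there to make the snippet runnable; substitute your own model and optimizer:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and optimizer, just to illustrate the call.
model = nn.Linear(40, 40)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loss = model(torch.randn(32, 40)).sum()
loss.backward()

# Rescale the global gradient norm to at most 40 before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)
optimizer.step()
```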