For a model with adjacent weight tying, as in section 2.2.1 of the paper, the gradients go to NaN after a while.
The model is trained on bAbI (the 1k dataset). I tried lowering the learning rate from 1e-2 to 1e-5, but that didn't help.
The parameters are initialized according to section 4.2 of the paper: the weights A, C, T_A (temporal encoding), and T_C are initialized from a Gaussian with mean 0 and std 0.1. The number of hops is set to 3, the maximum gradient norm is 40, the batch size is 32, and the embedding dimension is 40.
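For concreteness, here is a minimal sketch of that initialization. PyTorch is assumed (the actual implementation is not shown in this issue), and the vocabulary and memory sizes below are placeholders, not values from the report:

```python
import torch
import torch.nn as nn

# Placeholder sizes; the real values depend on the bAbI task being trained.
vocab_size = 200   # assumption: task vocabulary size
mem_size = 50      # assumption: number of memory slots (sentences)
embed_dim = 40     # embedding dimension from the setup above
hops = 3           # number of hops from the setup above

# Embedding matrices A, C and temporal encodings T_A, T_C, all drawn from
# a Gaussian with mean 0 and std 0.1, as described in section 4.2.
# With adjacent weight tying, A and T_A are only used directly in hop 1;
# later hops reuse the previous hop's C / T_C.
A   = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.1)
C   = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.1)
T_A = nn.Parameter(torch.randn(mem_size, embed_dim) * 0.1)
T_C = nn.Parameter(torch.randn(mem_size, embed_dim) * 0.1)
```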
During training, the gradients of A and T_A become NaN after about 10 epochs; this doesn't happen for C and T_C. The learning rate is annealed by a factor of 0.5 every 15 epochs.
What can I try to address the NaN gradients of A and T_A? These weights are used only during the first hop.
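Not part of the original report, but one thing worth trying is logging per-parameter gradient norms each step, to see whether the gradients of A and T_A grow gradually or jump straight to NaN. A rough sketch, assuming the PyTorch setup above and a hypothetical `model` / `step` from your training loop:

```python
import torch

def report_gradients(named_params, step):
    """Log each parameter's gradient norm and flag NaN/Inf values."""
    for name, p in named_params:
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            print(f"step {step}: gradient of {name} contains NaN/Inf")
        else:
            norm = p.grad.norm().item()
            if norm > 40.0:  # the maximum gradient norm from the setup above
                print(f"step {step}: gradient of {name} has large norm {norm:.1f}")

# Typical use, after loss.backward() and before optimizer.step():
#   report_gradients(model.named_parameters(), step)
```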
Section 4.2 of the paper also says: "On some tasks, we observed a large variance in the performance of our model (i.e. sometimes failing badly, other times not, depending on the initialization). To remedy this, we repeated each training 10 times with different random initializations, and picked the one with the lowest training error."
What were the other initializations that worked for you?
Is this happening in your implementation? If so, I can't really help. Getting NaN is pretty common, and usually it means there is a bug somewhere. Have you tried clipping your gradients? That usually helps. As for the second question, we are talking about variance in performance, not a numerical instability like NaN. I don't think we got NaN in our implementation.
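For reference, the suggested gradient clipping is a single call in PyTorch (assumed here, since the thread doesn't name the framework), reusing the maximum gradient norm of 40 from the original setup. The `nn.Linear` stand-in below is only there to make the snippet runnable; substitute your own model and optimizer:

```python
import torch
import torch.nn as nn

# Tiny stand-in model and optimizer, just to illustrate the call.
model = nn.Linear(40, 40)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loss = model(torch.randn(32, 40)).sum()
loss.backward()

# Rescale the global gradient norm to at most 40 before the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)
optimizer.step()
```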