run_pretraining.py - clip gradient error: Found Inf or NaN global norm: Tensor had NaN values #82
Comments
Setting a lower batch_size made it run OK.
@zkl99999 Do you know what causes the error?
I am seeing the same error.
I think I just realized what the problem might be: are you using a different vocabulary but the same bert_config.json? If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?
Fantastic, you are right. When I used the same bert_config.json but changed the vocab file (which creates a mismatch between the vocab_size in bert_config.json and the true vocab size), the error happened; after fixing that, it is gone. Thanks very much.
Cool, I will make sure to add this in bold in the pre-training section of the README.
Hi! I get exactly the same error after global_step=110000 (so a misconfiguration seems very unlikely). I did shrink my vocabulary to 16k tokens, but I fixed bert_config.json accordingly and still get the error.
I have the same error.
It's odd; in my experiment, after fixing the config, the error didn't happen again (trained for more than 6 million steps).
I recently ran into this problem too. How did you fix it? Did you change the size of vocab.txt, or something else?
I changed "vocab_size" in bert_config.json.
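For reference, a minimal sanity check for this mismatch might look like the following (a sketch, assuming a standard one-token-per-line vocab.txt; the file paths are placeholders for your own setup):

```python
import json

# Count tokens in the vocab file (one token per line, as produced by
# the standard BERT tokenization scripts).
with open("vocab.txt", "r", encoding="utf-8") as f:
    true_vocab_size = sum(1 for _ in f)

# Read the vocab_size the model was configured with.
with open("bert_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# The two numbers must match; otherwise the embedding lookup and the
# masked-LM output layer can go out of bounds and the loss becomes NaN.
assert config["vocab_size"] == true_vocab_size, (
    "bert_config.json vocab_size (%d) != lines in vocab.txt (%d)"
    % (config["vocab_size"], true_vocab_size))
```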
But I still get this problem after changing the JSON file's vocab_size to match the size of the vocab file.
For now I can't tell you why it happens. I will check my code, and if I find something I will reply here.
All right, thanks!
Still don't understand! Which parameter did you change? {
@yunchaosuper I changed "vocab_size".
@xwzhong So you changed vocab_size from 21128 to what? Kindly help with that.
Hi Jacob, I am using pretrained BERT together with other networks, but during fine-tuning I also hit this NaN global norm problem. What do you mean by out-of-bounds lookups? The dataset I use does have OOV words, but what causes the NaN global norm? Only when all the tokens in the sentence are unknown words? Thanks in advance.
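For what it's worth, an "out-of-bounds lookup" here means a token id greater than or equal to vocab_size reaching the embedding table. One way to scan your data for such ids before training (a rough sketch; the feature names match what create_pretraining_data.py writes, while the file path and VOCAB_SIZE are placeholders):

```python
import tensorflow as tf

VOCAB_SIZE = 21128  # must equal vocab_size in bert_config.json

# Walk a pretraining TFRecord file and report any token id that would
# index past the embedding table (an out-of-bounds lookup).
for record in tf.python_io.tf_record_iterator("tf_examples.tfrecord"):
    example = tf.train.Example.FromString(record)
    input_ids = example.features.feature["input_ids"].int64_list.value
    masked_lm_ids = example.features.feature["masked_lm_ids"].int64_list.value
    bad = [i for i in list(input_ids) + list(masked_lm_ids)
           if i < 0 or i >= VOCAB_SIZE]
    if bad:
        print("out-of-range token ids:", bad)
```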
@ohwe Were you able to solve the problem? After 110000 steps, the NaN error happened.
I am pre-training BERT with a large amount of data. After 110000 steps the loss is around 1.4, and then the run dies with a traceback (Traceback (most recent call last):). Can someone help, or have any idea? @xwzhong @zkl99999 @mleonrivas @jacobdevlin-google @ohwe
I faced the same error. It might be related to the learning rate.
I added a tensor for the additional vocabulary's embeddings, concatenated it with the original embedding tensor for the tokens in the original vocab file, and the problem was solved. The reason might be that the tf.gather call in the embedding_lookup function has no rows for the additional vocabulary ids, so there is no embedding for the additional tokens, and predictions for masked additional tokens cannot match those input tokens.
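That workaround would look roughly like this (a sketch of the idea, not the commenter's exact code; extended_embedding_lookup and extra_vocab_size are made-up names, and the variable names are assumptions about how the checkpoint is restored):

```python
import tensorflow as tf

def extended_embedding_lookup(input_ids, vocab_size, extra_vocab_size,
                              embedding_size=768, initializer_range=0.02):
  """Embedding lookup whose table covers both the original vocab and
  extra_vocab_size newly added tokens (ids vocab_size and above)."""
  # Original table, restorable from the BERT checkpoint by name.
  base_table = tf.get_variable(
      "word_embeddings",
      shape=[vocab_size, embedding_size],
      initializer=tf.truncated_normal_initializer(stddev=initializer_range))
  # Freshly initialized rows for the additional tokens.
  extra_table = tf.get_variable(
      "extra_word_embeddings",
      shape=[extra_vocab_size, embedding_size],
      initializer=tf.truncated_normal_initializer(stddev=initializer_range))
  # Concatenate so every id in the enlarged vocabulary has a row;
  # without this, ids >= vocab_size make tf.gather go out of bounds.
  full_table = tf.concat([base_table, extra_table], axis=0)
  output = tf.gather(full_table, input_ids)
  return output, full_table
```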
Hi, I get an InvalidArgumentError (Found Inf or NaN global norm: Tensor had NaN values) when running run_pretraining.py.
Using my own data, I set the parameters as follows:
train batch size: 32
max seq length: 64 (99% of articles have 46 or fewer words)
max predictions per seq: 10
learning rate: 2e-5
At the beginning I googled it; someone said to use a smaller learning rate, but I found that this only delays the InvalidArgumentError, so I don't think the learning rate is the root cause. I also tried tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) as suggested, but sadly I still get the same error.
Tracing the error: (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) -> clip_ops.py line 259; it shows the global_norm calculation failing.
Why do you think the error happens? Did you run into it yourself?
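If you want to find out which gradient is producing the NaN before clip_by_global_norm raises, one debugging option (a sketch, assuming a TF1 graph built like optimization.py; grads and tvars are placeholders for whatever your training script already has):

```python
import tensorflow as tf

# grads and tvars come from, e.g., grads = tf.gradients(loss, tvars),
# as in optimization.py. This is only a debugging aid, not a fix.
checked_grads = []
for g, v in zip(grads, tvars):
  if g is not None:
    if isinstance(g, tf.IndexedSlices):
      # Embedding-lookup gradients arrive as IndexedSlices; check their
      # values tensor instead of densifying the whole gradient.
      checked = tf.check_numerics(
          g.values, "NaN/Inf in gradient for %s" % v.name)
      g = tf.IndexedSlices(checked, g.indices, g.dense_shape)
    else:
      # check_numerics raises InvalidArgumentError at the offending op,
      # so the variable name in the message tells you where the NaN is.
      g = tf.check_numerics(g, "NaN/Inf in gradient for %s" % v.name)
  checked_grads.append(g)

(clipped_grads, _) = tf.clip_by_global_norm(checked_grads, clip_norm=1.0)
```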