run_pretraining.py - clip gradient error: Found Inf or NaN global norm: Tensor had NaN values #82
Comments
Setting a lower batch_size made it run OK.
@zkl99999 Do you know what causes the error?
I am seeing the same error.
I think I just realized what the problem might be: are you using a different vocabulary but the same bert_config.json? If you are using the same vocabulary, are you starting from the BERT checkpoint or from scratch?
Fantastic, you are right. When I used the same bert_config.json but changed the vocab file (which creates a mismatch between the vocab_size in bert_config.json and the true vocab size), the error happened; after fixing that, it is gone. Thanks very much.
Cool, I will make sure to add this in bold in the pre-training section of the README.
Hi! I get exactly the same error after global_step=110000 (so a misconfiguration seems very unlikely). I did shrink my vocabulary to 16k tokens, but I fixed bert_config.json accordingly and still get the error.
I have the same error.
It's odd; in my experiment, after fixing the config, the error didn't happen again (trained for more than 6 million steps).
I recently ran into this problem too. How did you fix it? Did you change the size of vocab.txt, or something else?
I changed "vocab_size" in bert_config.json.
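For reference, a minimal sanity check for this mismatch might look like the following (a sketch, assuming a standard one-token-per-line vocab.txt; the file paths are placeholders for your own setup):

```python
import json

# Count tokens in the vocab file (one token per line, as produced by
# the standard BERT tokenization scripts).
with open("vocab.txt", "r", encoding="utf-8") as f:
    true_vocab_size = sum(1 for _ in f)

# Read the vocab_size the model was configured with.
with open("bert_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# The two numbers must match; otherwise the embedding lookup and the
# masked-LM output layer can go out of bounds and the loss becomes NaN.
assert config["vocab_size"] == true_vocab_size, (
    "bert_config.json vocab_size (%d) != lines in vocab.txt (%d)"
    % (config["vocab_size"], true_vocab_size))
```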
But I still get this problem after changing the JSON file's vocab_size to match the size of the vocab file.
For now I can't tell you why it happens. I will check my code, and if I find something I will reply here.
All right, thanks!
Still don't understand! Which parameter did you change? {
@yunchaosuper I changed "vocab_size".
@xwzhong So you changed vocab_size from 21128 to what? Kindly help with that.
Hi Jacob, I am using pretrained BERT together with other networks, but during fine-tuning I also hit this NaN global norm problem. What do you mean by out-of-bounds lookups? The dataset I use does have OOV words, but what causes the NaN global norm? Only when all the tokens in the sentence are unknown words? Thanks in advance.
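For what it's worth, an "out-of-bounds lookup" here means a token id greater than or equal to vocab_size reaching the embedding table. One way to scan your data for such ids before training (a rough sketch; the feature names match what create_pretraining_data.py writes, while the file path and VOCAB_SIZE are placeholders):

```python
import tensorflow as tf

VOCAB_SIZE = 21128  # must equal vocab_size in bert_config.json

# Walk a pretraining TFRecord file and report any token id that would
# index past the embedding table (an out-of-bounds lookup).
for record in tf.python_io.tf_record_iterator("tf_examples.tfrecord"):
    example = tf.train.Example.FromString(record)
    input_ids = example.features.feature["input_ids"].int64_list.value
    masked_lm_ids = example.features.feature["masked_lm_ids"].int64_list.value
    bad = [i for i in list(input_ids) + list(masked_lm_ids)
           if i < 0 or i >= VOCAB_SIZE]
    if bad:
        print("out-of-range token ids:", bad)
```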
@ohwe Were you able to solve the problem? After 110000 steps, the NaN error happened.
I am pre-training BERT with a large amount of data. After 110000 steps the loss is around 1.4, and then the run dies with a traceback (Traceback (most recent call last):). Can someone help, or have any idea? @xwzhong @zkl99999 @mleonrivas @jacobdevlin-google @ohwe
I faced the same error. It might be related to the learning rate.
I added a tensor for the additional vocabulary's embeddings, concatenated it with the original embedding tensor for the tokens in the original vocab file, and the problem was solved. The reason might be that the tf.gather call in the embedding_lookup function has no rows for the additional vocabulary ids, so there is no embedding for the additional tokens, and predictions for masked additional tokens cannot match those input tokens.
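That workaround would look roughly like this (a sketch of the idea, not the commenter's exact code; extended_embedding_lookup and extra_vocab_size are made-up names, and the variable names are assumptions about how the checkpoint is restored):

```python
import tensorflow as tf

def extended_embedding_lookup(input_ids, vocab_size, extra_vocab_size,
                              embedding_size=768, initializer_range=0.02):
  """Embedding lookup whose table covers both the original vocab and
  extra_vocab_size newly added tokens (ids vocab_size and above)."""
  # Original table, restorable from the BERT checkpoint by name.
  base_table = tf.get_variable(
      "word_embeddings",
      shape=[vocab_size, embedding_size],
      initializer=tf.truncated_normal_initializer(stddev=initializer_range))
  # Freshly initialized rows for the additional tokens.
  extra_table = tf.get_variable(
      "extra_word_embeddings",
      shape=[extra_vocab_size, embedding_size],
      initializer=tf.truncated_normal_initializer(stddev=initializer_range))
  # Concatenate so every id in the enlarged vocabulary has a row;
  # without this, ids >= vocab_size make tf.gather go out of bounds.
  full_table = tf.concat([base_table, extra_table], axis=0)
  output = tf.gather(full_table, input_ids)
  return output, full_table
```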
Hi, I get an InvalidArgumentError (Found Inf or NaN global norm: Tensor had NaN values) when running run_pretraining.py.
Using my own data, I set the parameters as follows:
train batch size: 32
max seq length: 64 (99% of articles have 46 or fewer words)
max predictions per seq: 10
learning rate: 2e-5
At the beginning I googled it; someone said to use a smaller learning rate, but I found that this only delays the InvalidArgumentError, so I don't think the learning rate is the root cause. I also tried tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y) as suggested, but sadly I still get the same error.
Tracing the error: (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) -> clip_ops.py line 259; it shows the global_norm calculation failing.
Why do you think the error happens? Did you run into it yourself?
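If you want to find out which gradient is producing the NaN before clip_by_global_norm raises, one debugging option (a sketch, assuming a TF1 graph built like optimization.py; grads and tvars are placeholders for whatever your training script already has):

```python
import tensorflow as tf

# grads and tvars come from, e.g., grads = tf.gradients(loss, tvars),
# as in optimization.py. This is only a debugging aid, not a fix.
checked_grads = []
for g, v in zip(grads, tvars):
  if g is not None:
    if isinstance(g, tf.IndexedSlices):
      # Embedding-lookup gradients arrive as IndexedSlices; check their
      # values tensor instead of densifying the whole gradient.
      checked = tf.check_numerics(
          g.values, "NaN/Inf in gradient for %s" % v.name)
      g = tf.IndexedSlices(checked, g.indices, g.dense_shape)
    else:
      # check_numerics raises InvalidArgumentError at the offending op,
      # so the variable name in the message tells you where the NaN is.
      g = tf.check_numerics(g, "NaN/Inf in gradient for %s" % v.name)
  checked_grads.append(g)

(clipped_grads, _) = tf.clip_by_global_norm(checked_grads, clip_norm=1.0)
```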