I tried to train the model using the default configuration file for Quora, which sets use_cudnn=true. Running SentenceMatchTrainer.py fails with an unexpected error:
(tensorflowGPU) D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src>python SentenceMatchTrainer.py --config_path "../configs/quora.sample.config"
Loading the configuration from ../configs/quora.sample.config
{'train_path': '../data/quora/train.tsv', 'dev_path': '../data/quora/dev.tsv',
'word_vec_path': '../data/quora/wordvec.txt', 'model_dir': 'quora_model', 'suffix': 'quora', 'fix_word_vec': True, 'isLower': True, 'max_sent_length': 50, 'max_char_per_word': 10,
'with_char': True, 'char_emb_dim': 20, 'char_lstm_dim': 40, 'batch_size': 60, 'max_epochs': 20, 'dropout_rate': 0.1, 'learning_rate': 0.0005, 'optimize_type': 'adam', 'lambda_l2': 0.0,
'grad_clipper': 10.0, 'context_layer_num': 1, 'context_lstm_dim': 100,
'aggregation_layer_num': 1, 'aggregation_lstm_dim': 100, 'with_full_match': True, 'with_maxpool_match': False, 'with_max_attentive_match': False, 'with_attentive_match': True,
'with_cosine': True, 'with_mp_cosine': True, 'cosine_MP_dim': 5, 'att_dim': 50, 'att_type': 'symmetric', 'highway_layer_num': 1,
'with_highway': True, 'with_match_highway': True,
'with_aggregation_highway': True, 'use_cudnn': True, 'with_moving_average': False}
Collecting words, chars and labels ...
Number of words: 104891
Number of chars: 1198
word_vocab shape is (106686, 300)
Number of labels: 2
Build SentenceMatchDataStream ...
Number of instances in trainDataStream: 384348
Number of batches in trainDataStream: 6406
Number of instances in devDataStream: 10000
Number of batches in devDataStream: 167
2019-05-30 00:41:22.120164: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-30 00:41:23.282409: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.48
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-05-30 00:41:23.289931: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2019-05-30 00:41:25.325066: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-30 00:41:25.329970: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0
2019-05-30 00:41:25.332505: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N
2019-05-30 00:41:25.337204: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4740 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
return fn(*args)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1305, in _run_fn
self._extend_graph()
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
Model/global_norm/L2Loss_38 (L2Loss)
[[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "SentenceMatchTrainer.py", line 257, in <module>
main(FLAGS)
File "SentenceMatchTrainer.py", line 191, in main
sess.run(initializer)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
run_metadata_ptr)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
run_metadata)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
Model/global_norm/L2Loss_38 (L2Loss)
[[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
Caused by op 'Model/global_norm/L2Loss_38', defined at:
File "SentenceMatchTrainer.py", line 257, in <module>
main(FLAGS)
File "SentenceMatchTrainer.py", line 175, in main
is_training=True, options=FLAGS, global_step=global_step)
File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 10, in __init__
self.create_model_graph(num_classes, word_vocab, char_vocab, is_training, global_step=global_step)
File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 175, in create_model_graph
grads, _ = tf.clip_by_global_norm(grads, self.options.grad_clipper)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 240, in clip_by_global_norm
use_norm = global_norm(t_list, name)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 179, in global_norm
half_squared_norms.append(gen_nn_ops.l2_loss(v))
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4679, in l2_loss
"L2Loss", t=t, name=name)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
op_def=op_def)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
Model/global_norm/L2Loss_38 (L2Loss)
[[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
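To make the colocation debug info in the trace above easier to read, here is a minimal sketch that extracts the op type to device mapping from the error message. The function name `parse_colocation_info` is hypothetical and the parsing is based only on the message layout shown in this log, so other TensorFlow versions may format it differently.

```python
import re

def parse_colocation_info(message):
    """Return {op_type: [device, ...]} from a 'Colocation Debug Info' block.

    Hypothetical helper: it relies only on the layout seen in the log above.
    """
    devices = {}
    in_types = False
    for raw in message.splitlines():
        line = raw.strip()
        if line.startswith("Colocation group had the following types and devices"):
            in_types = True
            continue
        if line.startswith("Colocation members"):
            break  # the op type list has ended
        if in_types:
            match = re.match(r"^(\w+):\s*(.*)$", line)
            if match:
                op_type, rest = match.groups()
                devices[op_type] = rest.split()  # empty list if no device is listed
    return devices

# Excerpt copied from the error message above.
message = """Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/global_norm/L2Loss_38 (L2Loss)"""

info = parse_colocation_info(message)
print(info)
```

For this log, the result shows GPU listed for CudnnRNNBackprop and no device listed for L2Loss, which is the mismatch the error complains about.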
When I set use_cudnn to false, training starts without any problems, and it still runs on the GPU. I understand from the code that use_cudnn=true enables CudnnLSTM, so the issue may be caused by the OS or the TensorFlow version. My environment is:
OS: Windows 10
Python: 3.6.8
tensorflow-gpu version: 1.8
GPU: GTX 1060 6 GB
Can you tell where the problem lies? In the meantime, I'll try running the program with the default config on an Ubuntu machine and see the results. Thanks!
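For anyone who wants to reproduce the workaround, here is a minimal sketch of switching use_cudnn off in a config. It assumes the config file is JSON, which is an assumption on my part since the log only shows the parsed dictionary; the helper name `disable_cudnn` and the shortened sample config are hypothetical.

```python
import json

def disable_cudnn(config_text):
    """Return the config as JSON text with use_cudnn set to False.

    Assumes the config is a flat JSON object (not confirmed by the log).
    """
    config = json.loads(config_text)
    config["use_cudnn"] = False  # fall back to the non-cuDNN LSTM path
    return json.dumps(config, indent=2)

# Shortened, illustrative config with a few of the keys printed in the log.
sample = '{"batch_size": 60, "use_cudnn": true, "with_moving_average": false}'
patched = disable_cudnn(sample)
print(patched)
```

The patched text could then be saved to a separate file and passed through --config_path, leaving the sample config untouched.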
TLfERLS changed the title from "Error while training while training with CuDNN arg set as True" to "Error while training with CuDNN arg set as True" on May 29, 2019