Bug: Batch size check breaks training with `--train_cudnn true` due to leftover Tensors in the check graph #2110

reuben · 2022-02-11T12:55:22Z

D Session closed.
I Dummy run finished without problems, now starting real training process.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/code/training/coqui_stt_training/train.py", line 723, in <module>
    main()
  File "/code/training/coqui_stt_training/train.py", line 693, in main
    train()
  File "/code/training/coqui_stt_training/train.py", line 332, in train
    train_impl(epochs=Config.epochs, silent_load=True)
  File "/code/training/coqui_stt_training/train.py", line 390, in train_impl
    iterator, optimizer, dropout_rates
  File "/code/training/coqui_stt_training/train.py", line 172, in get_tower_results
    iterator, dropout_rates, reuse=i > 0
  File "/code/training/coqui_stt_training/train.py", line 90, in calculate_mean_edit_distance_and_loss
    batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl
  File "/code/training/coqui_stt_training/deepspeech_model.py", line 232, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/code/training/coqui_stt_training/deepspeech_model.py", line 135, in rnn_impl_cudnn_rnn
    inputs=x, sequence_lengths=seq_length
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in converted code:
    relative to /usr/local/lib/python3.6/dist-packages/tensorflow_core:

    contrib/cudnn_rnn/python/layers/cudnn_rnn.py:440 call
        training)
    contrib/cudnn_rnn/python/layers/cudnn_rnn.py:518 _forward
        seed=self._seed)
    contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1132 _cudnn_rnn
        outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
    python/ops/gen_cudnn_rnn_ops.py:2051 cudnn_rnnv3
        time_major=time_major, name=name)
    python/framework/op_def_library.py:367 _apply_op_helper
        g = ops._get_graph_from_inputs(_Flatten(keywords.values()))
    python/framework/ops.py:5979 _get_graph_from_inputs
        _assert_same_graph(original_graph_element, graph_element)
    python/framework/ops.py:5914 _assert_same_graph
        (item, original_item))

    ValueError: Tensor("cudnn_lstm/opaque_kernel:0", dtype=float32_ref, device=/device:GPU:0) must be from the same graph as Tensor("tower_0/Reshape_2:0", shape=(?, ?, 2048), dtype=float32, device=/device:GPU:0).

reuben · 2022-02-11T12:55:39Z

Workaround is to comment out the batch size check:

STT/training/coqui_stt_training/train.py

Lines 324 to 331 in 49beaf5

    
    log_info("Performing dummy training to check for memory problems.")
    log_info(
        "If the following process crashes, you likely have batch sizes "
        "that are too big for your available system memory (or GPU memory)."
    )
    train_impl(epochs=1, reverse=True, limit=Config.train_batch_size * 3, write=False)
    log_info("Dummy run finished without problems, now starting real training process.")
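
The `ValueError` in the traceback is consistent with a layer object that was created while building the dummy-run graph being called again when the real training graph is built. The sketch below is a TensorFlow-free toy model of that mechanism. It is not the code from `train.py` or `deepspeech_model.py`, and it is not necessarily what the fix in c1041b9 does. `Graph`, `Tensor`, `CachedLayer`, `rnn_impl`, `build_model`, and `run_with_check` are hypothetical names, and the idea that the layer is cached between the two builds is an assumption that the traceback does not confirm.

```python
class Graph:
    """Stand-in for a TensorFlow graph; only its identity matters here."""

    def __init__(self, name):
        self.name = name


class Tensor:
    """Stand-in for a tensor that remembers which graph owns it."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph


class CachedLayer:
    """Mimics a layer that creates its weights on first call, then reuses them."""

    def __init__(self):
        self.kernel = None

    def __call__(self, inputs):
        if self.kernel is None:
            # Weights are created in the graph of the first input the layer sees.
            self.kernel = Tensor("opaque_kernel", inputs.graph)
        if self.kernel.graph is not inputs.graph:
            raise ValueError(
                f"{self.kernel.name} must be from the same graph as {inputs.name}"
            )
        return Tensor("output", inputs.graph)


def rnn_impl(inputs):
    """Hypothetical model function that caches its layer on a function attribute."""
    if rnn_impl.cell is None:
        rnn_impl.cell = CachedLayer()
    return rnn_impl.cell(inputs)


rnn_impl.cell = None


def build_model(graph_name):
    """Build the toy model inside a fresh graph."""
    graph = Graph(graph_name)
    return rnn_impl(Tensor("tower_0/Reshape_2", graph))


def run_with_check(reset_cache):
    """Build the model twice: once for the dummy check, once for real training."""
    rnn_impl.cell = None  # start from a clean state
    build_model("check_graph")
    if reset_cache:
        rnn_impl.cell = None  # drop the layer tied to the check graph
    try:
        build_model("train_graph")
        return "ok"
    except ValueError as err:
        return f"failed: {err}"


if __name__ == "__main__":
    print(run_with_check(reset_cache=False))
    print(run_with_check(reset_cache=True))
```

In this toy model, the second build fails only when the cached layer survives the first one. Clearing any state that holds references to the check graph before starting the real run avoids the error without dropping the batch size check itself.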

reuben added the bug label Feb 11, 2022

reuben closed this as completed in c1041b9 Feb 11, 2022

HarikalarKutusu mentioned this issue Apr 20, 2022 in "coqui_stt_training.train never finishes" #2195 (open)
