Bug: Batch size check breaks training with --train_cudnn true due to leftover Tensors in the check graph #2110

Closed
reuben opened this issue Feb 11, 2022 · 1 comment
Labels
bug Something isn't working

Comments

reuben commented Feb 11, 2022

D Session closed.
I Dummy run finished without problems, now starting real training process.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/code/training/coqui_stt_training/train.py", line 723, in <module>
    main()
  File "/code/training/coqui_stt_training/train.py", line 693, in main
    train()
  File "/code/training/coqui_stt_training/train.py", line 332, in train
    train_impl(epochs=Config.epochs, silent_load=True)
  File "/code/training/coqui_stt_training/train.py", line 390, in train_impl
    iterator, optimizer, dropout_rates
  File "/code/training/coqui_stt_training/train.py", line 172, in get_tower_results
    iterator, dropout_rates, reuse=i > 0
  File "/code/training/coqui_stt_training/train.py", line 90, in calculate_mean_edit_distance_and_loss
    batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl
  File "/code/training/coqui_stt_training/deepspeech_model.py", line 232, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/code/training/coqui_stt_training/deepspeech_model.py", line 135, in rnn_impl_cudnn_rnn
    inputs=x, sequence_lengths=seq_length
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in converted code:
    relative to /usr/local/lib/python3.6/dist-packages/tensorflow_core:

    contrib/cudnn_rnn/python/layers/cudnn_rnn.py:440 call
        training)
    contrib/cudnn_rnn/python/layers/cudnn_rnn.py:518 _forward
        seed=self._seed)
    contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1132 _cudnn_rnn
        outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
    python/ops/gen_cudnn_rnn_ops.py:2051 cudnn_rnnv3
        time_major=time_major, name=name)
    python/framework/op_def_library.py:367 _apply_op_helper
        g = ops._get_graph_from_inputs(_Flatten(keywords.values()))
    python/framework/ops.py:5979 _get_graph_from_inputs
        _assert_same_graph(original_graph_element, graph_element)
    python/framework/ops.py:5914 _assert_same_graph
        (item, original_item))

    ValueError: Tensor("cudnn_lstm/opaque_kernel:0", dtype=float32_ref, device=/device:GPU:0) must be from the same graph as Tensor("tower_0/Reshape_2:0", shape=(?, ?, 2048), dtype=float32, device=/device:GPU:0).
@reuben reuben added the bug Something isn't working label Feb 11, 2022
reuben commented Feb 11, 2022

The workaround is to comment out the batch size check, i.e. the dummy training run below:

log_info("Performing dummy training to check for memory problems.")
log_info(
    "If the following process crashes, you likely have batch sizes "
    "that are too big for your available system memory (or GPU memory)."
)
train_impl(epochs=1, reverse=True, limit=Config.train_batch_size * 3, write=False)
log_info("Dummy run finished without problems, now starting real training process.")
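
Instead of commenting the check out by hand, the dummy run could be gated behind an option. The following is only a minimal sketch of that idea, not the project's actual code: `TrainOptions`, `skip_batch_check` and `run_training` are hypothetical names, and `train_impl` is passed in as a stand-in for the real function. The two `train_impl` calls mirror the ones shown in the snippet and traceback above. Note that this only avoids the dummy run; it does not remove the leftover Tensors that cause the error when the check is enabled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrainOptions:
    # Hypothetical options object standing in for the project's Config.
    train_batch_size: int = 8
    epochs: int = 3
    skip_batch_check: bool = False  # hypothetical flag, not a confirmed CLI option


def run_training(
    options: TrainOptions,
    train_impl: Callable[..., None],
    log_info: Callable[[str], None] = print,
) -> None:
    """Optionally run the dummy batch size check, then the real training."""
    if not options.skip_batch_check:
        log_info("Performing dummy training to check for memory problems.")
        # Same arguments as the dummy run in the snippet above.
        train_impl(
            epochs=1,
            reverse=True,
            limit=options.train_batch_size * 3,
            write=False,
        )
        log_info(
            "Dummy run finished without problems, now starting real training process."
        )
    # Same call as the real training run visible in the traceback.
    train_impl(epochs=options.epochs, silent_load=True)
```

With `skip_batch_check=True`, only the real training call is made, so no graph from the dummy run exists when the cuDNN RNN layer is built.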
