Summary:
When using MultiWorkerMirroredStrategy in a distributed training setup, an IndexError is raised during optimizer.apply_gradients(), specifically from the cross_device_ops component.
Issue Description:
During training, at the point where optimizer.apply_gradients() executes, an IndexError is raised from the cross_device_ops component. The error aborts the training loop and prevents the job from completing across the participating nodes.
Reproduction Steps:
Configure the cluster environment with appropriate TF_CONFIG settings for multi-node operation.
Initialize MultiWorkerMirroredStrategy within the training script.
Execute the training script, which defines a model, compiles it, and calls model.fit() on the distributed dataset.
Observe the occurrence of IndexError during the optimizer.apply_gradients() call.
num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
  --distribution_strategy=multi_worker_mirrored \
  --all_reduce_alg=nccl \
  --batch_size=$((128*$num_gpus*$num_workers)) \
  --enable_eager \
  --num_gpus=$num_gpus \
  --lr_schedule=polynomial \
  --optimizer=LARS
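For reference, the steps above can be exercised with a much smaller script than the MLPerf one. The following is a minimal sketch only: the toy model, synthetic dataset, and hyperparameters are placeholders, and on older TensorFlow releases the strategy and its NCCL option may still live under tf.distribute.experimental.

import tensorflow as tf

# Minimal multi-worker sketch (assumes TF_CONFIG is exported on each worker as in
# the command above). Model, dataset, and hyperparameters are toy placeholders,
# not the MLPerf ResNet/LARS setup.
communication = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=communication)

global_batch = 128 * strategy.num_replicas_in_sync

def make_dataset():
    # Synthetic stand-in for the ImageNet input pipeline.
    images = tf.random.uniform([1024, 32, 32, 3])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
    return (tf.data.Dataset.from_tensor_slices((images, labels))
            .repeat()
            .batch(global_batch))

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(0.1),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"])

# The failure described in this report surfaces inside apply_gradients(), which
# model.fit() invokes on every replica during the backward pass.
model.fit(make_dataset(), steps_per_epoch=100, epochs=1)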
Expected Behavior:
The optimizer.apply_gradients() should execute without errors, allowing the training process to proceed correctly across all nodes in the cluster.
Observed Behavior:
An IndexError is raised during the optimizer.apply_gradients() call, originating from the cross_device_ops, which disrupts the training process.
Impact:
This issue prevents the successful execution of distributed training with MultiWorkerMirroredStrategy, hindering the scalability and efficiency of the training process across multiple nodes.
Traceback (most recent call last):
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in<module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
stats = run(flags.FLAGS)
File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
resnet_controller.train(evaluate=not flags_obj.skip_eval)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
train_outputs = self.train_fn(steps_per_loop)
File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
self.train_loop_fn(self.train_iter, num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:
/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn *
step_fn(iterator)
/home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:350 _apply_grads_and_clear *
distribution.extended.call_for_each_replica(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica **
return self._call_for_each_replica(fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:629 _call_for_each_replica
self._container_strategy(), fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:93 call_for_each_replica
return _call_for_each_replica(strategy, fn, args, kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:234 _call_for_each_replica
coord.join(threads)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:389 join
six.reraise(*self._exc_info_to_raise)
/usr/local/lib/python3.6/dist-packages/six.py:703 reraise
raise value
/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
yield
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:228 _call_for_each_replica
**merge_kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/utils.py:152 _all_reduce_sum_fn **
grads_and_vars)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2374 batch_reduce_to
return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:697 _batch_reduce_to
options=self._communication_options.merge(options))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:426 batch_reduce
options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 batch_reduce_implementation
for value, dest in value_destination_pairs
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 <listcomp>
for value, dest in value_destination_pairs
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1050 reduce_implementation
options)[0]
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1103 _batch_all_reduce
options)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1142 _do_batch_all_reduce_dense
values_by_device[i].append(per_replica.values[i])
IndexError: tuple index out of range
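For context, the last frame of the traceback is the per-device regrouping step in the batched dense all-reduce. The snippet below is a simplified stand-in for that loop, not the TensorFlow source: only the values_by_device[i].append(per_replica.values[i]) expression is taken from the traceback, and names such as local_devices and the toy PerReplica tuple are illustrative assumptions. It shows how a per-replica gradient carrying fewer components than the list of local devices produces exactly this "IndexError: tuple index out of range".

import collections

# Toy stand-in for tf.distribute's PerReplica container.
PerReplica = collections.namedtuple("PerReplica", ["values"])

# 8 local GPUs per worker, as in the reproduction command.
local_devices = [f"/gpu:{i}" for i in range(8)]

# A healthy per-replica gradient has one component per local device ...
ok_grad = PerReplica(values=tuple(f"grad_on_{d}" for d in local_devices))
# ... but a value with fewer components than devices cannot be indexed by the
# device position.
bad_grad = PerReplica(values=("grad_on_/gpu:0",))

values_by_device = [[] for _ in local_devices]
try:
    for per_replica in (ok_grad, bad_grad):
        for i in range(len(local_devices)):
            # Same expression as cross_device_ops.py:1142 in the traceback.
            values_by_device[i].append(per_replica.values[i])
except IndexError as err:
    print("IndexError:", err)  # -> IndexError: tuple index out of range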
Hello mlcommons teams!
When using MultiWorkerMirroredStrategy, it has been observed that cross_device_ops raises an IndexError during optimizer.apply_gradients().
Relevant code: training/image_classification/tensorflow2/resnet_runnable.py, lines 323 to 324 at 87405ce, and the TensorFlow all-reduce loop at:
https://github.com/tensorflow/tensorflow/blob/64918868e2154b06c7479347a59a4230f785e9fa/tensorflow/python/distribute/cross_device_ops.py#L1140-L1142
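One way to narrow this down is to print, on each worker, how the cluster and the local GPUs are being resolved, since a mismatch between the per-replica value count and the strategy's local device list would explain the out-of-range index. This is a hypothetical diagnostic, not part of the MLPerf scripts; run it with the same TF_CONFIG as the failing job and compare the output across workers.

import json
import os

import tensorflow as tf

# Hypothetical per-worker diagnostic.
print("TF_CONFIG:", json.loads(os.environ.get("TF_CONFIG", "{}")))
print("Local GPUs:", tf.config.list_physical_devices("GPU"))

strategy = tf.distribute.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
print("Local worker devices:", strategy.extended.worker_devices)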
Refs:
local_replica_id with MultiWorkerMirroredStrategy #739