
IndexError in cross_device_ops with MultiWorkerMirroredStrategy #740

Closed
rapsealk opened this issue May 21, 2024 · 1 comment
rapsealk commented May 21, 2024

Hello MLCommons team!

When using MultiWorkerMirroredStrategy, cross_device_ops raises an IndexError during optimizer.apply_gradients().

self.optimizer.apply_gradients(
    zip(replica_accum_grads, self.training_vars))

https://github.com/tensorflow/tensorflow/blob/64918868e2154b06c7479347a59a4230f785e9fa/tensorflow/python/distribute/cross_device_ops.py#L1140-L1142

Summary:
When utilizing MultiWorkerMirroredStrategy in a distributed training setup, an IndexError is encountered during the execution of optimizer.apply_gradients(), specifically within the cross_device_ops component.

for per_replica in reversed(per_replica_values):
  for i in range(len(self._devices)):
    values_by_device[i].append(per_replica.values[i])

Details:

  • Environment:

    • TensorFlow Version: 2.4.0
    • Cluster Setup: Multi-node with 2 nodes
    • Strategy: MultiWorkerMirroredStrategy
  • Issue Description:
    During the training process, specifically at the point of executing optimizer.apply_gradients(), an IndexError is raised from the cross_device_ops component. This error disrupts the training workflow, preventing successful completion of the training process across multiple nodes.

  • Reproduction Steps:

    1. Configure the cluster environment with appropriate TF_CONFIG settings for multi-node operation.
    2. Initialize MultiWorkerMirroredStrategy within the training script (a minimal setup sketch follows this list).
    3. Execute the training script which involves defining a model, compiling it, and calling model.fit() on the distributed dataset.
    4. Observe the occurrence of IndexError during the optimizer.apply_gradients() call.
num_gpus=8
num_workers=2
# $WORKER_ID is 0 on host0 and 1 on host1.
TF_CONFIG="{\"cluster\": {\"worker\": [\"host0:12345\", \"host1:12345\"]}, \"task\": {\"type\": \"worker\", \"index\": $WORKER_ID}}" \
python training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py \
  --distribution_strategy=multi_worker_mirrored \
  --all_reduce_alg=nccl \
  --batch_size=$(( 128 * $num_gpus * $num_workers )) \
  --enable_eager \
  --num_gpus=$num_gpus \
  --lr_schedule=polynomial \
  --optimizer=LARS
  • Expected Behavior:
    The optimizer.apply_gradients() should execute without errors, allowing the training process to proceed correctly across all nodes in the cluster.

  • Observed Behavior:
    An IndexError is raised during the optimizer.apply_gradients() call, originating from the cross_device_ops, which disrupts the training process.

Impact:
This issue prevents distributed training with MultiWorkerMirroredStrategy from completing at all, so the workload cannot be scaled across multiple nodes.

Traceback (most recent call last):
  File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 269, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 262, in main
    stats = run(flags.FLAGS)
  File "/home/work/mlperf/training/image_classification/tensorflow2/resnet_ctl_imagenet_main.py", line 244, in run
    resnet_controller.train(evaluate=not flags_obj.skip_eval)
  File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/controller.py", line 257, in train
    train_outputs = self.train_fn(steps_per_loop)
  File "/home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/standard_runnable.py", line 65, in train
    self.train_loop_fn(self.train_iter, num_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in user code:

    /home/work/mlperf/training/image_classification/tensorflow2/tf2_common/training/utils.py:91 loop_fn  *
        step_fn(iterator)
    /home/work/mlperf/training/image_classification/tensorflow2/resnet_runnable.py:350 _apply_grads_and_clear  *
        distribution.extended.call_for_each_replica(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2730 call_for_each_replica  **
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:629 _call_for_each_replica
        self._container_strategy(), fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:93 call_for_each_replica
        return _call_for_each_replica(strategy, fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:234 _call_for_each_replica
        coord.join(threads)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:389 join
        six.reraise(*self._exc_info_to_raise)
    /usr/local/lib/python3.6/dist-packages/six.py:703 reraise
        raise value
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py:297 stop_on_exception
        yield
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_run.py:228 _call_for_each_replica
        **merge_kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/utils.py:152 _all_reduce_sum_fn  **
        grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2374 batch_reduce_to
        return self._batch_reduce_to(reduce_op, value_destination_pairs, options)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py:697 _batch_reduce_to
        options=self._communication_options.merge(options))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:426 batch_reduce
        options)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 batch_reduce_implementation
        for value, dest in value_destination_pairs
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1094 <listcomp>
        for value, dest in value_destination_pairs
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1050 reduce_implementation
        options)[0]
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1103 _batch_all_reduce
        options)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/cross_device_ops.py:1142 _do_batch_all_reduce_dense
        values_by_device[i].append(per_replica.values[i])

    IndexError: tuple index out of range


ShriyaPalsamudram commented Jul 30, 2024

Sorry, but the resnet50 benchmark has been dropped from the training benchmark suite, so this issue cannot be addressed at this time.
