TensorFlow 2.13 distributed training fails #61314
Comments
I have the same issue.
Hi @nikita-savelyevv, from your attached code snippet I have changed this line:

When
I have also tested the same code snippet attached by @nikita-savelyevv and found that the program hangs when `devices=2` starts executing. Logs attached below.
Tested code attached here as a Colab gist. Thanks!
@SuryanarayanaY Thanks for reaching out! I agree with your results. For me, the behavior depends on the order the devices are set to. Since the machine you run the code on has 4 GPUs, I would suppose that setting the order accordingly may reproduce it there as well. Anyway, I would assume that these two problems (hanging and throwing an error) are related and may have the same cause.
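For context, the "order of devices" here presumably refers to the device list passed to `tf.distribute.MirroredStrategy` (an assumption on my part); a hypothetical illustration:

```python
import tensorflow as tf

# Hypothetical illustration: the explicit device list controls which GPUs
# participate in MirroredStrategy and in which order replicas are created.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

# Reversing the list changes the replica order:
# strategy = tf.distribute.MirroredStrategy(devices=["/gpu:1", "/gpu:0"])

print("Replicas in sync:", strategy.num_replicas_in_sync)
```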
Adding to the reports of distributed training hanging: a small Fashion-MNIST example to reproduce. The jit_compile'd model fails to train and hangs:
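The original example is not reproduced above; a hedged sketch of that kind of reproducer (a jit_compile'd Keras model trained on Fashion-MNIST under MirroredStrategy, with hyperparameters assumed) might look like:

```python
import tensorflow as tf

# Hedged sketch, not the original gist: Fashion-MNIST training with
# jit_compile=True under MirroredStrategy, the combination reported to
# hang on multi-GPU setups with TF 2.13.
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
        jit_compile=True,
    )

model.fit(x_train, y_train, batch_size=256, epochs=1)
```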
Same issue here! Using 4 GPUs for distributed training on Ubuntu 22.04 with TensorFlow 2.13 hangs at the “compiled cluster using the XLA” line. The issue was solved by downgrading to 2.12.
Same for me with an RTX 8000 and A6000 setup in MirroredStrategy with NCCL and hierarchical cross-device ops. I get a huge block of log output. It would be nice to get some feedback from the team on whether this is reproducible.
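For reference, a rough sketch of this kind of setup (the exact configuration is an assumption, not taken from the report above):

```python
import tensorflow as tf

# Assumed configuration: MirroredStrategy with an explicit cross-device-ops
# choice. Only one of the two strategies would be created in practice.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

# Alternative: hierarchical copy all-reduce instead of NCCL.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```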
Generally it's not a good idea to create multiple
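Assuming the truncated advice above refers to creating multiple strategy objects in the same process, a minimal sketch of creating one `MirroredStrategy` up front and reusing it for successive trainings (the helper and names below are illustrative):

```python
import tensorflow as tf

# Create the strategy once and reuse it, rather than constructing a new
# MirroredStrategy for every training run in the same process.
strategy = tf.distribute.MirroredStrategy()

def train_once(build_model, dataset, epochs=1):
    # build_model is any callable that returns a compiled Keras model.
    with strategy.scope():
        model = build_model()
    model.fit(dataset, epochs=epochs)
    return model

# model_a = train_once(build_model_a, dataset_a)
# model_b = train_once(build_model_b, dataset_b)  # same strategy object reused
```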
I am facing a similar issue to the one described by @xinyu-dev: Ubuntu 22.04 and TensorFlow 2.13.0, but running from a Docker image using
You may refer to my issue in #62234. Currently, using RING instead of NCCL is a temporary workaround (https://github.com/edwardyehuang/iSeg/blob/master/utils/distribution_utils.py). Another workaround (2.13 only at the moment) is to use

Besides, if anyone has a tf-nightly GPU wheel from Mar 17 or April 27, please share it with me so I can test it and see whether pull request #60001 or #59424 caused this issue.
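For illustration, a hedged sketch of the "RING instead of NCCL" idea; this is not the code from the linked distribution_utils.py:

```python
import tensorflow as tf

# For MultiWorkerMirroredStrategy, the collective implementation can be
# forced to RING instead of NCCL via communication options.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=options)

# For single-host MirroredStrategy, a non-NCCL cross-device op avoids NCCL too:
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())
```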
Another thing worth attention: why can the third-party (conda-forge) conda build avoid this issue? (The TensorFlow in the Docker image is installed directly from pip.)
Same issue here. Has anyone found a solution?
Upgrade the NVIDIA driver to >= 545 and the issue should be addressed.
I got a similar error on an NVIDIA GPU while using the tensorflow-federated and nest_asyncio packages. The error appeared when I updated the tensorflow-federated package from version 0.38.0 to 0.73.0. I tried updating nest_asyncio, but it didn't help, so I just muted that message. Check for details.
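For illustration, one common way to mute such messages is to lower TensorFlow's log verbosity; this is a hedged sketch, not necessarily what was done above:

```python
import os

# Raise the C++ minimum log level before TensorFlow is imported
# (0 = all messages, 3 = errors only).
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import tensorflow as tf

# Quiet the Python-side logger as well.
tf.get_logger().setLevel("ERROR")
```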
Similar error. Fixed it by removing the steps_per_epoch argument from model.fit() and model.evaluate():

```python
import sys

import tensorflow as tf
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    # Invalid device or cannot modify virtual devices once initialized.
    pass

# define cnn model
def define_model():
    ...  # model definition elided in the original comment

# compile model
opt = SGD(learning_rate=0.001, momentum=0.9)

# create data generator
datagen = ImageDataGenerator(rescale=1.0/255.0)

# prepare iterators
train_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/train/',
                                       ...)  # remaining arguments elided

# fit model
history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)

# evaluate model
_, acc = model.evaluate(test_it, verbose=1)
```
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
2.13.0
Custom code
No
OS platform and distribution
Linux Ubuntu 20.04.3
Mobile device
Linux Ubuntu 20.04.3
Python version
3.8.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA 11.7, cuDNN 8.6
GPU model and memory
3x NVIDIA GeForce RTX 3090
Current behavior?
When trying to run multiple distributed trainings one after another, one of them fails with a `Collective ops is aborted by: ...` error. The reproducer attached to this issue produces the following error:

When run with TF 2.12, there is no such error.
The original code where I encountered this problem results in a different error, but I wasn't able to reproduce that with a small code snippet.
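As an illustration of the scenario described (a minimal hedged sketch, not the attached reproducer):

```python
import numpy as np
import tensorflow as tf

# Two distributed trainings run back to back in the same process; on TF 2.13
# one of the runs is reported to fail with a collective-ops error.
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

for run in range(2):
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="sgd", loss="mse")
    model.fit(x, y, batch_size=32, epochs=1, verbose=0)
    print(f"run {run} finished")
```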
Standalone code to reproduce the issue
Relevant log output