
Multi-GPU needs a fix: "using GPU to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id." #1925

FurkanGozukara opened this issue Feb 7, 2025 · 0 comments
I am trying to train a LoRA on 8x GPUs at the moment.

Here are the messages.

It hangs, as the warning indicates. @kohya-ss


```
accelerator device: cuda:5
                    INFO     Checking the state dict: Diffusers or BFL, dev or schnell  flux_utils.py:43
                    INFO     Building Flux model dev from BFL checkpoint  flux_utils.py:101
                    INFO     Loading state dict from /workspace/flux1-dev.safetensors  flux_utils.py:118
                    INFO     Loaded Flux: <All keys matched successfully>  flux_utils.py:137
                    INFO     Building CLIP-L  flux_utils.py:179
                    INFO     Loading state dict from /workspace/clip_l.safetensors  flux_utils.py:275
                    INFO     Loaded CLIP-L: <All keys matched successfully>  flux_utils.py:278
                    INFO     Loading state dict from /workspace/t5xxl_fp16.safetensors  flux_utils.py:330
2025-02-07 22:21:42 INFO     Loaded T5xxl: <All keys matched successfully>  flux_utils.py:333
                    INFO     Building AutoEncoder  flux_utils.py:144
                    INFO     Loading state dict from /workspace/ae.safetensors  flux_utils.py:149
                    INFO     Loaded AE: <All keys matched successfully>  flux_utils.py:152
                    INFO     [Dataset 0]  train_util.py:2589
                    INFO     caching latents with caching strategy.  train_util.py:1097
                    INFO     caching latents...  train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 1208.53it/s]
[rank6]:[W207 22:21:42.337647981 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6]  using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2025-02-07 22:21:42 INFO     Loaded T5xxl: <All keys matched successfully>  flux_utils.py:333
                    INFO     Building AutoEncoder  flux_utils.py:144
                    INFO     Loading state dict from /workspace/ae.safetensors  flux_utils.py:149
2025-02-07 22:21:42 INFO     Loaded T5xxl: <All keys matched successfully>  flux_utils.py:333
                    INFO     Loaded AE: <All keys matched successfully>  flux_utils.py:152
                    INFO     Building AutoEncoder  flux_utils.py:144
                    INFO     Loading state dict from /workspace/ae.safetensors  flux_utils.py:149
                    INFO     [Dataset 0]  train_util.py:2589
                    INFO     caching latents with caching strategy.  train_util.py:1097
                    INFO     caching latents...  train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 905.60it/s]
[rank3]:[W207 22:21:42.619126102 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
                    INFO     Loaded AE: <All keys matched successfully>  flux_utils.py:152
2025-02-07 22:21:42 INFO     Loaded T5xxl: <All keys matched successfully>  flux_utils.py:333
                    INFO     Building AutoEncoder  flux_utils.py:144
                    INFO     [Dataset 0]  train_util.py:2589
                    INFO     caching latents with caching strategy.  train_util.py:1097
                    INFO     caching latents...  train_util.py:1146
  0%|          | 0/28 [00:00<?, ?it/s]
                    INFO     Loading state dict from /workspace/ae.safetensors  flux_utils.py:149
100%|██████████| 28/28 [00:00<00:00, 893.88it/s]
[rank4]:[W207 22:21:42.712978449 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
                    INFO     Loaded AE: <All keys matched successfully>  flux_utils.py:152
                    INFO     [Dataset 0]  train_util.py:2589
                    INFO     caching latents with caching strategy.  train_util.py:1097
                    INFO     caching latents...  train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 852.14it/s]
[rank7]:[W207 22:21:42.916547387 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7]  using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2025-02-07 22:21:47 INFO     Loaded T5xxl: <All keys matched successfully>  flux_utils.py:333
                    INFO     Building AutoEncoder  flux_utils.py:144
                    INFO     Loading state dict from /workspace/ae.safetensors  flux_utils.py:149
                    INFO     Loaded AE: <All keys matched successfully>  flux_utils.py:152
                    INFO     [Dataset 0]  train_util.py:2589
                    INFO     caching latents with caching strategy.  train_util.py:1097
                    INFO     caching latents...  train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 596.81it/s]
[rank5]:[W207 22:21:47.552844196 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5]  using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
```
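The warning text itself names the two remedies: pin each rank's GPU before the first collective, either by passing `device_id` to `init_process_group()` or `device_ids` to `barrier()`. A minimal sketch of what that could look like (the helper names here are hypothetical, and since sd-scripts drives distributed setup through accelerate, the real fix likely belongs in the library/launcher rather than user code):

```python
import os

import torch
import torch.distributed as dist


def local_rank() -> int:
    # torchrun / accelerate launch export LOCAL_RANK for each worker process
    return int(os.environ.get("LOCAL_RANK", "0"))


def init_distributed() -> None:
    rank = local_rank()
    # Bind this process to its GPU *before* any collective runs,
    # and tell the process group the rank-to-GPU mapping explicitly
    # so NCCL never has to guess which device to use for barrier().
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        device_id=torch.device(f"cuda:{rank}"),
    )


def safe_barrier() -> None:
    # Alternative remedy from the warning: pin the device per barrier call.
    dist.barrier(device_ids=[local_rank()])
```

With the mapping declared up front, `barrier()` no longer has to infer which GPU belongs to the rank, which is exactly the condition the W207 warning says can cause a hang.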