Multi GPU needs a fix: "using GPU N to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id."
#1925 · Open
FurkanGozukara opened this issue on Feb 7, 2025 · 0 comments
accelerator device: cuda:5
INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:43
INFO Building Flux model dev from BFL checkpoint flux_utils.py:101
INFO Loading state dict from /workspace/flux1-dev.safetensors flux_utils.py:118
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:137
INFO Building CLIP-L flux_utils.py:179
INFO Loading state dict from /workspace/clip_l.safetensors flux_utils.py:275
INFO Loaded CLIP-L: <All keys matched successfully> flux_utils.py:278
INFO Loading state dict from /workspace/t5xxl_fp16.safetensors flux_utils.py:330
2025-02-07 22:21:42 INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:333
INFO Building AutoEncoder flux_utils.py:144
INFO Loading state dict from /workspace/ae.safetensors flux_utils.py:149
INFO Loaded AE: <All keys matched successfully> flux_utils.py:152
INFO [Dataset 0] train_util.py:2589
INFO caching latents with caching strategy. train_util.py:1097
INFO caching latents... train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 1208.53it/s]
[rank6]:[W207 22:21:42.337647981 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2025-02-07 22:21:42 INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:333
INFO Building AutoEncoder flux_utils.py:144
INFO Loading state dict from /workspace/ae.safetensors flux_utils.py:149
2025-02-07 22:21:42 INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:333
INFO Loaded AE: <All keys matched successfully> flux_utils.py:152
INFO Building AutoEncoder flux_utils.py:144
INFO Loading state dict from /workspace/ae.safetensors flux_utils.py:149
INFO [Dataset 0] train_util.py:2589
INFO caching latents with caching strategy. train_util.py:1097
INFO caching latents... train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 905.60it/s]
[rank3]:[W207 22:21:42.619126102 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
INFO Loaded AE: <All keys matched successfully> flux_utils.py:152
2025-02-07 22:21:42 INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:333
INFO Building AutoEncoder flux_utils.py:144
INFO [Dataset 0] train_util.py:2589
INFO caching latents with caching strategy. train_util.py:1097
INFO caching latents... train_util.py:1146
0%|          | 0/28 [00:00<?, ?it/s]
INFO Loading state dict from /workspace/ae.safetensors flux_utils.py:149
100%|██████████| 28/28 [00:00<00:00, 893.88it/s]
[rank4]:[W207 22:21:42.712978449 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
INFO Loaded AE: <All keys matched successfully> flux_utils.py:152
INFO [Dataset 0] train_util.py:2589
INFO caching latents with caching strategy. train_util.py:1097
INFO caching latents... train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 852.14it/s]
[rank7]:[W207 22:21:42.916547387 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
2025-02-07 22:21:47 INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:333
INFO Building AutoEncoder flux_utils.py:144
INFO Loading state dict from /workspace/ae.safetensors flux_utils.py:149
INFO Loaded AE: <All keys matched successfully> flux_utils.py:152
INFO [Dataset 0] train_util.py:2589
INFO caching latents with caching strategy. train_util.py:1097
INFO caching latents... train_util.py:1146
100%|██████████| 28/28 [00:00<00:00, 596.81it/s]
[rank5]:[W207 22:21:47.552844196 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
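For reference, the warning each rank prints names two remedies itself. Here is a minimal sketch of both against plain torch.distributed (an illustration, not the actual sd-scripts code); it assumes the launcher (torchrun or accelerate launch) sets LOCAL_RANK for each process, and the device_id argument to init_process_group() requires a recent PyTorch (2.3+):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / accelerate launch
torch.cuda.set_device(local_rank)           # pin this process to its GPU early

# Remedy 1: declare the device at init time so NCCL never has to guess.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),  # PyTorch 2.3+
)

# Remedy 2: pass the device explicitly at each synchronization point.
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()
```

Either remedy removes the ambiguity the warning describes; without one, NCCL falls back to assuming rank N uses GPU N, which only happens to be correct when the launcher maps ranks to devices in order.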
I am trying to train an 8x GPU LoRA at the moment. Here are the messages.
It hangs as the warning indicates. @kohya-ss
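To check whether the rank-to-GPU mapping is actually what NCCL assumes, a small standalone diagnostic (hypothetical, not part of sd-scripts) could be run under the same 8-process launcher:

```python
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each of the 8 processes should report a matching local_rank and device.
print(f"rank={dist.get_rank()} local_rank={local_rank} "
      f"device=cuda:{torch.cuda.current_device()}")

dist.barrier(device_ids=[local_rank])  # explicit device: no NCCL warning
dist.destroy_process_group()
```

If every rank prints a matching pair, the warning is benign and the hang lies elsewhere; if any pair disagrees, the barrier can deadlock exactly as the message warns.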