[Core][Tune] Ray tune cannot be used with pytorch-lightning 1.7.0 due to processes spawned with fork. #27493
Comments
cc @JiahaoYao
Hi @Alfredvc, I wonder whether you saw the same behavior with pytorch-lightning 1.6?
Hi @JiahaoYao. No, I have been using pytorch-lightning 1.6.4 in my projects without issue. I just double-checked by running the reproduction script on 1.6.4 and it works as expected.
For what it's worth, I just bumped into this issue too and agree with @Alfredvc's diagnosis.
@JiahaoYao Is there any plan to support PTL 1.7.0 in Ray Lightning? Would the Ray Lightning plugin solve this issue (given that it is based on SpawnedStrategy)?
@xwjiang2010 Support for pytorch-lightning 1.7 in Ray Lightning is a work in progress (ray-project/ray_lightning#194).
Another report: cc @amogkam
Hi, I replied on the linked issue on the PL side with this proposal:
Update:
I hope this will unblock you soon. Thank you.
Per Triage Sync: @jiaodong please repro on master and close if ok.
@amogkam Excellent, good job!
Another workaround is:
Nice work, great fix @krfricke @amogkam! By the way, in the meantime we have worked with PyTorch to remove this hack on the Lightning side. First we proposed some changes on the PyTorch side (pytorch/pytorch#83973); after they landed, we ported the changes back to Lightning (Lightning-AI/pytorch-lightning#14631). Finally, in PyTorch >= 1.14 (and some future Lightning version) this hack will no longer be necessary (Lightning-AI/pytorch-lightning#15110), and then eventually Ray can drop this workaround too! <3
awesome @awaelchli :) thanks for all your hard work!
What happened + What you expected to happen
As part of "Add support for DDP fork", included in pytorch-lightning 1.7.0, the direct calls to `torch.cuda.device_count()` and `torch.cuda.is_available()` in the pytorch-lightning codebase were replaced with new wrapper functions (`num_cuda_devices()` and `is_cuda_available()`). These functions internally create a `multiprocessing.Pool` using the `fork` start method, and that call waits forever when run inside a Ray actor.
This is a critical breaking change: `pytorch_lightning.Trainer` calls these functions, and therefore cannot be used at all under Ray Tune.
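For reference, the fork-based query looks roughly like this (a sketch of the pytorch-lightning 1.7.0 approach, not the exact source):

```python
import multiprocessing

import torch


def num_cuda_devices() -> int:
    # Query the CUDA device count in a forked subprocess so that CUDA is
    # never initialized in the parent process (needed for DDP fork support).
    # When the parent process is a Ray actor, this fork-based Pool never
    # returns, which is the hang described above.
    with multiprocessing.get_context("fork").Pool(1) as pool:
        return pool.apply(torch.cuda.device_count)
```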
The reproduction script below always hangs. However, while experimenting I found it difficult to create a minimal reproduction: sometimes a script works, then fails when re-run, and sometimes changing a seemingly unrelated line of code makes a working script fail. I haven't dug deeply enough into the Ray codebase to understand why this is the case.
For my larger projects, Ray Tune simply cannot be used with pytorch-lightning 1.7.0, as these calls always hang. My current workaround is to monkeypatch `torch.multiprocessing.get_all_start_methods`. As far as I can tell, it is known that Ray does not work with forked processes (https://discuss.ray.io/t/best-solution-to-have-multiprocess-working-in-actor/2165/8). However, given how widely pytorch-lightning is used in the ML ecosystem, this issue may be worth looking into.
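The monkeypatch looks roughly like this (a sketch, assuming pytorch-lightning falls back to calling `torch.cuda` directly when "fork" is not reported as an available start method):

```python
import torch.multiprocessing

# Hide the "fork" start method from pytorch-lightning so that its device
# queries fall back to calling torch.cuda.device_count() /
# torch.cuda.is_available() directly instead of going through a forked Pool.
_original_get_all_start_methods = torch.multiprocessing.get_all_start_methods


def _get_all_start_methods_no_fork():
    return [m for m in _original_get_all_start_methods() if m != "fork"]


torch.multiprocessing.get_all_start_methods = _get_all_start_methods_no_fork
```

The patch has to run inside the trainable (i.e., inside the Ray actor) before `pytorch_lightning.Trainer` is constructed.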
Versions / Dependencies
ray-tune 1.13.0
pytorch 1.12.0
pytorch-lightning 1.7.0
python 3.8.10
OS: Ubuntu 20.04.4 LTS
Reproduction script
Submitted to a Ray cluster with:

```
ray job submit --runtime-env-json='{"working_dir": "./"}' -- python test.py
```
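The original `test.py` is not reproduced here. A minimal sketch of the kind of script that triggers the hang (hypothetical, reconstructed from the description above) is:

```python
# test.py -- hypothetical minimal reproduction, not the original script.
import pytorch_lightning as pl
from ray import tune


def train_fn(config):
    # Merely constructing the Trainer triggers pytorch-lightning 1.7.0's
    # fork-based CUDA queries, which hang inside a Ray actor.
    trainer = pl.Trainer(max_epochs=1)


if __name__ == "__main__":
    tune.run(train_fn, num_samples=1)
```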
Issue Severity
Medium: It is a significant difficulty but I can work around it.