
[Core][Tune] Ray tune cannot be used with pytorch-lightning 1.7.0 due to processes spawned with fork. #27493

Closed
Alfredvc opened this issue Aug 4, 2022 · 15 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks), tune (Tune-related issues)

Comments

@Alfredvc

Alfredvc commented Aug 4, 2022

What happened + What you expected to happen

As part of "Add support for DDP fork", included in pytorch-lightning 1.7.0, calls to:

torch.cuda.device_count()
torch.cuda.is_available()

in the pytorch lightning codebase were replaced with new functions:

pytorch_lightning.utilities.device_parser.num_cuda_devices()
pytorch_lightning.utilities.device_parser.is_cuda_available()

These functions internally create a multiprocessing.Pool with the fork start method:

with multiprocessing.get_context("fork").Pool(1) as pool:
    return pool.apply(torch.cuda.device_count)

This call hangs forever when run inside a Ray Actor:

(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
(train pid=139, ip=172.22.0.3) 		self._bootstrap_inner()
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(train pid=139, ip=172.22.0.3) 		self.run()
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/ray/tune/function_runner.py", line 277, in run
(train pid=139, ip=172.22.0.3) 		self._entrypoint()
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/ray/tune/function_runner.py", line 349, in entrypoint
(train pid=139, ip=172.22.0.3) 		return self._trainable_func(
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
(train pid=139, ip=172.22.0.3) 		return method(self, *_args, **_kwargs)
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/ray/tune/function_runner.py", line 645, in _trainable_func
(train pid=139, ip=172.22.0.3) 		output = fn()
(train pid=139, ip=172.22.0.3) 	File "test.py", line 9, in train
(train pid=139, ip=172.22.0.3) 		pl.Trainer(
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/argparse.py", line 345, in insert_env_defaults
(train pid=139, ip=172.22.0.3) 		return fn(self, **kwargs)
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 537, in __init__
(train pid=139, ip=172.22.0.3) 		self._setup_on_init(num_sanity_val_steps)
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 618, in _setup_on_init
(train pid=139, ip=172.22.0.3) 		self._log_device_info()
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1739, in _log_device_info
(train pid=139, ip=172.22.0.3) 		if CUDAAccelerator.is_available():
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/cuda.py", line 91, in is_available
(train pid=139, ip=172.22.0.3) 		return device_parser.num_cuda_devices() > 0
(train pid=139, ip=172.22.0.3) 	File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/device_parser.py", line 346, in num_cuda_devices
(train pid=139, ip=172.22.0.3) 		return pool.apply(torch.cuda.device_count)
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/pool.py", line 736, in __exit__
(train pid=139, ip=172.22.0.3) 		self.terminate()
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/pool.py", line 654, in terminate
(train pid=139, ip=172.22.0.3) 		self._terminate()
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/util.py", line 224, in __call__
(train pid=139, ip=172.22.0.3) 		res = self._callback(*self._args, **self._kwargs)
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/pool.py", line 729, in _terminate_pool
(train pid=139, ip=172.22.0.3) 		p.join()
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/process.py", line 149, in join
(train pid=139, ip=172.22.0.3) 		res = self._popen.wait(timeout)
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
(train pid=139, ip=172.22.0.3) 		return self.poll(os.WNOHANG if timeout == 0.0 else 0)
(train pid=139, ip=172.22.0.3) 	File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
(train pid=139, ip=172.22.0.3) 		pid, sts = os.waitpid(self.pid, flag)
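
For reference, the failing pattern can be reduced to a fork-based pool inside a plain Ray actor, with no Lightning involved. A minimal sketch (the ForkCheck actor and the multiprocessing.cpu_count stand-in are illustrative, not taken from the report above):

import multiprocessing

import ray


@ray.remote
class ForkCheck:
    def count(self):
        # Same pattern as PL 1.7's num_cuda_devices(): a fork-based pool inside an actor.
        with multiprocessing.get_context("fork").Pool(1) as pool:
            return pool.apply(multiprocessing.cpu_count)


ray.init()
print(ray.get(ForkCheck.remote().count.remote()))  # may hang here, as in the traceback above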

This is a critical breaking change: pytorch_lightning.Trainer calls these functions during construction, so it cannot be used inside a Tune trial at all.

The reproduction script below always hangs. However, during my experimentation I found it difficult to create a minimal reproduction: sometimes a script works and then fails when re-run, and sometimes changing a seemingly unrelated line of code makes a working script fail. I haven't dug deep enough into the Ray codebase to understand why this is the case.

For my larger projects, Ray Tune simply cannot be used with pytorch-lightning 1.7.0, as these calls always hang. My current workaround is to monkeypatch torch.multiprocessing.get_all_start_methods:

patched_start_methods = [m for m in torch.multiprocessing.get_all_start_methods() if m != "fork"]
torch.multiprocessing.get_all_start_methods = lambda: patched_start_methods
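
For context, a minimal sketch of where such a patch would have to run; my assumption is that it must execute inside the trainable (i.e., in the actor process) before pl.Trainer is constructed:

import pytorch_lightning as pl
import torch.multiprocessing


def train(config):
    # Patch inside the trial process, before the Trainer probes CUDA devices.
    patched = [m for m in torch.multiprocessing.get_all_start_methods() if m != "fork"]
    torch.multiprocessing.get_all_start_methods = lambda: patched
    pl.Trainer(accelerator="gpu", devices=1)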

As far as I can tell, it is a known limitation that Ray does not work with forked processes (https://discuss.ray.io/t/best-solution-to-have-multiprocess-working-in-actor/2165/8). However, given that pytorch-lightning is such a widely used library in the ML ecosystem, this issue may be worth looking into.

Versions / Dependencies

ray-tune 1.13.0
pytorch 1.12.0
pytorch-lightning 1.7.0
python 3.8.10
OS: Ubuntu 20.04.4 LTS

Reproduction script

import pytorch_lightning as pl
from ray import tune


def train(config):
    pl.Trainer(accelerator="gpu", devices=1)


def run():
    tune.run(
        train,
        resources_per_trial={"cpu": 8, "gpu": 1},
        log_to_file=["stdout.txt", "stderr.txt"], # For some reason removing this line makes the script work
        config={},
        num_samples=1,
        name="Test",
    )


if __name__ == "__main__":
    run()

Submitted to a ray cluster with

ray job submit --runtime-env-json='{"working_dir": "./"}' -- python test.py

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@Alfredvc added the bug and triage labels Aug 4, 2022
@krfricke added the P1 label and removed the triage label Aug 4, 2022
@krfricke
Contributor

krfricke commented Aug 4, 2022

cc @JiahaoYao

@JiahaoYao
Contributor

Hi @Alfredvc, I wonder whether you saw the same behavior with pytorch-lightning 1.6?

@Alfredvc
Author

Alfredvc commented Aug 4, 2022

Hi @JiahaoYao. No, I have been using pytorch-lightning 1.6.4 in my projects without issue. I just double-checked by running the reproduction script with 1.6.4, and it works as expected without any problems.

@Alfredvc changed the title from "[Core | Tune] Ray tune cannot be used with pytorch-lightning 1.7.0 due to processes spawned with fork." to "[Core][Tune] Ray tune cannot be used with pytorch-lightning 1.7.0 due to processes spawned with fork." Aug 5, 2022
@merrysailor

For what it's worth, I just bumped into this issue too and agree with @Alfredvc's diagnosis.

@xwjiang2010
Contributor

@JiahaoYao Is there any plan to support ptl 1.7.0 in Ray-lightning? Would the Ray-lightning plugin solve this issue (since that plugin is based on SpawnedStrategy)?

@JiahaoYao
Contributor

@xwjiang2010, support for pytorch-lightning 1.7 in Ray-lightning is a work in progress (ray-project/ray_lightning#194).

@xwjiang2010
Contributor

Another report:
Lightning-AI/pytorch-lightning#14292

cc @amogkam

@awaelchli

Hi, I replied on the linked issue on the PL side with this proposal:

I talked with some of the team and we think it is best if we introduce an environment variable "PL_DISABLE_FORK" that, when set to 1 by the user or by Ray, will have PL avoid all forking calls. How does this solution sound?
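
A user-side sketch of that proposal (assuming the variable only needs to be set before pl.Trainer runs its device checks):

import os

# Opt in to the proposed behavior: ask PL to avoid fork-based device checks.
os.environ["PL_DISABLE_FORK"] = "1"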

@awaelchli

awaelchli commented Aug 26, 2022

Update:

I hope this will unblock you soon. Thank you.

@richardliaw added the tune and core labels Oct 7, 2022
@hora-anyscale
Contributor

Per Triage Sync: @jiaodong please repro on master and close if ok.

@amogkam
Contributor

amogkam commented Oct 13, 2022

This has been fixed by @krfricke in #28335.

@Alfredvc if you use Ray Tune master, this will no longer hang! The fix will be included in the upcoming 2.1 release.

@amogkam closed this as completed Oct 13, 2022
@Alfredvc
Author

@amogkam Excellent, good job!

@richardliaw
Contributor

Another workaround is runtime_env={"env_vars": {"PL_DISABLE_FORK": "1"}}
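
A minimal sketch of wiring that up from the driver, assuming the environment variable should reach every Tune trial via the job-level runtime environment:

import ray

# Propagate PL_DISABLE_FORK to all worker processes so Lightning skips fork-based checks.
ray.init(runtime_env={"env_vars": {"PL_DISABLE_FORK": "1"}})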

@awaelchli

Nice work, great fix @krfricke @amogkam!

By the way, in the meantime we have worked with PyTorch to remove this hack on the Lightning side. First we proposed some changes on the PyTorch side (pytorch/pytorch#83973); after they landed, we ported the changes back to Lightning (Lightning-AI/pytorch-lightning#14631). Finally, in PyTorch >=1.14 (and some future Lightning version) this hack will no longer be necessary (Lightning-AI/pytorch-lightning#15110), and then eventually Ray can drop this workaround too! <3

@richardliaw
Contributor

awesome @awaelchli :) thanks for all your hard work!
