Improving Hydra + PL + DDP experience #2070
@jgbos thank you for taking the time to improve Hydra + PL. Sharing some early thoughts here...
One downside is that resolving could be expensive if the user employs custom resolvers that do heavy computation. See my comment here.
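To illustrate the concern (a minimal sketch, not from the thread; the resolver name and the slow lookup are made up): if a launcher eagerly resolved the whole config before spawning jobs, a custom resolver like this would run its expensive work for every job, even when the resolved value is never used.

```python
import time

from omegaconf import OmegaConf


def expensive_lookup(dataset: str) -> int:
    """Stand-in for slow work, e.g. scanning a large dataset or querying a remote service."""
    time.sleep(5)
    return 1_281_167  # e.g. number of samples in the dataset


# Register the custom resolver; OmegaConf only calls it when the value is accessed.
OmegaConf.register_new_resolver("num_samples", expensive_lookup)

cfg = OmegaConf.create({"steps_per_epoch": "${num_samples:imagenet}"})
print(cfg.steps_per_epoch)  # the slow lookup only happens here, at access time
```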
I've been quite busy lately so I haven't had much time to address the PR. It does look like PL is updating their approach to allow for easy custom launching. Yet, I think this should be even easier for Hydra users. Since we have configs before execution, a user can create a task like this:

```python
def task(cfg: DictConfig) -> None:
    trainer = instantiate(cfg.trainer)
    module = instantiate(cfg.module)
    if cfg._ddp_fitting:
        trainer.fit(module)
    else:
        trainer.test(module)


@hydra.main(config_path=None, config_name="config")
def main(cfg):
    task(cfg)


if __name__ == "__main__":
    main()
```

and the PL command to launch each DDP subprocess should just be:

What are your thoughts on this?
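For reference, a config that the task() sketch above could consume might look like the following. This is a hypothetical sketch: the `_target_` paths and values are illustrative, and only the `trainer`, `module`, and `_ddp_fitting` keys come from the snippet above.

```python
from omegaconf import OmegaConf

# Hypothetical config for task(); instantiate(cfg.trainer) / instantiate(cfg.module)
# would build these objects from their `_target_` entries.
cfg = OmegaConf.create(
    {
        "_ddp_fitting": True,
        "trainer": {
            "_target_": "pytorch_lightning.Trainer",
            "accelerator": "gpu",
            "devices": 2,
            "strategy": "ddp",
        },
        "module": {"_target_": "my_project.MyLightningModule"},  # illustrative module path
    }
)
```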
So #394 means that we will be able to detect multirun mode by inspecting the Hydra config. Overall the plan sounds good. Does the following example make sense to you @jieru-hu?
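For context, detecting the run mode from inside a Hydra application (the same mechanism the monkey patch shared later in this thread relies on) looks roughly like this:

```python
from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode

# Must be called from within a running Hydra app (e.g. inside the @hydra.main function).
hydra_cfg = HydraConfig.get()
if hydra_cfg.mode == RunMode.MULTIRUN:
    print("launched via --multirun / sweep:", hydra_cfg.sweep.dir)
else:
    print("single run:", hydra_cfg.run.dir)
```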
@jieru-hu apologies for the slow response. I've been swamped. Hopefully I'll have more time in the next week. I'll let you know if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra solution specifically. It would definitely be interesting to see what a Hydra solution would be.
exactly :)
sure, sounds good!
So I'm trying to revive Lightning-AI/pytorch-lightning#11617. Where can I find more info on the result of #1805? Edit: nvm, I found the
As of 2023/09/09, running a sweep/multirun of multiple PL training jobs with DDP still results in the newly spawned DDP processes (train_ddp_process_1 and so on) writing their own Hydra output files, e.g. this fragment of one of the extra multirun.yaml files:

```yaml
...
job:
  name: train_ddp_process_1  # and there are copies of multirun.yaml corresponding to `train_ddp_process_2` and so on
  chdir: null
  override_dirname: sweep=ddp_test,tags=ddp_test,trainer.max_epochs=2
  id: ???
  num: ???
  config_name: train.yaml
  env_set: {}
  env_copy: []
...
```

My hydra config:

```yaml
...
run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with '.' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}
job_logging:
  handlers:
    file:
      filename: ${hydra.runtime.output_dir}/job.log
...
```

@jieru-hu I'm willing to look into it. Can you give me some advice?
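A side note on the `${eval:...}` used in sweep.subdir above: `eval` is not a built-in Hydra/OmegaConf resolver, so a config like this assumes it was registered somewhere in the project, roughly like:

```python
from omegaconf import OmegaConf

# Assumed registration of the `eval` resolver referenced by the sweep.subdir expression above.
OmegaConf.register_new_resolver("eval", eval)

# The subdir expression then just performs a Python string replace at resolution time:
print(eval('"sweep=ddp_test,tags=a/b".replace("/", ".")'))  # -> sweep=ddp_test,tags=a.b
```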
Turns out the problem is on the Lightning side (https://github.com/Lightning-AI/lightning/blob/1bee50b31d78c66f1182571a2c022356625a8fbb/src/lightning/fabric/strategies/launchers/subprocess_script.py#L175), in how Lightning builds the Hydra command for DDP subprocesses. I could create a PR for this. Do you have any other suggestions? @jieru-hu
Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra. I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!
Thanks for the info. It would actually be a PR on the Lightning side. Right now my monkey-patching solution is as follows:

```python
import sys

import lightning.fabric.strategies.launchers.subprocess_script
import lightning.pytorch.strategies.launchers.subprocess_script


def _hydra_subprocess_cmd(local_rank: int):
    """
    Monkey patching for lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd

    Temporarily fixes the problem of unnecessarily creating log folders for DDP subprocesses
    in Hydra multirun/sweep.
    """
    import __main__  # local import to avoid https://github.com/Lightning-AI/lightning/issues/15218

    from hydra.core.hydra_config import HydraConfig
    from hydra.types import RunMode
    from hydra.utils import get_original_cwd, to_absolute_path

    # when the user is using Hydra, find the absolute path to the script (or the module name)
    if __main__.__spec__ is None:  # pragma: no-cover
        command = [sys.executable, to_absolute_path(sys.argv[0])]
    else:
        command = [sys.executable, "-m", __main__.__spec__.name]

    command += sys.argv[1:] + [f"hydra.job.name=train_ddp_process_{local_rank}", "hydra.output_subdir=null"]

    cwd = get_original_cwd()
    config = HydraConfig.get()
    if config.mode == RunMode.RUN:
        command += [f'hydra.run.dir="{config.run.dir}"']
    elif config.mode == RunMode.MULTIRUN:
        command += [f'hydra.sweep.dir="{config.sweep.dir}"']
    print(command)
    return command, cwd


lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
lightning.pytorch.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
```

This solution pretty much fixes the problem of DDP subprocesses creating unwanted job log folders when running a sweep/multirun with Lightning and DDP training, but one small problem remains, as I previously suspected: DDP subprocesses still create multirun.yaml, and the last subprocess's multirun.yaml will overwrite the multirun.yaml created by the main process. It would be nice to allow disabling saving multirun.yaml in subprocesses like DDP (see #2341, #1773, #1937). @omry I'll update once I create a PR for Lightning.

Update:
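For completeness, a hypothetical way to wire the patch in (the module and helper names are illustrative, not from the comment above): it must run before Lightning builds the DDP subprocess command, i.e. before Trainer.fit is called inside the Hydra task.

```python
import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig

# Hypothetical helper assumed to wrap the monkey patch shown above.
from my_project.patches import apply_hydra_ddp_patch


@hydra.main(config_path="configs", config_name="train", version_base="1.3")
def main(cfg: DictConfig) -> None:
    apply_hydra_ddp_patch()  # patch Lightning's _hydra_subprocess_cmd before any Trainer exists
    trainer = instantiate(cfg.trainer)
    module = instantiate(cfg.module)
    trainer.fit(module)


if __name__ == "__main__":
    main()
```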
I tried to solve this a while ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: Lightning-AI/pytorch-lightning#11617
Hey, thanks for the heads-up. I took a look at the code and the back-and-forths throughout version changes. I agree that "the current method of spawning using sys.argv is not robust". For now, my monkey patch works as a temporary fix for my project, but I'd like to contribute to a better solution if I can. We shall see if we get an update from the creator of Lightning-AI/pytorch-lightning#18175.
Any update? Is the issue still present?
Hello! Any hacky way of getting ddp and hydra multirun to work? Maybe exiting ddp processes in a particular way? Looking forward to this being solved! (hydra is awesome!)
Some issues I observed from trying out DDP + Hydra + PL (mostly on a single node with multiple GPUs).
By no means is this a comprehensive list of existing issues.
I want to use this for some initial discussions and see how we can improve from Hydra's side.
Hydra Single run
Sort of works, with rough edges. PL has to hack around the fact that Hydra changes the working dir at job run time (which can be avoided starting with Hydra 1.2):
https://github.com/PyTorchLightning/pytorch-lightning/blob/2d043857eef55eb327528b70c8b33c680e82fb7b/pytorch_lightning/strategies/launchers/subprocess_script.py#L140-L145
This is definitely not ideal:

- There's no guarantee that the jobs launched in different processes are identical, which I believe is an assumption DDP holds. One of the root causes is that Hydra's configs are resolved lazily, so a resolver like this would resolve differently when the same Hydra application command is run in a separate process (see the sketch after this list).
- For more context: running a simple PL + Hydra application in SINGLE RUN mode with the ddp strategy on 2 GPUs will yield outputs like this (notice that two sets of output dirs were created, which is not the expected behavior).
- The train.log contains logs from rank 0, while train_ddp_process_1.log has the logs from the rank 1 worker. (These log files are created by Hydra and can be configured via hydra/output.)
- IMO the ideal output structure would be: everything on the same node is recorded in the same outputs dir, and all ranks log to train.log.
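As a concrete illustration of the lazy-resolution point above (a sketch, not from the issue): a time-based resolver such as the built-in `now:` yields a different value each time a process composes and resolves the config, which is exactly how two DDP processes end up with two different output dirs.

```python
import time

from omegaconf import OmegaConf

# `now` mimics Hydra's built-in time resolver; each process resolves it independently.
OmegaConf.register_new_resolver("now", lambda fmt: time.strftime(fmt), replace=True)

cfg = OmegaConf.create({"output_dir": "outputs/${now:%H-%M-%S}"})

print(cfg.output_dir)  # resolved in the "rank 0" process
time.sleep(1)
print(cfg.output_dir)  # a worker resolving one second later gets a different directory
```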
To address this, I think we can make some improvements on both sides.

For Hydra: save the composed config as an extra yaml file in the output dir. The benefit would be that users can launch a job with the exact same configs as a previous run. This could potentially resolve "Reproducing a Hydra run from previous Hydra job configs" (#1805) as well.
For PL:

- We can add a version check for Hydra: if Hydra >= 1.2, we should urge users to run the Hydra app with hydra.job.chdir=False so that PL does not need to do any custom overrides (we can then get rid of this line).
- If we can provide the exact composed config from Hydra, PL can modify the Hydra application command to pass the composed config. This will ensure that all configurations are the same across all DDP workers. (A sketch of both ideas follows below.)
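A minimal sketch of both ideas, assuming only public APIs (hydra.__version__, OmegaConf.save, packaging); the function names and the output file name are illustrative, not a proposed API:

```python
import hydra
from omegaconf import DictConfig, OmegaConf
from packaging.version import Version


def save_composed_config(cfg: DictConfig, path: str = "config_composed.yaml") -> None:
    # Hydra-side idea: persist the exact composed config so a run (or a DDP worker)
    # can be relaunched from it later. The file name is an example.
    OmegaConf.save(cfg, path, resolve=False)


def supports_chdir_override() -> bool:
    # PL-side idea: a version check -- with Hydra >= 1.2 users can pass
    # `hydra.job.chdir=False`, so the launcher never has to undo a working-dir change.
    return Version(hydra.__version__) >= Version("1.2.0")
```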
Hydra Multi-run
Multirun is more problematic here, and @jgbos has provided more context in this PR.
I think one potential solution is: in PL, we detect that Hydra is in MULTIRUN mode (with the help of #394), then we override the PL subprocess command to launch the exact same job (with the help of #1805), as sketched below.
Launching multiple jobs will be taken care of by Hydra, and running the same job on multiple nodes will be taken care of by PL. This seems like a cleaner split.
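A rough sketch of what that override could look like (names are illustrative; the sweep-dir handling mirrors the monkey patch shared earlier in this thread, and is not a settled design):

```python
import sys

from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode


def ddp_subprocess_command(local_rank: int) -> list:
    """Illustrative only: re-launch the *same* composed job for a local DDP worker."""
    command = [sys.executable] + sys.argv + [f"hydra.job.name=train_ddp_process_{local_rank}"]
    hydra_cfg = HydraConfig.get()
    if hydra_cfg.mode == RunMode.MULTIRUN:
        # Point the worker at the existing sweep dir instead of letting it start a new sweep.
        command += [f'hydra.sweep.dir="{hydra_cfg.sweep.dir}"', "hydra.output_subdir=null"]
    else:
        command += [f'hydra.run.dir="{hydra_cfg.run.dir}"']
    return command
```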
List of related existing issues