Improving Hydra + PL + DDP experience #2070

Open
jieru-hu opened this issue Mar 4, 2022 · 19 comments

@jieru-hu
Contributor

jieru-hu commented Mar 4, 2022

Some issues I observed from trying out DDP + Hydra + PL (mostly on a single node with multiple GPUs).
By no means is this a comprehensive list of existing issues.
I want to use it for some initial discussion and to see how we can improve things from Hydra's side.

Hydra Single run

It sort of works, with rough edges. PL has to hack around the fact that Hydra changes the working dir
at job run time (which can be avoided starting with Hydra 1.2):
https://github.com/PyTorchLightning/pytorch-lightning/blob/2d043857eef55eb327528b70c8b33c680e82fb7b/pytorch_lightning/strategies/launchers/subprocess_script.py#L140-L145

This is definitely not ideal:
There's no guarantee that the jobs launched in different processes are identical - which I believe is an assumption
DDP holds. One of the root causes is that Hydra's configs are resolved lazily, so a resolver like this
would resolve differently when the same Hydra application command is run in a separate process.
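As a quick illustration of the lazy-resolution point, here is a minimal stand-in for a time-based resolver (not Hydra's actual now implementation):

from datetime import datetime

from omegaconf import OmegaConf

# Hydra ships a "now" resolver; this stand-in just mimics its shape.
OmegaConf.register_new_resolver("now", lambda pattern: datetime.now().strftime(pattern))

cfg = OmegaConf.create({"output_dir": "outputs/${now:%H-%M-%S}"})

# The interpolation is evaluated lazily, at access time, so a DDP subprocess
# that composes the same config a few seconds later resolves a different
# directory than the main process did.
print(cfg.output_dir)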

For more context:
running a simple PL + Hydra application in SINGLE RUN mode with the ddp strategy on 2 GPUs yields output like this
(notice that two sets of output dirs were created, which is not the expected behavior):

# the dir/logs created from the main process
outputs
└── 2022-03-04
    └── 11-39-58
        ├── .hydra
        │   ├── config.yaml
        │   ├── hydra.yaml
        │   └── overrides.yaml
        └── train.log
        
# The output dir/logs created from running the same Hydra commands in the subprocess for DDP
.hydra
├── config.yaml
├── hydra.yaml
└── overrides.yaml
train_ddp_process_1.log

train.log contains the logs from rank 0 while train_ddp_process_1.log has the logs from the rank 1 worker. (These log files
are created by Hydra, and can be configured via hydra/output.)

IMO the ideal output structure would be for everything on the same node to be recorded in the same outputs dir, with all ranks logging to
train.log.

To address this, I think we can make some improvements on both sides.
For Hydra:

  • we should save the final composed config of a job (fully resolved, with no interpolation needed). I don't think there's much downside to dumping
    an extra yaml file in the output dir. The benefit would be that users can launch a job with the exact same config as a previous run. This could potentially resolve Reproducing a Hydra run from previous Hydra job configs #1805 as well
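A rough sketch of what "saving the resolved config" could look like (not something Hydra does today; the helper name is made up):

from omegaconf import DictConfig, OmegaConf

def save_resolved_config(cfg: DictConfig, path: str) -> None:
    # Force every interpolation to resolve now and write plain YAML with no
    # ${...} references left, so a relaunch composes identical values.
    resolved = OmegaConf.create(OmegaConf.to_container(cfg, resolve=True))
    OmegaConf.save(config=resolved, f=path)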

For PL:
We can add a version check for Hydra - if Hydra >= 1.2, we should urge users to run the Hydra app with hydra.run.chdir=False
so that PL does not need to do any custom overrides (we can then get rid of this line).
If we can provide the exact composed config from Hydra, PL can modify the Hydra application command to pass in the composed config. This will ensure that the configuration is the same across all DDP workers.
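A minimal sketch of the version check mentioned above (the integration point in PL and the exact override to recommend are assumptions):

from packaging.version import Version

import hydra

def hydra_can_skip_chdir_workaround() -> bool:
    # Starting with Hydra 1.2 a job can run without changing the working
    # directory, so PL would not need its cwd-restoring hack and could
    # instead ask the user to disable the chdir behavior via an override.
    return Version(hydra.__version__) >= Version("1.2")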

Hydra Multi-run

Multirun is more problematic here, and @jgbos has provided more context in this PR

I think one potential solution is: in PL, we detect that Hydra is in MULTI RUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)

# get the .hydra/resolved.yaml config from hydra.runtime.output_dir
job_config = HydraConfig.get().runtime.output_dir + "/.hydra/resolved.yaml"
python train.py --config job_config
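A slightly more concrete sketch of that relaunch (resolved.yaml is the proposed file from the Hydra-side change above, not something Hydra writes today, and the command construction is an assumption):

import os
import sys

from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode

def ddp_subprocess_command() -> list:
    hc = HydraConfig.get()
    if hc.mode == RunMode.MULTIRUN:
        # Point the worker at the exact, already-resolved config of this job
        # so every DDP process composes identical values.
        config_dir = os.path.join(hc.runtime.output_dir, ".hydra")
        return [
            sys.executable, sys.argv[0],
            "--config-path", config_dir,
            "--config-name", "resolved",  # the proposed resolved.yaml
        ]
    # Single-run mode: reuse the original command line as PL does today.
    return [sys.executable] + sys.argv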

Launching multiple jobs will be taken care of by Hydra; running the same job on multiple nodes will be taken care of by PL. This seems like a cleaner split.

List of related existing issues

@jieru-hu
Contributor Author

jieru-hu commented Mar 4, 2022

@jgbos thank you for taking the time to improve Hydra + PL. Sharing some early thoughts here...

@Jasha10
Collaborator

Jasha10 commented Mar 5, 2022

I don't think there's much downside to dumping
an extra yaml file in the output dir.

One downside is that resolving could be expensive if the user employs custom resolvers that do heavy computation. See my comment here.

@jgbos
Contributor

jgbos commented Mar 5, 2022

I've been quite busy lately so I haven't had much time to address the PR. It does look like PL is updating their approach to allow for easy custom launching. Still, I think this should be even easier for Hydra users. Since we have the configs before execution, a user can create a pl_main.py in their repo with:

import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


def task(cfg: DictConfig) -> None:
    trainer = instantiate(cfg.trainer)
    module = instantiate(cfg.module)

    if cfg._ddp_fitting:
        trainer.fit(module)
    else:
        trainer.test(module)

@hydra.main(config_path=None, config_name="config")
def main(cfg):
    task(cfg)


if __name__ == "__main__":
    main()

and the PL command to launch each DDP subprocess should just be:

python -m my_repo.pl_main -cn config.yaml -cp <path to config.yaml> +_ddp_fitting=TrainerFn.FITTING hydra.run.dir=<cwd>

What are your thoughts on this?

@jieru-hu
Contributor Author

jieru-hu commented Mar 7, 2022

thanks @jgbos for taking a look :)

I'm not exactly sure what the suggestion is here (probably due to my lack of context in this area).
Could we chat on Zulip, or meet to discuss this?

@jieru-hu
Contributor Author

jieru-hu commented Mar 7, 2022

thanks @jgbos @Jasha10 for taking a look. I've added some more thoughts on MULTIRUN above. Please take a look when you get a chance.

@Jasha10
Collaborator

Jasha10 commented Mar 10, 2022

Let me cross-link to my comment here on @jgbos's PR.

@Jasha10
Collaborator

Jasha10 commented Mar 10, 2022

I think one potential solution is: in PL, we detect that Hydra is in MULTI RUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)

So #394 means that we will be able to detect multirun mode by inspecting HydraConfig.get(), right?

Overall the plan sounds good.

Does the following example make sense to you @jieru-hu?

  • User invokes python script.py abc/pqr=xyz foo=0,1 --multirun
  • This spawns two multirun jobs (one with foo=0 and one with foo=1). For this example, let's say the working directories for these jobs are ./multirun/.../0 and ./multirun/.../1
  • Job 0 should save a resolved copy of the job config at ./multirun/.../0/.hydra/resolved.yaml, and similarly for job 1.
  • In job 0, PL should launch the subprocess using some solution from #1805, pointing Hydra to the ./multirun/.../0/.hydra directory so that an identical config can be produced. PL does the same thing for Job 1, referencing ./multirun/.../1/.hydra.

@jgbos
Contributor

jgbos commented Mar 10, 2022

@jieru-hu apologies for the slow response. I've been swamped. Hopefully I'll have more time in the next week. Let me know if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra solution specifically. It would definitely be interesting to see what a Hydra solution would be.

@jieru-hu
Contributor Author

I think one potential solution is: in PL, we detect that Hydra is in MULTI RUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)

So #394 means that we will be able to detect multirun mode by inspecting HydraConfig.get(), right?

Overall the plan sounds good.

Does the following example make sense to you @jieru-hu?

  • User invokes python script.py abc/pqr=xyz foo=0,1 --multirun
  • This spawns two multirun jobs (one with foo=0 and one with foo=1). For this example, let's say the working directories for these jobs are ./multirun/.../0 and ./multirun/.../1
  • Job 0 should save a resolved copy of the job config at ./multirun/.../0/.hydra/resolved.yaml, and similarly for job 1.
  • In job 0, PL should launch the subprocess using some solution from #1805, pointing Hydra to the ./multirun/.../0/.hydra directory so that an identical config can be produced. PL does the same thing for Job 1, referencing ./multirun/.../1/.hydra.

exactly :)

@jieru-hu
Contributor Author

@jieru-hu apologies for the slow response. I've been swamped. Hopefully I'll have more time in the next week. Let me know if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra solution specifically. It would definitely be interesting to see what a Hydra solution would be.

sure, sounds good!

@jgbos
Contributor

jgbos commented Aug 16, 2022

So I'm trying to revive Lightning-AI/pytorch-lightning#11617. Where can I find more info on the result of #1805?

Edit: nvm, I found the re-run documentation.

@libokj

libokj commented Sep 9, 2023

As of 2023/09/09, running a sweep/multirun of multiple PL training jobs with DDP still results in the newly spawned DDP processes (train_ddp_process_1, etc.) creating multiple output_dirs for the same multirun, each with its own copy of multirun.yaml:

...
  job:
    name: train_ddp_process_1 # and there are copies of multirun.yaml corresponding to `train_ddp_process_2` and so on
    chdir: null
    override_dirname: sweep=ddp_test,tags=ddp_test,trainer.max_epochs=2
    id: ???
    num: ???
    config_name: train.yaml
    env_set: {}
    env_copy: []
...

my hydra config:

...
run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with '.' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}

job_logging:
  handlers:
    file:
      filename: ${hydra.runtime.output_dir}/job.log
...
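Side note for anyone copying this config: the eval resolver used in sweep.subdir is not built into Hydra; a typical registration (done somewhere before @hydra.main composes the config) looks like:

from omegaconf import OmegaConf

# custom resolver backing ${eval:...} interpolations in the config above
OmegaConf.register_new_resolver("eval", eval)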

@jieru-hu I'm willing to look into it. Can you give me some advice?

@libokj

libokj commented Sep 12, 2023


Turns out the problem is on the Lightning side (https://github.com/Lightning-AI/lightning/blob/1bee50b31d78c66f1182571a2c022356625a8fbb/src/lightning/fabric/strategies/launchers/subprocess_script.py#L175). Lightning fixes hydra.run.dir for all the subprocesses it launches for Hydra, but that setting is only used in single-run mode, not in sweep/multirun, which uses hydra.sweep.dir and hydra.sweep.subdir to configure output_dir. I think changing hydra.run.dir to hydra.runtime.output_dir can partially fix this issue, but that creates a minor problem: the subprocesses' multirun.yaml overwrites the multirun.yaml created by the main process.

I could create a PR for this. Do you have any other suggestions? @jieru-hu

@odelalleau
Collaborator

Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.

I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!

@libokj

libokj commented Sep 14, 2023

Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.

I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!

Thanks for the info. It would actually be a PR on the Lightning side. Right now my monkey-patching solution is as follows:

import sys

import lightning.fabric.strategies.launchers.subprocess_script
import lightning.pytorch.strategies.launchers.subprocess_script


def _hydra_subprocess_cmd(local_rank: int):
    """
    Monkey patching for lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd
    Temporarily fixes the problem of unnecessarily creating log folders for DDP subprocesses in Hydra multirun/sweep.
    """
    import __main__  # local import to avoid https://github.com/Lightning-AI/lightning/issues/15218
    from hydra.core.hydra_config import HydraConfig
    from hydra.types import RunMode
    from hydra.utils import get_original_cwd, to_absolute_path

    # when user is using hydra find the absolute path
    if __main__.__spec__ is None:  # pragma: no-cover
        command = [sys.executable, to_absolute_path(sys.argv[0])]
    else:
        command = [sys.executable, "-m", __main__.__spec__.name]

    command += sys.argv[1:] + [f"hydra.job.name=train_ddp_process_{local_rank}", "hydra.output_subdir=null"]

    cwd = get_original_cwd()
    config = HydraConfig.get()

    if config.mode == RunMode.RUN:
        command += [f'hydra.run.dir="{config.run.dir}"']
    elif config.mode == RunMode.MULTIRUN:
        command += [f'hydra.sweep.dir="{config.sweep.dir}"']

    print(command)

    return command, cwd


lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
lightning.pytorch.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd

This solution pretty much fixes the problem of DDP subprocesses creating unwanted job log folders when running a sweep/multirun with Lightning and DDP training, but one small problem remains, as I previously suspected: DDP subprocesses still create multirun.yaml, and the last subprocess's multirun.yaml will overwrite the multirun.yaml created by the main process. It would be nice to allow disabling saving multirun.yaml in subprocesses like DDP (see #2341, #1773, #1937). @omry

I'll update once I create a PR for Lightning.

Update:
Joined the discussion on an existing issue for Lightning: Lightning-AI/pytorch-lightning#18175.

@jgbos
Contributor

jgbos commented Sep 15, 2023

I tried to solve this a while ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: Lightning-AI/pytorch-lightning#11617

@libokj

libokj commented Sep 17, 2023

I tried to solve this awhile ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: Lightning-AI/lightning#11617

Hey, thanks for the heads-up. I took a look at the code and the back-and-forth across version changes. I agree that "the current method of spawning using sys.argv is not robust". For now, my monkey patch works as a temporary fix for my project, but I'd like to contribute to a better solution if I can. We shall see if we get an update from the creator of Lightning-AI/pytorch-lightning#18175.

@mfoglio

mfoglio commented Mar 13, 2024

Any update? Is the issue still present?

Kin-Zhang added a commit to KTH-RPL/SeFlow that referenced this issue Jul 24, 2024
@konstantinos-pitas-cognex

Hello! Any hacky way of getting DDP and Hydra multirun to work? Maybe exiting DDP processes in a particular way? Looking forward to this being solved! (Hydra is awesome!)
