Improving Hydra + PL + DDP experience #2070

Open
jieru-hu opened this issue Mar 4, 2022 · 19 comments

@jieru-hu
Contributor

jieru-hu commented Mar 4, 2022

Some issues I observed from trying out DDP + Hydra + PL (mostly on a single node with multiple GPUs).
By no means is this a comprehensive list of existing issues.
I want to use it for some initial discussion and to see how we can improve things from Hydra's side.

Hydra Single run

It sort of works, with rough edges. PL has to hack around the fact that Hydra changes the working dir
at job run time (which can be avoided starting with Hydra 1.2):
https://github.com/PyTorchLightning/pytorch-lightning/blob/2d043857eef55eb327528b70c8b33c680e82fb7b/pytorch_lightning/strategies/launchers/subprocess_script.py#L140-L145

This is definitely not ideal:
There's no guarantee that the jobs launched in different processes are identical - which I believe is an assumption
DDP holds. One of the root causes is that Hydra's configs are resolved lazily, so a resolver like this
would resolve differently when the same Hydra application command is run in a separate process.
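As a quick illustration of the lazy-resolution point, here is a minimal stand-in for a time-based resolver (not Hydra's actual now implementation):

from datetime import datetime

from omegaconf import OmegaConf

# Hydra ships a "now" resolver; this stand-in just mimics its shape.
OmegaConf.register_new_resolver("now", lambda pattern: datetime.now().strftime(pattern))

cfg = OmegaConf.create({"output_dir": "outputs/${now:%H-%M-%S}"})

# The interpolation is evaluated lazily, at access time, so a DDP subprocess
# that composes the same config a few seconds later resolves a different
# directory than the main process did.
print(cfg.output_dir)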

For more context:
running a simple PL + Hydra application in SINGLE RUN mode with the ddp strategy on 2 GPUs yields output like this
(notice that two sets of output dirs were created, which is not the expected behavior):

# the dir/logs created from the main process
outputs
└── 2022-03-04
    └── 11-39-58
        ├── .hydra
        │   ├── config.yaml
        │   ├── hydra.yaml
        │   └── overrides.yaml
        └── train.log
        
# The output dir/logs created from running the same Hydra commands in the subprocess for DDP
.hydra
├── config.yaml
├── hydra.yaml
└── overrides.yaml
train_ddp_process_1.log

train.log contains the logs from rank 0 while train_ddp_process_1.log has the logs from the rank 1 worker. (These log files
are created by Hydra, and can be configured via hydra/output.)

IMO the ideal output structure would be for everything on the same node to be recorded in the same outputs dir, with all ranks logging to
train.log.

To address this, I think we can make some improvements on both sides.
For Hydra:

  • we should save the final composed config of a job (fully resolved, with no interpolation needed). I don't think there's much downside to dumping
    an extra yaml file in the output dir. The benefit would be that users can launch a job with the exact same config as a previous run. This could potentially resolve Reproducing a Hydra run from previous Hydra job configs #1805 as well
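A rough sketch of what "saving the resolved config" could look like (not something Hydra does today; the helper name is made up):

from omegaconf import DictConfig, OmegaConf

def save_resolved_config(cfg: DictConfig, path: str) -> None:
    # Force every interpolation to resolve now and write plain YAML with no
    # ${...} references left, so a relaunch composes identical values.
    resolved = OmegaConf.create(OmegaConf.to_container(cfg, resolve=True))
    OmegaConf.save(config=resolved, f=path)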

For PL:
We can add a version check for Hydra - if Hydra >= 1.2, we should urge users to run the Hydra app with hydra.run.chdir=False
so that PL does not need to do any custom overrides (we can then get rid of this line).
If we can provide the exact composed config from Hydra, PL can modify the Hydra application command to pass in the composed config. This will ensure that the configuration is the same across all DDP workers.
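A minimal sketch of the version check mentioned above (the integration point in PL and the exact override to recommend are assumptions):

from packaging.version import Version

import hydra

def hydra_can_skip_chdir_workaround() -> bool:
    # Starting with Hydra 1.2 a job can run without changing the working
    # directory, so PL would not need its cwd-restoring hack and could
    # instead ask the user to disable the chdir behavior via an override.
    return Version(hydra.__version__) >= Version("1.2")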

Hydra Multi-run

Multirun is more problematic here, and @jgbos has provided more context in this PR

I think one potential solution is: in PL, we detect that Hydra is in MULTI RUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)

# get the .hydra/resolved.yaml config from hydra.runtime.output_dir
job_config = HydraConfig.get().runtime.output_dir + "/.hydra/resolved.yaml"
python train.py --config job_config
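A slightly more concrete sketch of that relaunch (resolved.yaml is the proposed file from the Hydra-side change above, not something Hydra writes today, and the command construction is an assumption):

import os
import sys

from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode

def ddp_subprocess_command() -> list:
    hc = HydraConfig.get()
    if hc.mode == RunMode.MULTIRUN:
        # Point the worker at the exact, already-resolved config of this job
        # so every DDP process composes identical values.
        config_dir = os.path.join(hc.runtime.output_dir, ".hydra")
        return [
            sys.executable, sys.argv[0],
            "--config-path", config_dir,
            "--config-name", "resolved",  # the proposed resolved.yaml
        ]
    # Single-run mode: reuse the original command line as PL does today.
    return [sys.executable] + sys.argv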

Launching multiple jobs will be taken care of by Hydra; running the same job on multiple nodes will be taken care of by PL. This seems like a cleaner split.

List of related existing issues

@jieru-hu
Contributor Author

jieru-hu commented Mar 4, 2022

@jgbos thank you for taking the time to improve Hydra + PL. Sharing some early thoughts here...

@Jasha10
Collaborator

Jasha10 commented Mar 5, 2022

I don't think there's much downside to dumping
an extra yaml file in the output dir.

One downside is that resolving could be expensive if the user employs custom resolvers that do heavy computation. See my comment here.

@jgbos
Contributor

jgbos commented Mar 5, 2022

I've been quite busy lately so I haven't had much time to address the PR. It does look like PL is updating their approach to allow for easy custom launching. Still, I think this should be even easier for Hydra users. Since we have the configs before execution, a user can create a pl_main.py in their repo with:

import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


def task(cfg: DictConfig) -> None:
    trainer = instantiate(cfg.trainer)
    module = instantiate(cfg.module)

    if cfg._ddp_fitting:
        trainer.fit(module)
    else:
        trainer.test(module)

@hydra.main(config_path=None, config_name="config")
def main(cfg):
    task(cfg)


if __name__ == "__main__":
    main()

and the PL command to launch each DDP subprocess should just be:

python -m my_repo.pl_main -cn config.yaml -cp <path to config.yaml> +_ddp_fitting=TrainerFn.FITTING hydra.run.dir=<cwd>

What are your thoughts on this?

@jieru-hu
Contributor Author

jieru-hu commented Mar 7, 2022

thanks @jgbos for taking a look :)

I'm not exactly sure what the suggestion is here (probably due to my lack of context in this area).
Could we chat on Zulip, or meet to discuss this?

@jieru-hu
Contributor Author

jieru-hu commented Mar 7, 2022

thanks @jgbos @Jasha10 for taking a look. I've added some more thoughts on MULTIRUN above. Please take a look when you get a chance.

@Jasha10
Collaborator

Jasha10 commented Mar 10, 2022

Let me cross-link to my comment here on @jgbos's PR.

@Jasha10
Collaborator

Jasha10 commented Mar 10, 2022

I think one potential solution is: in PL, we detect that Hydra is in MULTI RUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)

So #394 means that we will be able to detect multirun mode by inspecting HydraConfig.get(), right?

Overall the plan sounds good.

Does the following example make sense to you @jieru-hu?

  • User invokes python script.py abc/pqr=xyz foo=0,1 --multirun
  • This spawns two multirun jobs (one with foo=0 and one with foo=1). For this example, let's say the working directories for these jobs are ./multirun/.../0 and ./multirun/.../1
  • Job 0 should save a resolved copy of the job config at ./multirun/.../0/.hydra/resolved.yaml, and similarly for job 1.
  • In job 0, PL should launch the subprocess using some solution from #1805, pointing Hydra to the ./multirun/.../0/.hydra directory so that an identical config can be produced. PL does the same thing for Job 1, referencing ./multirun/.../1/.hydra.

@jgbos
Contributor

jgbos commented Mar 10, 2022

@jieru-hu apologies for the slow response. I've been swamped. Hopefully I'll have more time in the next week. Let me know if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra solution specifically. It would definitely be interesting to see what a Hydra solution would be.

@jieru-hu
Contributor Author

I think one potential solution is: in PL, we detect that Hydra is in MULTI RUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)

So #394 means that we will be able to detect multirun mode by inspecting HydraConfig.get(), right?

Overall the plan sounds good.

Does the following example make sense to you @jieru-hu?

  • User invokes python script.py abc/pqr=xyz foo=0,1 --multirun
  • This spawns two multirun jobs (one with foo=0 and one with foo=1). For this example, let's say the working directories for these jobs are ./multirun/.../0 and ./multirun/.../1
  • Job 0 should save a resolved copy of the job config at ./multirun/.../0/.hydra/resolved.yaml, and similarly for job 1.
  • In job 0, PL should launch the subprocess using some solution from #1805, pointing Hydra to the ./multirun/.../0/.hydra directory so that an identical config can be produced. PL does the same thing for Job 1, referencing ./multirun/.../1/.hydra.

exactly :)

@jieru-hu
Contributor Author

@jieru-hu apologies for the slow response. I've been swamped. Hopefully I'll have more time in the next week. Let me know if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra solution specifically. It would definitely be interesting to see what a Hydra solution would be.

sure, sounds good!

@jgbos
Contributor

jgbos commented Aug 16, 2022

So I'm trying to revive Lightning-AI/pytorch-lightning#11617. Where can I find more info on the result of #1805?

Edit: nvm, I found the re-run documentation.

@libokj

libokj commented Sep 9, 2023

As of 2023/09/09, running a sweep/multirun of multiple PL training jobs with DDP still results in the newly spawned DDP processes (train_ddp_process_1, etc.) creating multiple output_dirs for the same multirun, each with its own copy of multirun.yaml:

...
  job:
    name: train_ddp_process_1 # and there are copies of multirun.yaml corresponding to `train_ddp_process_2` and so on
    chdir: null
    override_dirname: sweep=ddp_test,tags=ddp_test,trainer.max_epochs=2
    id: ???
    num: ???
    config_name: train.yaml
    env_set: {}
    env_copy: []
...

my hydra config:

...
run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with '.' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}

job_logging:
  handlers:
    file:
      filename: ${hydra.runtime.output_dir}/job.log
...
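Side note for anyone copying this config: the eval resolver used in sweep.subdir is not built into Hydra; a typical registration (done somewhere before @hydra.main composes the config) looks like:

from omegaconf import OmegaConf

# custom resolver backing ${eval:...} interpolations in the config above
OmegaConf.register_new_resolver("eval", eval)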

@jieru-hu I'm willing to look into it. Can you give me some advice?

@libokj

libokj commented Sep 12, 2023


Turns out the problem is on the Lightning side (https://github.com/Lightning-AI/lightning/blob/1bee50b31d78c66f1182571a2c022356625a8fbb/src/lightning/fabric/strategies/launchers/subprocess_script.py#L175). Lightning fixes hydra.run.dir for all the subprocesses it launches for Hydra, but that setting is only used in single-run mode, not in sweep/multirun, which uses hydra.sweep.dir and hydra.sweep.subdir to configure output_dir. I think changing hydra.run.dir to hydra.runtime.output_dir can partially fix this issue, but that creates a minor problem: the subprocesses' multirun.yaml overwrites the multirun.yaml created by the main process.

I could create a PR for this. Do you have any other suggestions? @jieru-hu

@odelalleau
Collaborator

Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.

I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!

@libokj

libokj commented Sep 14, 2023

Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.

I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!

Thanks for the info. It would actually be a PR on the Lightning side. Right now my monkey-patching solution is as follows:

import sys

import lightning.fabric.strategies.launchers.subprocess_script
import lightning.pytorch.strategies.launchers.subprocess_script


def _hydra_subprocess_cmd(local_rank: int):
    """
    Monkey patching for lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd
    Temporarily fixes the problem of unnecessarily creating log folders for DDP subprocesses in Hydra multirun/sweep.
    """
    import __main__  # local import to avoid https://github.com/Lightning-AI/lightning/issues/15218
    from hydra.core.hydra_config import HydraConfig
    from hydra.types import RunMode
    from hydra.utils import get_original_cwd, to_absolute_path

    # when user is using hydra find the absolute path
    if __main__.__spec__ is None:  # pragma: no-cover
        command = [sys.executable, to_absolute_path(sys.argv[0])]
    else:
        command = [sys.executable, "-m", __main__.__spec__.name]

    command += sys.argv[1:] + [f"hydra.job.name=train_ddp_process_{local_rank}", "hydra.output_subdir=null"]

    cwd = get_original_cwd()
    config = HydraConfig.get()

    if config.mode == RunMode.RUN:
        command += [f'hydra.run.dir="{config.run.dir}"']
    elif config.mode == RunMode.MULTIRUN:
        command += [f'hydra.sweep.dir="{config.sweep.dir}"']

    print(command)

    return command, cwd


lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
lightning.pytorch.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd

This solution pretty much fixes the problem of DDP subprocesses creating unwanted job log folders when running a sweep/multirun with Lightning and DDP training, but one small problem remains, as I previously suspected: DDP subprocesses still create multirun.yaml, and the last subprocess's multirun.yaml will overwrite the multirun.yaml created by the main process. It would be nice to allow disabling saving multirun.yaml in subprocesses like DDP (see #2341, #1773, #1937). @omry

I'll update once I create a PR for Lightning.

Update:
Joined the discussion on an existing issue for Lightning: Lightning-AI/pytorch-lightning#18175.

@jgbos
Contributor

jgbos commented Sep 15, 2023

I tried to solve this a while ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: Lightning-AI/pytorch-lightning#11617

@libokj

libokj commented Sep 17, 2023

I tried to solve this awhile ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: Lightning-AI/lightning#11617

Hey, thanks for the heads-up. I took a look at the code and the back-and-forth across version changes. I agree that "the current method of spawning using sys.argv is not robust". For now, my monkey patch works as a temporary fix for my project, but I'd like to contribute to a better solution if I can. We shall see if we get an update from the creator of Lightning-AI/pytorch-lightning#18175.

@mfoglio

mfoglio commented Mar 13, 2024

Any update? Is the issue still present?

Kin-Zhang added a commit to KTH-RPL/SeFlow that referenced this issue Jul 24, 2024
@konstantinos-pitas-cognex

Hello! Any hacky way of getting DDP and Hydra multirun to work? Maybe exiting DDP processes in a particular way? Looking forward to this being solved! (Hydra is awesome!)
