
Support multi-run with hydra + DDP #18175

Draft
wants to merge 4 commits into base: master

Conversation


@nisheethlahoti nisheethlahoti commented Jul 27, 2023

Support multi-run with hydra + DDP (when output_subdir is non-None).

Implementation of the idea from #11617 in a restricted setting (when the saved config is actually present).

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@github-actions github-actions bot added fabric lightning.fabric.Fabric pl Generic label for PyTorch Lightning package labels Jul 27, 2023
@nisheethlahoti nisheethlahoti changed the title Hydra + DDP improvements Support multi-run with hydra + DDP Jul 28, 2023
@nisheethlahoti nisheethlahoti marked this pull request as ready for review July 31, 2023 11:14
@nisheethlahoti nisheethlahoti marked this pull request as draft July 31, 2023 11:45
cwd = get_original_cwd()
rundir = f'"{HydraConfig.get().run.dir}"'
# Set output_subdir null since we don't want different subprocesses trying to write to config.yaml
Contributor

I think this comment is still useful, we could move it down to the corresponding line.

raise RuntimeError("DDP with multirun requires saved config file")
else:  # Use saved config for new run
hydra_subdir = rundir / hydra_cfg.output_subdir
command += ["-cp", str(hydra_subdir), "-cn", "config.yaml"]
Contributor

Is the name of the file always guaranteed to be config.yaml?
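For context, Hydra writes the composed job config under the run directory's output_subdir (default ".hydra") with a fixed file name in current Hydra versions; a typical layout (timestamps illustrative):

```
outputs/2023-07-27/12-00-00/      # hydra.run.dir
└── .hydra/                       # hydra.output_subdir (default ".hydra")
    ├── config.yaml               # the composed job config
    ├── hydra.yaml                # Hydra's own config
    └── overrides.yaml            # command-line overrides
```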

@awaelchli awaelchli added feature Is an improvement or enhancement 3rd party Related to a 3rd-party labels Jul 31, 2023
@awaelchli awaelchli added this to the 2.1 milestone Jul 31, 2023
@awaelchli awaelchli added the community This PR is from the community label Jul 31, 2023
@awaelchli
Contributor

Hey @nisheethlahoti
Would you like to continue this work? No pressure, just want to make sure the PR doesn't go stale :)

@nisheethlahoti
Contributor Author

@awaelchli I realized that the implementation above isn't entirely correct (it doesn't properly handle user overrides that start with hydra.). I know the basic sketch of a correct implementation, but don't have time to get it done right now. Maybe in a couple weeks.


libokj commented Sep 14, 2023

@nisheethlahoti @awaelchli I ran into a problem where a sweep/multirun of multiple PL training jobs with DDP resulted in the spawned DDP processes (train_ddp_process_1, etc.) creating multiple output folders for the same multirun, each containing a copy of multirun.yaml and DDP subprocess log files. This is partly because my hydra.sweep.dir is created dynamically using the now resolver, and the subprocesses are of course not launched at the same time as the main one. Another issue is that lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd doesn't take RunMode into account, so in a multirun hydra.sweep.dir is not used to override a DDP subprocess.

I came up with my own solution. This is the monkey patch I use:

import sys


def _hydra_subprocess_cmd(local_rank: int):
    """
    Monkey patch for lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd.
    Temporarily fixes the problem of unnecessarily creating log folders for DDP subprocesses in Hydra multirun/sweep.
    """
    import __main__  # local import to avoid https://github.com/Lightning-AI/lightning/issues/15218
    from hydra.core.hydra_config import HydraConfig
    from hydra.types import RunMode
    from hydra.utils import get_original_cwd, to_absolute_path

    # When the user is using Hydra, find the absolute path to the script
    if __main__.__spec__ is None:  # pragma: no-cover
        command = [sys.executable, to_absolute_path(sys.argv[0])]
    else:
        command = [sys.executable, "-m", __main__.__spec__.name]

    command += sys.argv[1:] + [f"hydra.job.name=train_ddp_process_{local_rank}", "hydra.output_subdir=null"]

    cwd = get_original_cwd()
    config = HydraConfig.get()

    # Pass the already-resolved output directory to the subprocess so that
    # dynamic resolvers (e.g. ${now:...}) are not re-resolved there
    if config.mode == RunMode.RUN:
        command += [f'hydra.run.dir="{config.run.dir}"']
    elif config.mode == RunMode.MULTIRUN:
        command += [f'hydra.sweep.dir="{config.sweep.dir}"']

    return command, cwd

I think in your solution, using hydra.runtime.output_dir at https://github.com/Lightning-AI/lightning/blob/8d0dbe83b15cc1be970af626015d72459e66feed/src/lightning/fabric/strategies/launchers/subprocess_script.py#L158-L168 may not be a good idea if hydra.run.dir is created dynamically using resolvers like now or other custom resolvers defined by the user. It doesn't seem necessary to pass config.yaml to the subprocess as long as you take hydra.sweep.dir into consideration; maybe I'm missing something? The only minor problem my solution cannot solve is that the last subprocess to run creates a multirun.yaml that overwrites the one created by the main process, but that's a Hydra problem (there is no option to disable multirun.yaml creation in Hydra yet). You can see my comments in
facebookresearch/hydra#2070 for more details.
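The resolver concern can be illustrated in isolation: a time-based template such as one using ${now:...} yields a different value each time it is evaluated, so a worker that re-resolves the template gets a different directory than the main process. The function below is a stand-in for illustration; the real resolver is registered by Hydra:

```python
import time
from datetime import datetime


def resolve_run_dir(template: str = "outputs/%Y-%m-%d_%H-%M-%S.%f") -> str:
    # Stand-in for resolving a dir template containing ${now:...}:
    # the result depends on *when* resolution happens.
    return datetime.now().strftime(template)


main_dir = resolve_run_dir()
time.sleep(0.001)  # the DDP worker starts strictly later
worker_dir = resolve_run_dir()
# The two resolutions differ, which is why the main process should pass
# its literal, already-resolved directory to workers on the command line
# (e.g. hydra.sweep.dir="<resolved dir>") instead of the raw template.
```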


gitguardian bot commented Jan 16, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id | Secret | Commit | Filename
- | Generic High Entropy Secret | 78fa3af | tests/tests_app/utilities/test_login.py
- | Base64 Basic Authentication | 78fa3af | tests/tests_app/utilities/test_login.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@awaelchli awaelchli modified the milestones: 2.2, 2.3 Feb 3, 2024
@awaelchli awaelchli modified the milestones: 2.3, future Jun 2, 2024