
Add option to launch distributed runs locally with >1 GPU #733

Merged
9 commits merged into main from use_distributed_for_local on Jun 21, 2024

Conversation

@rayg1234 (Collaborator) commented Jun 20, 2024

Add an option to launch distributed runs locally with >1 GPU. This is useful for testing parallel algorithms locally. It uses the torch elastic API, which just spawns Python multiprocesses under the hood.

This is equivalent to calling our application with torchrun (i.e. `torchrun fairchem ...`), but it makes the interface cleaner so we don't need to work with two launchers.
Note: torchrun itself just calls the elastic launch API under the hood; a minimal sketch of that API is shown below.
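For reference, here is a minimal sketch (not this PR's actual code) of launching a local multi-GPU run through torch's elastic launch API. The `runner_fn` entry point, the config dict, and the rendezvous settings are illustrative assumptions:

```python
# Minimal sketch of torch's elastic launch API, which torchrun also uses.
# runner_fn and the config dict are placeholders, not fairchem's real entry point.
import os
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def runner_fn(config: dict) -> None:
    # Each spawned process sees torchrun-style env vars (RANK, LOCAL_RANK,
    # WORLD_SIZE); a real runner would call dist.init_process_group() here.
    print(f"rank={os.environ['RANK']} world_size={os.environ['WORLD_SIZE']}", config)


if __name__ == "__main__":
    num_gpus = 2
    launch_config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=num_gpus,      # one process per local GPU
        run_id="local_test",
        rdzv_backend="c10d",
        rdzv_endpoint="localhost:0",  # port 0 -> pick a free port for rendezvous
        max_restarts=0,
    )
    # elastic_launch spawns num_gpus Python processes, exactly like torchrun would.
    elastic_launch(launch_config, runner_fn)({"amp": True})
```

Because the rendezvous endpoint uses port 0, each run grabs a free local port, so separate local runs don't collide.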

There's a bug where LMDBs cannot be pickled (which multiprocessing requires); this can be worked around by setting num_workers to 0, which is fine for local-mode testing (see the sketch below).
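To illustrate why that workaround is safe, here is a tiny sketch with a placeholder dataset (not fairchem code): with num_workers=0 the DataLoader iterates the dataset in the main process, so the unpicklable LMDB handles never need to be serialized and sent to worker subprocesses.

```python
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for an LMDB-backed dataset; real LMDB handles cannot be pickled."""

    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int) -> int:
        return idx


# num_workers=0 keeps data loading in the main process, so the dataset object
# is never pickled for worker subprocesses.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=0)
```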

Examples:

To run locally on 2 GPUs with distributed training:
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp --distributed --num-gpus=2

To run locally without distributed training:
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp

Testing:

Added a simple test in test_cli.py for now that mocks the runner; tests for actual simple runs should be added later.
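As a rough illustration, a test along these lines might look like the following. The `Runner` name and the `main()` entry point are assumptions (only the `fairchem/core/_cli.py` path appears in the coverage report below), so this is a sketch of the approach rather than the PR's actual test:

```python
import sys
from unittest.mock import patch


def test_cli_local_multi_gpu():
    # Hypothetical: assumes fairchem.core._cli exposes main() and a Runner object
    # that kicks off training; neither name is confirmed by this PR's diff.
    from fairchem.core._cli import main

    argv = [
        "fairchem", "--debug", "--mode", "train",
        "--config-yml", "configs/example.yml",
        "--distributed", "--num-gpus=2",
    ]
    # Patch the runner so no real training (or GPU) is needed.
    with patch("fairchem.core._cli.Runner") as mock_runner, \
         patch.object(sys, "argv", argv):
        main()
        mock_runner.assert_called()
```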

@rayg1234 rayg1234 requested review from misko and anuroopsriram June 20, 2024 16:49

codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 4 lines in your changes missing coverage. Please review.

Files                                    Coverage Δ
src/fairchem/core/_cli.py                64.51% <86.66%> (+18.68%) ⬆️
src/fairchem/core/common/distutils.py    30.00% <0.00%> (ø)

@misko (Collaborator) commented Jun 21, 2024

This is awesome! exactly what we need 🤩 LGTM!

@rayg1234 rayg1234 added this pull request to the merge queue Jun 21, 2024
Merged via the queue into main with commit fa23491 Jun 21, 2024
5 checks passed
@rayg1234 rayg1234 deleted the use_distributed_for_local branch June 21, 2024 22:39
levineds pushed a commit that referenced this pull request Jul 11, 2024
* use torch elastic api to launch multiple gpu local mode

* fix distutils.py

* add test

* lint

* lint

* basic test no distributed

* update test

---------

Co-authored-by: Luis Barroso-Luque <lbluque@users.noreply.github.com>
@rayg1234 rayg1234 added the enhancement New feature or request label Aug 13, 2024
misko pushed a commit that referenced this pull request Jan 17, 2025 (same commit message as above; Former-commit-id: 0f1ddd48a4bc90a22c43922a4150ee74cfd4fb95)
beomseok-kang pushed a commit to beomseok-kang/fairchem that referenced this pull request Jan 27, 2025 (same commit message as above; Former-commit-id: 8a984910fe5738eb11cdf7d3dcee26c76a3d5763)
Labels: enhancement (New feature or request)
Projects: None yet
3 participants