
Add option to launch distributed runs locally with >1 GPU #733

Merged
9 commits merged into main from use_distributed_for_local on Jun 21, 2024

Conversation

@rayg1234 (Collaborator) commented Jun 20, 2024

Add an option to launch distributed runs locally with >1 GPU. This is useful for testing parallel algorithms locally. It uses the torch elastic API, which just spawns Python multiprocesses under the hood.

This is equivalent to calling our application with torchrun (i.e. `torchrun fairchem ...`), but it makes the interface cleaner so we don't need to work with two launchers.
Note: torchrun itself just calls the elastic launch API under the hood; a minimal sketch of that API is shown below.
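For reference, here is a minimal sketch (not this PR's actual code) of launching a local multi-GPU run through torch's elastic launch API. The `runner_fn` entry point, the config dict, and the rendezvous settings are illustrative assumptions:

```python
# Minimal sketch of torch's elastic launch API, which torchrun also uses.
# runner_fn and the config dict are placeholders, not fairchem's real entry point.
import os
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def runner_fn(config: dict) -> None:
    # Each spawned process sees torchrun-style env vars (RANK, LOCAL_RANK,
    # WORLD_SIZE); a real runner would call dist.init_process_group() here.
    print(f"rank={os.environ['RANK']} world_size={os.environ['WORLD_SIZE']}", config)


if __name__ == "__main__":
    num_gpus = 2
    launch_config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=num_gpus,      # one process per local GPU
        run_id="local_test",
        rdzv_backend="c10d",
        rdzv_endpoint="localhost:0",  # port 0 -> pick a free port for rendezvous
        max_restarts=0,
    )
    # elastic_launch spawns num_gpus Python processes, exactly like torchrun would.
    elastic_launch(launch_config, runner_fn)({"amp": True})
```

Because the rendezvous endpoint uses port 0, each run grabs a free local port, so separate local runs don't collide.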

There's a bug where LMDBs cannot be pickled (which multiprocessing requires); this can be worked around by setting num_workers to 0, which is fine for local-mode testing (see the sketch below).
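To illustrate why that workaround is safe, here is a tiny sketch with a placeholder dataset (not fairchem code): with num_workers=0 the DataLoader iterates the dataset in the main process, so the unpicklable LMDB handles never need to be serialized and sent to worker subprocesses.

```python
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Stand-in for an LMDB-backed dataset; real LMDB handles cannot be pickled."""

    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int) -> int:
        return idx


# num_workers=0 keeps data loading in the main process, so the dataset object
# is never pickled for worker subprocesses.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=0)
```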

Examples:

To run locally on 2 GPUs with distributed training:
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp --distributed --num-gpus=2

To run locally without distributed training:
fairchem --debug --mode train --identifier gp_test --config-yml src/fairchem/experimental/rgao/configs/equiformer_v2_N\@8_L\@4_M\@2_31M.yml --amp

Testing:

Added a simple test in test_cli.py for now that mocks the runner; tests for actual simple runs should be added later.
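As a rough illustration, a test along these lines might look like the following. The `Runner` name and the `main()` entry point are assumptions (only the `fairchem/core/_cli.py` path appears in the coverage report below), so this is a sketch of the approach rather than the PR's actual test:

```python
import sys
from unittest.mock import patch


def test_cli_local_multi_gpu():
    # Hypothetical: assumes fairchem.core._cli exposes main() and a Runner object
    # that kicks off training; neither name is confirmed by this PR's diff.
    from fairchem.core._cli import main

    argv = [
        "fairchem", "--debug", "--mode", "train",
        "--config-yml", "configs/example.yml",
        "--distributed", "--num-gpus=2",
    ]
    # Patch the runner so no real training (or GPU) is needed.
    with patch("fairchem.core._cli.Runner") as mock_runner, \
         patch.object(sys, "argv", argv):
        main()
        mock_runner.assert_called()
```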

@rayg1234 rayg1234 requested review from misko and anuroopsriram June 20, 2024 16:49

codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 76.47059% with 4 lines in your changes missing coverage. Please review.

Files                                    Coverage Δ
src/fairchem/core/_cli.py                64.51% <86.66%> (+18.68%) ⬆️
src/fairchem/core/common/distutils.py    30.00% <0.00%> (ø)

@misko (Collaborator) commented Jun 21, 2024

This is awesome! exactly what we need 🤩 LGTM!

@rayg1234 rayg1234 added this pull request to the merge queue Jun 21, 2024
Merged via the queue into main with commit fa23491 Jun 21, 2024
5 checks passed
@rayg1234 rayg1234 deleted the use_distributed_for_local branch June 21, 2024 22:39
levineds pushed a commit that referenced this pull request Jul 11, 2024
* use torch elastic api to launch multiple gpu local mode

* fix distutils.py

* add test

* lint

* lint

* basic test no distributed

* update test

---------

Co-authored-by: Luis Barroso-Luque <lbluque@users.noreply.github.com>
@rayg1234 rayg1234 added the enhancement New feature or request label Aug 13, 2024
misko pushed a commit that referenced this pull request Jan 17, 2025 (same commit message as above; Former-commit-id: 0f1ddd48a4bc90a22c43922a4150ee74cfd4fb95)
beomseok-kang pushed a commit to beomseok-kang/fairchem that referenced this pull request Jan 27, 2025 (same commit message as above; Former-commit-id: 8a984910fe5738eb11cdf7d3dcee26c76a3d5763)
Labels: enhancement (New feature or request)
Projects: None yet
3 participants