Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIR] <Part 4> Lightning Trainer Release tests + docstring sample test #33323

Merged
merged 23 commits into from
Mar 24, 2023

Conversation

woshiyyya
Copy link
Member

@woshiyyya woshiyyya commented Mar 15, 2023

Why are these changes needed?

This PR aims to add release tests and docstring tests for LightningTrainer.

The release tests:

  • Cluster setting: 3 x g4dn.4xlarge (48 CPUs + 3 GPUs)
  • lightning_gpu_train_3x16_3x1:
    • Test end-to-end training on MNIST dataset
    • DDP training with 3 GPUs
  • lightning_gpu_tune_3x16_3x1:
    • Test LightningTrainer integration with Ray Tune
    • 4 Trails + PBT scheduler

Release Test Job succeed:

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya changed the title [AIR] Lightning Trainer Release tests + docstring sample test [AIR] <Part 4> Lightning Trainer Release tests + docstring sample test Mar 15, 2023
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya marked this pull request as ready for review March 15, 2023 17:07
@woshiyyya woshiyyya requested a review from amogkam March 15, 2023 17:07
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya force-pushed the air/lightning_release_test branch from 93b0306 to f53031f Compare March 15, 2023 19:21
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
…ease_test

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya force-pushed the air/lightning_release_test branch from 20a3a2d to 16a33d4 Compare March 16, 2023 00:29
Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Can we also remove the existing ray lightning release test if it is being subsumed by the new release test?

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Copy link
Contributor

@justinvyu justinvyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Have a few suggestions.

A few other things:

  • We can combine test_tuner and test_trainer. Just switch between the two with a command line arg.
  • For test_tuner, can we also use a scheduler like PBT? I know we don't want to add too much to one test, but this way we'd test out the PTL checkpointing/restoration logic in the release test.

python/ray/train/lightning/lightning_trainer.py Outdated Show resolved Hide resolved
release/lightning_tests/compute_tpl.yaml Outdated Show resolved Hide resolved
release/lightning_tests/compute_tpl.yaml Show resolved Hide resolved
release/release_tests.yaml Show resolved Hide resolved
release/release_tests.yaml Show resolved Hide resolved
release/release_tests.yaml Outdated Show resolved Hide resolved
release/lightning_tests/workloads/test_tuner.py Outdated Show resolved Hide resolved
Copy link
Member

@gjoliver gjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks solid. can you respond to all of Justin's comments?
also do you know, or maybe you can ask Kai quickly, if we can release test a script in our doc/examples folder?
that way, we don't have to duplicate code.

@justinvyu
Copy link
Contributor

@gjoliver I think it has to be in release folder. But, you can create a relative symlink to in the doc/examples folder.

Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya requested a review from rkooo567 as a code owner March 23, 2023 22:51
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
@woshiyyya woshiyyya removed the request for review from rkooo567 March 24, 2023 01:17
@gjoliver gjoliver merged commit 08699c5 into ray-project:master Mar 24, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ray-project#33323)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
ray-project#33323)

Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants