
[release/air] Fix release test timeout for tune_scalability_network_overhead #36360

Merged

Conversation


@justinvyu justinvyu commented Jun 13, 2023

Why are these changes needed?

There are 2 versions of the test: a smoke test version with 20 nodes/trials and a full version with 100 nodes/trials. The smoke test version timed out recently.

The two test setups differ slightly:

  • Smoke test runs with m5a.large (2 cpus) x 20 nodes.
  • Full version runs with m5a.4xlarge (16 cpus) head node + 99 x m5a.large worker nodes
  • The smoke test takes longer than the full run due to syncing overhead at the end caused by the smaller head node instance (since all the syncing is going there).

This PR bumps the smoke test's head node instance size and forces trials to run on worker nodes -- in this test, the head node is purely meant to handle syncing. This also fixes a pre-existing problem where the full 100-trial release test would schedule 8 trials on the head node rather than utilizing every node in the cluster.
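One common way to keep trials off the head node is to have it advertise 0 CPUs, so the scheduler can only place CPU-requesting trials on workers. A hedged sketch of a Ray cluster-launcher config (instance types taken from this PR; node type names and worker counts illustrative, not the actual release test config):

```yaml
# Illustrative cluster-launcher config: the head node advertises CPU: 0,
# so trials that request CPUs can only land on worker nodes; the head node
# is left to run the driver and handle result syncing.
available_node_types:
  head_node:
    node_config:
      InstanceType: m5a.4xlarge   # bumped head node size, as in this PR
    resources: {"CPU": 0}         # keep trials off the head node
  worker_node:
    node_config:
      InstanceType: m5a.large
    min_workers: 20
    max_workers: 20
head_node_type: head_node
```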

See #36346 for more context.

Questions

Some tests I ran (10 `session.report` calls, spaced 3 seconds apart → 30 seconds ideal run time):

  • With a m5a.large head node (2 cpus):
    • Total run time: 69.03 seconds (35.89 seconds for the tuning loop) → over 30 seconds of final head node syncing
  • With a m5a.4xlarge head node (16 cpus):
    • Total run time: 41.94 seconds (35.37 seconds for the tuning loop) → ~5 seconds of final head node syncing

Is this expected / ok performance for the super tiny head node? This most likely does not reflect real usage.
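The trailing sync cost in the numbers above can be read off as the gap between total run time and tuning-loop time. A tiny hypothetical helper, just restating those measurements:

```python
def sync_overhead(total_s: float, tuning_loop_s: float) -> float:
    """Trailing head-node sync time: total wall clock minus tuning loop."""
    return round(total_s - tuning_loop_s, 2)

# m5a.large (2 CPU) head node: ~33 s spent syncing after the tuning loop
print(sync_overhead(69.03, 35.89))  # 33.14
# m5a.4xlarge (16 CPU) head node: ~6.6 s
print(sync_overhead(41.94, 35.37))  # 6.57
```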

Related issue number

Closes #36346

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu commented Jun 13, 2023

Release tests kicked off here: https://buildkite.com/ray-project/release-tests-pr/builds/42034

@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 15, 2023

@krfricke krfricke left a comment


Thanks!

@krfricke krfricke merged commit 292af08 into ray-project:master Jun 16, 2023
@justinvyu justinvyu deleted the release/tune/test_network_overhead branch June 22, 2023 01:21
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
[release/air] Fix release test timeout for `tune_scalability_network_overhead` (ray-project#36360)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Successfully merging this pull request may close these issues.

[release/air] tune_scalability_network_overhead.smoke-test runs longer than the full release test