
[release/air] Fix release test timeout for tune_scalability_network_overhead #36360

Merged

Conversation


@justinvyu justinvyu commented Jun 13, 2023

Why are these changes needed?

There are 2 versions of the test: a smoke test version with 20 nodes/trials and a full version with 100 nodes/trials. The smoke test version timed out recently.

The two test setups differ slightly:

  • Smoke test runs with m5a.large (2 cpus) x 20 nodes.
  • Full version runs with m5a.4xlarge (16 cpus) head node + 99 x m5a.large worker nodes
  • The smoke test takes longer than the full run due to syncing overhead at the end caused by the smaller head node instance (since all the syncing is going there).

This PR bumps the smoke test's head node instance size and forces trials to run on worker nodes -- in this test, the head node is purely meant to handle syncing. This also fixes a pre-existing problem where the full 100-trial release test would schedule 8 trials on the head node rather than utilizing every node in the cluster.
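One common way to keep trials off the head node is to have it advertise 0 CPUs, so the scheduler can only place CPU-requesting trials on workers. A hedged sketch of a Ray cluster-launcher config (instance types taken from this PR; node type names and worker counts illustrative, not the actual release test config):

```yaml
# Illustrative cluster-launcher config: the head node advertises CPU: 0,
# so trials that request CPUs can only land on worker nodes; the head node
# is left to run the driver and handle result syncing.
available_node_types:
  head_node:
    node_config:
      InstanceType: m5a.4xlarge   # bumped head node size, as in this PR
    resources: {"CPU": 0}         # keep trials off the head node
  worker_node:
    node_config:
      InstanceType: m5a.large
    min_workers: 20
    max_workers: 20
head_node_type: head_node
```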

See #36346 for more context.

Questions

Some tests I ran (10 `session.report` calls, spaced 3 seconds apart → 30 seconds ideal run time):

  • With a m5a.large head node (2 cpus):
    • Total run time: 69.03 seconds (35.89 seconds for the tuning loop) → over 30 seconds of final head node syncing
  • With a m5a.4xlarge head node (16 cpus):
    • Total run time: 41.94 seconds (35.37 seconds for the tuning loop) → ~5 seconds of final head node syncing

Is this expected / ok performance for the super tiny head node? This most likely does not reflect real usage.
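The trailing sync cost in the numbers above can be read off as the gap between total run time and tuning-loop time. A tiny hypothetical helper, just restating those measurements:

```python
def sync_overhead(total_s: float, tuning_loop_s: float) -> float:
    """Trailing head-node sync time: total wall clock minus tuning loop."""
    return round(total_s - tuning_loop_s, 2)

# m5a.large (2 CPU) head node: ~33 s spent syncing after the tuning loop
print(sync_overhead(69.03, 35.89))  # 33.14
# m5a.4xlarge (16 CPU) head node: ~6.6 s
print(sync_overhead(41.94, 35.37))  # 6.57
```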

Related issue number

Closes #36346

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu commented Jun 13, 2023

Release tests kicked off here: https://buildkite.com/ray-project/release-tests-pr/builds/42034

@justinvyu justinvyu added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jun 15, 2023

@krfricke krfricke left a comment


Thanks!

@krfricke krfricke merged commit 292af08 into ray-project:master Jun 16, 2023
@justinvyu justinvyu deleted the release/tune/test_network_overhead branch June 22, 2023 01:21
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
[release/air] Fix release test timeout for `tune_scalability_network_overhead` (ray-project#36360)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Successfully merging this pull request may close these issues.

[release/air] tune_scalability_network_overhead.smoke-test runs longer than the full release test