
[release/air] tune_scalability_network_overhead.smoke-test runs longer than the full release test #36346

Closed
justinvyu opened this issue Jun 12, 2023 · 1 comment · Fixed by #36360


justinvyu commented Jun 12, 2023

Failed on 6/10/23: Job link

The timing numbers don't make sense:

  • The smoke test runs with a different head node compute config than the full run -- m5a.large (smoke test head node) vs. m5a.4xlarge (full run head node). Hypothesis to confirm: this is likely where the performance discrepancy comes from. See comments below.
  • The smoke test runs (with 20 trials) consistently take longer than the full 100-trial runs, with more overhead on the driver node. Hypothesis to confirm: this overhead is most likely syncing files between nodes. Confirmed. See metrics below:
[Screenshot: node metrics from the failing smoke test run]

Vs. the passing one with a large head node:

[Screenshot: node metrics from the passing full run with the larger head node]

The plot of the passing full release test is also a bit suspicious:

  • We expect to see a final round of syncing at the end of training, but it looks like that gets skipped?

Logs

Timing on an errored (smoke test) run (note that the total run time of ~507 seconds is far longer than the ~306 seconds spent in the Tune loop):

[ERROR 2023-06-10 13:36:08,810] anyscale_job_wrapper.py: 290  Timed out. Time taken: 500.18672198499996
[WARNING 2023-06-10 13:36:08,811] anyscale_job_wrapper.py: 68  Couldn't upload to cloud storage: '/tmp/release_test_out.json' does not exist.
2023-06-10 13:36:18,209 INFO tune.py:1112 -- Total run time: 507.20 seconds (305.73 seconds for the tuning loop).

This means that some extra overhead (waiting on syncs) happens after the experiment finishes, causing the test to time out.
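
For reference, the size of that gap can be read straight off the Tune summary line above -- a minimal sketch of the arithmetic (the regex is only for illustration, not part of the release test tooling):

```python
import re

# Tune summary line from the errored smoke test run above.
line = ("2023-06-10 13:36:18,209 INFO tune.py:1112 -- Total run time: "
        "507.20 seconds (305.73 seconds for the tuning loop).")

match = re.search(r"Total run time: ([\d.]+) seconds \(([\d.]+) seconds", line)
total_s, loop_s = map(float, match.groups())

# ~201 seconds are spent after the tuning loop finishes, i.e. mostly
# waiting on the final round of syncs to the (small) head node.
print(f"Post-loop overhead: {total_s - loop_s:.2f}s")
```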

Timing on a passing full run:

2023-06-10 15:26:31,434 INFO tune.py:1112 -- Total run time: 316.99 seconds (313.65 seconds for the tuning loop).
The result network overhead test took 317.26 seconds, which is below the budget of 1000.00 seconds. Test successful. 

Other fixes to make:

  • The timeouts should be the same for both tests, since all trials can run concurrently in both the smoke test and the full version.
  • The timed_tune_run test utility prints "test success" even though the release test yaml is configured with a different (shorter) timeout threshold. This is confusing when debugging (see the sketch below):
--- PASSED: RESULT NETWORK OVERHEAD ::: 507.92 <= 1000.00 ---
...
[ERROR 2023-06-10 13:36:08,810] anyscale_job_wrapper.py: 290  Timed out. Time taken: 500.18672198499996
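
One way to catch this mismatch early would be to cross-check the two configured limits when the test starts -- a hypothetical sketch, where `job_wrapper_timeout_s` and `tune_run_budget_s` are illustrative names rather than the actual config keys:

```python
def check_timeout_consistency(job_wrapper_timeout_s: float,
                              tune_run_budget_s: float) -> None:
    """Fail fast if the job wrapper timeout can fire before the
    timed_tune_run budget is even reached (illustrative helper only,
    not part of the actual release test tooling)."""
    if job_wrapper_timeout_s < tune_run_budget_s:
        raise ValueError(
            f"Job wrapper timeout ({job_wrapper_timeout_s}s) is shorter than "
            f"the timed_tune_run budget ({tune_run_budget_s}s); the job can "
            "time out even though the test reports '--- PASSED ---'."
        )

# With the values from the failing smoke test run above:
# check_timeout_consistency(job_wrapper_timeout_s=500.0, tune_run_budget_s=1000.0)
# -> raises, flagging the confusing PASSED-then-timeout behavior.
```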
justinvyu added the P0, tune, air, and release-test labels on Jun 12, 2023
justinvyu (author) commented:

Another issue:

  • Since the full release test runs with an m5a.4xlarge head node with 16 CPUs, multiple trials get assigned to the head node, rather than a single trial per node as the test intends. As a result, the smoke test runs all 20 trials on separate nodes, while the full run places 8 trials on the head node (reducing its syncing overhead) and the remaining 92 trials on worker nodes. One way to keep trials off the head node is sketched below.
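
A minimal sketch of one way to keep every trial off the head node: require a custom resource that only worker nodes advertise (e.g. starting workers with `ray start --resources='{"worker_node": 1}'`). This is an illustration of the general technique, not necessarily what the fix in #36360 does:

```python
from ray import tune

def trainable(config):
    # Placeholder for the actual network-overhead workload.
    return {"ok": True}

# Each trial asks for 1 CPU plus a sliver of the custom "worker_node"
# resource. The head node does not advertise "worker_node", so Tune can
# only place trials on worker nodes; the head node is left to handle
# the driver and syncing.
tuner = tune.Tuner(
    tune.with_resources(
        trainable,
        tune.PlacementGroupFactory([{"CPU": 1, "worker_node": 0.01}]),
    ),
    tune_config=tune.TuneConfig(num_samples=20),
)
results = tuner.fit()
```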

krfricke pushed a commit that referenced this issue Jun 16, 2023
…overhead` (#36360)

There are 2 versions of the test: a smoke test version with 20 nodes/trials and a full version with 100 nodes/trials. The smoke test version timed out recently.

The two test setups are slightly different:
- Smoke test runs with m5a.large (2 cpus) x 20 nodes.
- Full version runs with m5a.4xlarge (16 cpus) head node + 99 x m5a.large worker nodes
- The smoke test takes **longer** than the full run due to syncing overhead at the end caused by the smaller head node instance (since all the syncing is going there).

This PR bumps the smoke test's head node instance size, and forces trials to run on worker nodes -- the head node is purely meant to handle syncing in this test. This also fixes a problem that existed before, where the full 100 trial release test would schedule 8 trials on the head node, rather than utilize every node in the cluster.

See #36346 for more context.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
…overhead` (ray-project#36360)
