
[release/air] tune_scalability_network_overhead.smoke-test runs longer than the full release test #36346

Closed
justinvyu opened this issue Jun 12, 2023 · 1 comment · Fixed by #36360


justinvyu commented Jun 12, 2023

Failed on 6/10/23: Job link

The timing numbers don't make sense:

  • The smoke test runs with a different head node compute config than the full run -- m5a.large (smoke test head node) vs. m5a.4xlarge (full run head node). Hypothesis to confirm: this is likely where the performance discrepancy comes from. See comments below.
  • The smoke test runs (with 20 trials) consistently take longer than the full 100-trial runs, with more overhead on the driver node. Hypothesis to confirm: this overhead is most likely syncing files between nodes. Confirmed. See metrics below:
[Screenshot: node metrics from the failing smoke test run]

Vs. the passing one with a large head node:

[Screenshot: node metrics from the passing full run with the larger head node]

The plot of the passing full release test is also a bit suspicious:

  • We expect to see a final round of syncing at the end of training, but it looks like that gets skipped?

Logs

Timing on an errored (smoke test) run (note that the total run time of ~507 seconds is far longer than the ~306 seconds spent in the Tune loop):

[ERROR 2023-06-10 13:36:08,810] anyscale_job_wrapper.py: 290  Timed out. Time taken: 500.18672198499996
[WARNING 2023-06-10 13:36:08,811] anyscale_job_wrapper.py: 68  Couldn't upload to cloud storage: '/tmp/release_test_out.json' does not exist.
2023-06-10 13:36:18,209 INFO tune.py:1112 -- Total run time: 507.20 seconds (305.73 seconds for the tuning loop).

This means that some extra overhead (waiting on syncs) happens after the experiment finishes, causing the test to time out.
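
For reference, the size of that gap can be read straight off the Tune summary line above -- a minimal sketch of the arithmetic (the regex is only for illustration, not part of the release test tooling):

```python
import re

# Tune summary line from the errored smoke test run above.
line = ("2023-06-10 13:36:18,209 INFO tune.py:1112 -- Total run time: "
        "507.20 seconds (305.73 seconds for the tuning loop).")

match = re.search(r"Total run time: ([\d.]+) seconds \(([\d.]+) seconds", line)
total_s, loop_s = map(float, match.groups())

# ~201 seconds are spent after the tuning loop finishes, i.e. mostly
# waiting on the final round of syncs to the (small) head node.
print(f"Post-loop overhead: {total_s - loop_s:.2f}s")
```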

Timing on a passing full run:

2023-06-10 15:26:31,434 INFO tune.py:1112 -- Total run time: 316.99 seconds (313.65 seconds for the tuning loop).
The result network overhead test took 317.26 seconds, which is below the budget of 1000.00 seconds. Test successful. 

Other fixes to make:

  • The timeouts should be the same for both tests, since all trials can run concurrently in both the smoke test and the full version.
  • The timed_tune_run test utility prints "test success" even though the release test yaml is configured with a different (shorter) timeout threshold. This is confusing when debugging (see the sketch below):
--- PASSED: RESULT NETWORK OVERHEAD ::: 507.92 <= 1000.00 ---
...
[ERROR 2023-06-10 13:36:08,810] anyscale_job_wrapper.py: 290  Timed out. Time taken: 500.18672198499996
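
One way to catch this mismatch early would be to cross-check the two configured limits when the test starts -- a hypothetical sketch, where `job_wrapper_timeout_s` and `tune_run_budget_s` are illustrative names rather than the actual config keys:

```python
def check_timeout_consistency(job_wrapper_timeout_s: float,
                              tune_run_budget_s: float) -> None:
    """Fail fast if the job wrapper timeout can fire before the
    timed_tune_run budget is even reached (illustrative helper only,
    not part of the actual release test tooling)."""
    if job_wrapper_timeout_s < tune_run_budget_s:
        raise ValueError(
            f"Job wrapper timeout ({job_wrapper_timeout_s}s) is shorter than "
            f"the timed_tune_run budget ({tune_run_budget_s}s); the job can "
            "time out even though the test reports '--- PASSED ---'."
        )

# With the values from the failing smoke test run above:
# check_timeout_consistency(job_wrapper_timeout_s=500.0, tune_run_budget_s=1000.0)
# -> raises, flagging the confusing PASSED-then-timeout behavior.
```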
justinvyu added the P0, tune, air, and release-test labels on Jun 12, 2023
justinvyu (author) commented:

Another issue:

  • Since the full release test runs with an m5a.4xlarge head node with 16 CPUs, multiple trials get assigned to the head node, rather than a single trial per node as the test intends. As a result, the smoke test runs all 20 trials on separate nodes, while the full run places 8 trials on the head node (reducing its syncing overhead) and the remaining 92 trials on worker nodes. One way to keep trials off the head node is sketched below.
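
A minimal sketch of one way to keep every trial off the head node: require a custom resource that only worker nodes advertise (e.g. starting workers with `ray start --resources='{"worker_node": 1}'`). This is an illustration of the general technique, not necessarily what the fix in #36360 does:

```python
from ray import tune

def trainable(config):
    # Placeholder for the actual network-overhead workload.
    return {"ok": True}

# Each trial asks for 1 CPU plus a sliver of the custom "worker_node"
# resource. The head node does not advertise "worker_node", so Tune can
# only place trials on worker nodes; the head node is left to handle
# the driver and syncing.
tuner = tune.Tuner(
    tune.with_resources(
        trainable,
        tune.PlacementGroupFactory([{"CPU": 1, "worker_node": 0.01}]),
    ),
    tune_config=tune.TuneConfig(num_samples=20),
)
results = tuner.fit()
```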

krfricke pushed a commit that referenced this issue Jun 16, 2023
…overhead` (#36360)

There are 2 versions of the test: a smoke test version with 20 nodes/trials and a full version with 100 nodes/trials. The smoke test version timed out recently.

The two test setups are slightly different:
- Smoke test runs with m5a.large (2 cpus) x 20 nodes.
- Full version runs with m5a.4xlarge (16 cpus) head node + 99 x m5a.large worker nodes
- The smoke test takes **longer** than the full run due to syncing overhead at the end caused by the smaller head node instance (since all the syncing is going there).

This PR bumps the smoke test's head node instance size, and forces trials to run on worker nodes -- the head node is purely meant to handle syncing in this test. This also fixes a problem that existed before, where the full 100 trial release test would schedule 8 trials on the head node, rather than utilize every node in the cluster.

See #36346 for more context.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
…overhead` (ray-project#36360)
