[release/air] tune_scalability_network_overhead.smoke-test runs longer than the full release test #36346
Failed on 6/10/23: Job link
The timing numbers don't make sense:
The smoke test uses an m5a.large head node, while the full run uses an m5a.4xlarge head node.
Hypothesis to confirm: the head node size is probably where the performance discrepancy comes from. See comments below.
Hypothesis to confirm: this overhead is most likely from syncing files between nodes. Confirmed. See metrics below:
Vs. the passing run with a large head node:
The plot of the passing full release test is also a bit suspicious:
Logs
Timing on an errored (smoke test) run. Notice that the total run time (500 seconds) is greater than the time spent in the Tune loop (305 seconds).
This means that some extra overhead (waiting on syncs) happens after the experiment finishes, causing the test to time out.
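To make the gap concrete, here is a minimal sketch of the arithmetic, using only the two numbers quoted above. The function name `post_loop_overhead` is hypothetical and is not part of Ray or the release test utilities.

```python
def post_loop_overhead(total_runtime_s: float, tune_loop_s: float) -> float:
    """Return the seconds spent outside the Tune loop (e.g. waiting on syncs).

    Hypothetical helper for illustration; not a Ray API.
    """
    if tune_loop_s > total_runtime_s:
        raise ValueError("Tune loop time cannot exceed the total run time")
    return total_runtime_s - tune_loop_s


# Numbers from the errored smoke test run described in this issue.
total_runtime_s = 500.0
tune_loop_s = 305.0

overhead_s = post_loop_overhead(total_runtime_s, tune_loop_s)
overhead_fraction = overhead_s / total_runtime_s

print(f"Overhead outside the Tune loop: {overhead_s:.0f}s "
      f"({overhead_fraction:.0%} of the total run time)")
```

If this overhead is indeed sync time, roughly 195 seconds of the run are spent after the experiment itself has finished.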
Timing on a passing full run:
Other fixes to make:
The timed_tune_run test utility prints out "test success" even though the release tests YAML file is configured with a different timeout threshold. This is confusing for debugging.
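One possible direction for this fix, sketched under assumptions: report success or failure against a single, explicitly passed threshold and print that threshold alongside the measured time. The name `check_runtime` and its parameters are hypothetical, this is not the actual `timed_tune_run` implementation, and the values in the usage lines are illustrative only.

```python
def check_runtime(elapsed_s: float, max_runtime_s: float) -> bool:
    """Compare a measured run time against one explicit threshold.

    Hypothetical helper for illustration; `max_runtime_s` is assumed to be
    the same value that the release test config enforces.
    """
    passed = elapsed_s <= max_runtime_s
    status = "test success" if passed else "test failure"
    # Printing the threshold makes a mismatch with the config easy to spot.
    print(f"{status}: took {elapsed_s:.2f}s, threshold {max_runtime_s:.2f}s")
    return passed


# Illustrative values only; not taken from the release test config.
check_runtime(elapsed_s=305.0, max_runtime_s=400.0)
check_runtime(elapsed_s=500.0, max_runtime_s=400.0)
```

Passing the threshold in explicitly, rather than hard-coding it in the utility, keeps the printed verdict consistent with whatever limit the test runner actually applies.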