[Data] [Release Test] Add AWS ACCESS_DENIED as retryable exception for multi-node Data+Train benchmarks (#47232)

## Why are these changes needed?

For release tests like `read_images_train_1_gpu_5_cpu`,
`read_images_train_4_gpu`, `read_images_train_16_gpu`, and their
variants, we observe `AWS ACCESS_DENIED` errors frequently, but not on
every run. By default, we do not retry on `ACCESS_DENIED` because it can
be raised in multiple situations and does not necessarily stem from
authentication failures; hence we cannot distinguish auth errors from
unrelated transient errors. See #47230 for more details on the
underlying issue.

For the purpose of this release test, we don't foresee authentication
issues, so we add `ACCESS_DENIED` as a retryable exception type, to
avoid failures for transient errors.
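
To illustrate the idea, here is a minimal, self-contained sketch of retrying on errors whose message contains a configured substring. It does not use Ray and is not Ray Data's actual retry implementation: `call_with_retry`, `FlakyRead`, and the local `retried_io_errors` list are hypothetical stand-ins, and the error text raised by `FlakyRead` is made up for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retryable_substrings: List[str],
    max_attempts: int = 3,
    base_delay_s: float = 0.0,
) -> T:
    """Hypothetical helper: retry `fn` if the error message matches a substring."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            message = str(exc)
            retryable = any(s in message for s in retryable_substrings)
            # Re-raise errors that do not match, or when attempts are exhausted.
            if not retryable or attempt == max_attempts:
                raise
            # Exponential backoff between attempts (zero delay by default).
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class FlakyRead:
    """Hypothetical read that fails with ACCESS_DENIED a fixed number of times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            # Made-up message text, used only for this illustration.
            raise OSError("AWS Error ACCESS_DENIED during read")
        return "ok"


# Stand-in for the list held by the data context; real defaults are not shown.
retried_io_errors: List[str] = []
retried_io_errors.append("AWS Error ACCESS_DENIED")

flaky_read = FlakyRead(failures=2)
result = call_with_retry(flaky_read, retried_io_errors)
```

With the substring registered, the two simulated `ACCESS_DENIED` failures are retried and the third call succeeds. With an empty list, the first failure propagates immediately, which mirrors the default behavior described above.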

## Related issue number

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [x] Release tests - https://buildkite.com/ray-project/release/builds/21397
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
scottjlee authored Aug 21, 2024
1 parent c01d524 commit 2063395
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion release/nightly_tests/dataset/multi_node_train_benchmark.py
@@ -571,8 +571,13 @@ def __iter__(self):
 def benchmark_code(
     args,
 ):
+    ctx = ray.data.DataContext.get_current()
+    # This release test runs into ACCESS_DENIED errors fairly often.
+    # We add ACCESS_DENIED as a retryable exception type to avoid flakiness.
+    # See for more details: https://github.com/ray-project/ray/issues/47230
+    ctx.retried_io_errors.append("AWS Error ACCESS_DENIED")
+
     if args.target_max_block_size_mb is not None:
-        ctx = ray.data.DataContext.get_current()
         ctx.target_max_block_size = args.target_max_block_size_mb * 1024 * 1024

     cache_input_ds = args.cache_input_ds