Commit
[Data] [Release Test] Add `AWS ACCESS_DENIED` as retryable exception for multi-node Data+Train benchmarks (#47232)

## Why are these changes needed?

For release tests like `read_images_train_1_gpu_5_cpu`, `read_images_train_4_gpu`, `read_images_train_16_gpu`, and their variants, we observe `AWS ACCESS_DENIED` errors somewhat consistently, but not every time. By default, we do not retry on `ACCESS_DENIED` because `ACCESS_DENIED` can be raised in multiple situations, and does not necessarily stem from authentication failures; hence we cannot distinguish auth errors from other unrelated transient errors. See #47230 for more details on the underlying issue.

For the purpose of this release test, we don't foresee authentication issues, so we add `ACCESS_DENIED` as a retryable exception type, to avoid failures for transient errors.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [x] Release tests
      - https://buildkite.com/ray-project/release/builds/21397
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <sjl@anyscale.com>
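
The idea described above, treating an error as retryable when its message matches an allow-listed string, could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code changed in this PR: `RETRYABLE_ERROR_SUBSTRINGS`, `call_with_retries`, and the backoff parameters are hypothetical names chosen for the example, and the sketch does not use Ray's actual retry mechanism.

```python
import time

# Hypothetical allow-list of error-message substrings to retry on.
# These are illustrative values, not Ray's actual defaults.
RETRYABLE_ERROR_SUBSTRINGS = ["AWS Error ACCESS_DENIED"]


def call_with_retries(fn, retryable=RETRYABLE_ERROR_SUBSTRINGS,
                      max_attempts=3, base_delay_s=0.0):
    """Call fn(), retrying only when the error message matches the allow-list."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            is_retryable = any(s in str(exc) for s in retryable)
            # Re-raise unrelated errors immediately, and give up after the
            # final attempt even if the error is retryable.
            if not is_retryable or attempt == max_attempts:
                raise
            # Exponential backoff between attempts.
            time.sleep(base_delay_s * 2 ** (attempt - 1))
```

Matching on the message string has the same limitation the description notes: it cannot tell a transient `ACCESS_DENIED` from a genuine authentication failure, so an allow-list like this is only reasonable where real auth problems are not expected, as in the release test.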