Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(job_attachments): enhance handling S3 timeout errors and BotoCoreError #206

Merged
merged 1 commit into from
Mar 19, 2024

Conversation

gahyusuh
Copy link
Contributor

@gahyusuh gahyusuh commented Mar 12, 2024

Improve error handling for S3 requests by

  • adding "retries" configuration to the S3 client
  • adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError)

Although not related to the main changes here, made an additional improvement:
Update the script upload_cancel_test.py to create large files in smaller sized chunks to prevent MemoryError.

What was the problem/requirement? (What/Why)

There was an issue where a long and non-user-friendly error stack trace was printed to the console during file uploads when trying to submit a job over a slow internet bandwidth. (it was confirmed that they could GET and DELETE objects in their bucket with permissions, so it seems purely a bandwidth-related issue.) The reported timeout error was ReadTimeoutError, a botocore exception. While we are handling ClientError from botocore, we lacked handling for other errors like ReadTimeoutError.

What was the solution? (How)

  1. Use standard retry mode: include a "retries", setting the mode to "standard", in the S3 client's configuration. This is an more advanced retry logic than the default retry mode called legacy.
  2. Catch BotoCoreError: This botocore exception is a base class for many errors related to boto3 calls, which typically occur during the preparation phase of AWS service calls due to issues like network connection problems, or authentication failures. I have added more user-friendly error messages to better communicate these issues to users.

What is the impact of this change?

More robust error handling for S3 requests, providing clear and non-cumbersome error messages to users when S3 requests failed with BotoCoreError.

How was this change tested?

  • Unit tests: added a couple new unit tests covering the new error handling path.
  • Ran Integ tests and made sure all passed.
  • Manually reproduced the ReadTimeoutError while submitting a job to verify that the new error message is displayed as expected:
Submitting to Queue: Queue02
Job submission contains 1 files totaling 10.74 GB.  All files will be uploaded to S3 if they are not already present in the job attachments bucket.

Do you wish to proceed? [Y/n]: 
Hashing Attachments  [####################################]  100%
Hashing Summary:
    Processed 0 files totaling 0.0 B.
    Skipped re-processing 1 files totaling 10.74 GB.
    Total processing time of 0.0038 seconds at 0.0 B/s.

Failed to upload job attachments:
An issue occurred with AWS service request while checking for the existence of an object in the S3 bucket: Read timeout on endpoint URL: "https://s3.us-west-2.amazonaws.com/job-attachments-bucket/rootPrefix/Data/818cb10ad378806591de713e78b9b1f9.xxh128"
This could be due to temporary issues with AWS, internet connection, or your AWS credentials. Please verify your credentials an network connection. If the problem persists, try again later or contact support for further assistance.

Was this change documented?

No.

Is this a breaking change?

No.

@gahyusuh gahyusuh marked this pull request as ready for review March 12, 2024 18:21
@gahyusuh gahyusuh requested a review from a team as a code owner March 12, 2024 18:21
Copy link
Contributor

@marofke marofke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here recreating the error and making a much better UX from timeouts!

LGTM! 🚢 🦈

@gahyusuh gahyusuh force-pushed the gahyusuh/ja_s3_timeout branch 2 times, most recently from a0eee74 to e4c309d Compare March 15, 2024 17:23
…eError

Improve error handling for S3 requests by
- adding "retries" configuration to the S3 client
- adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
@gahyusuh gahyusuh merged commit 24fe21c into mainline Mar 19, 2024
18 checks passed
@gahyusuh gahyusuh deleted the gahyusuh/ja_s3_timeout branch March 19, 2024 14:52
baxeaz pushed a commit that referenced this pull request Mar 21, 2024
…eError (#206)

Improve error handling for S3 requests by
- adding "retries" configuration to the S3 client
- adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>
npmacl pushed a commit that referenced this pull request Mar 21, 2024
…eError (#206)

Improve error handling for S3 requests by
- adding "retries" configuration to the S3 client
- adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
baxeaz added a commit that referenced this pull request Mar 22, 2024
* Switch to running deadline_vfs as os_user

Signed-off-by: Brian Axelson <baxelson@amazon.com>

* feat(job_attachments): enhance handling S3 timeout errors and BotoCoreError (#206)

Improve error handling for S3 requests by
- adding "retries" configuration to the S3 client
- adding BotoCoreError handling to cover S3 timeout errors (e.g., ReadTimeoutError, ConnectTimeoutError)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* fix(job_attachments): Use files' last modification time to identify output files to be synced (#211)

Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* chore(deps): update python-semantic-release requirement (#216)

Updates the requirements on [python-semantic-release](https://github.com/python-semantic-release/python-semantic-release) to permit the latest version.
- [Release notes](https://github.com/python-semantic-release/python-semantic-release/releases)
- [Changelog](https://github.com/python-semantic-release/python-semantic-release/blob/master/CHANGELOG.md)
- [Commits](python-semantic-release/python-semantic-release@v8.7.0...v9.2.2)

---
updated-dependencies:
- dependency-name: python-semantic-release
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* chore(release): 0.41.0 (#217)

Signed-off-by: client-software-ci <129794699+client-software-ci@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* chore(deps): update coverage[toml] requirement from ~=7.2 to ~=7.4 (#156)

Updates the requirements on [coverage[toml]](https://github.com/nedbat/coveragepy) to permit the latest version.
- [Release notes](https://github.com/nedbat/coveragepy/releases)
- [Changelog](https://github.com/nedbat/coveragepy/blob/master/CHANGES.rst)
- [Commits](nedbat/coveragepy@7.3.0...7.4.0)

---
updated-dependencies:
- dependency-name: coverage[toml]
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* fix: Make StorageProfileOperatingSystemFamily enum case-insensitive

Signed-off-by: Caden Marofke <marofke@amazon.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* ci: add gpg signing of build artifacts (#218)

Signed-off-by: Charles Moore <122481442+moorec-aws@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* feat!: prep for rootPathFormat becoming ALL UPPERS (#222)

** BREAKING CHANGE **
* The PathFormat enum's values went from all lowercase to all uppercase
* The source_path_root in the path mapping rules return value from sync_inputs went from all lowercase to all uppercase

Signed-off-by: Morgan Epp <60796713+epmog@users.noreply.github.com>
Signed-off-by: Brian Axelson <baxelson@amazon.com>

* CR Feedback

Signed-off-by: Brian Axelson <baxelson@amazon.com>

* Cleaning up a few more 'executing the job' cases

Signed-off-by: Brian Axelson <baxelson@amazon.com>

---------

Signed-off-by: Brian Axelson <baxelson@amazon.com>
Signed-off-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: client-software-ci <129794699+client-software-ci@users.noreply.github.com>
Signed-off-by: Caden Marofke <marofke@amazon.com>
Signed-off-by: Charles Moore <122481442+moorec-aws@users.noreply.github.com>
Signed-off-by: Morgan Epp <60796713+epmog@users.noreply.github.com>
Co-authored-by: Gahyun Suh <132245153+gahyusuh@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: client-software-ci <129794699+client-software-ci@users.noreply.github.com>
Co-authored-by: Caden Marofke <marofke@amazon.com>
Co-authored-by: Charles Moore <122481442+moorec-aws@users.noreply.github.com>
Co-authored-by: Morgan Epp <60796713+epmog@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants