Support retries when reading from network resources #802

Open

Kirill888 opened this issue Oct 2, 2019 · 4 comments

Comments

@Kirill888 (Member)

Introduction

Datacube now supports network resources, in particular Cloud Optimized GeoTIFFs residing on S3 or other HTTP-based storage systems. Network resources can experience intermittent failures. For example, when working with S3 one can hit request limits, in which case the server responds with a 5XX error. As far as I can tell GDAL does not attempt to retry internally, and neither does rasterio currently. dc.load has an option to ignore files that fail to load (skip_broken_datasets=True), but that was meant to work around a slightly out-of-date DB index (files deleted from the file system but not from the DB), and it is not a good idea to use that option in a cloud environment.

Proposed Structure

Allow the user to specify what datacube should do when it fails to open or read a network resource, for example via an on_error callback supplied to dc.load|dc.load_data. The callback would be given information about the failure:

  • the operation that failed: open or read
  • the resource that failed, e.g. s3://bucket/path/file.tiff
  • the failure count, starting from 1 and counting up

The user-supplied callback could then indicate what to do next, with the options being:

  • Fail silently, skipping this data only (equivalent to the current skip_broken_datasets=True)
  • Fail with IOError (current default behaviour)
  • Try again (no current equivalent)

With a callback it is up to the user to implement a back-off strategy (the callback can sleep for a while before indicating that the read should be retried). We could then provide a number of common retry/error-handling strategies that users can choose from for their environment.
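
As a rough illustration, a minimal sketch of what such a callback might look like; the on_error keyword, the return values and the retry_then_skip name are all hypothetical, not part of the current datacube API:

    import time

    def retry_then_skip(operation, resource, failure_count):
        # operation:     'open' or 'read'
        # resource:      e.g. 's3://bucket/path/file.tiff'
        # failure_count: 1 for the first failure, counting up
        if failure_count <= 3:
            time.sleep(2 ** failure_count)  # caller-implemented back-off
            return 'retry'                  # try again (no current equivalent)
        return 'skip'                       # like skip_broken_datasets=True
        # returning 'fail' would raise IOError (the current default behaviour)

    # Hypothetical usage:
    # data = dc.load(product=..., on_error=retry_then_skip, **query)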

@Kirill888 (Member, Author)

Correction: GDAL does support retries, see commit 81bed71.

Nevertheless, having more control over what happens when a load failure occurs is still useful.
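
For reference, GDAL's retries are driven by the GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY configuration options. A minimal sketch of enabling them around a load, assuming datacube's reads inherit the ambient rasterio/GDAL environment (setting the same names as OS environment variables before the process starts is the more conservative route); the product name and query are placeholders:

    import datacube
    import rasterio

    dc = datacube.Datacube()

    # Ask GDAL to retry failed HTTP requests up to 5 times, with a 1 second retry delay.
    with rasterio.Env(GDAL_HTTP_MAX_RETRY=5, GDAL_HTTP_RETRY_DELAY=1):
        data = dc.load(product='my_product',               # placeholder product
                       x=(149.0, 149.1), y=(-35.3, -35.2),  # placeholder query
                       time='2020-01')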

@stale

stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 8, 2020
@stale stale bot closed this as completed Oct 7, 2020
@benjimin (Contributor)

benjimin commented Sep 24, 2024

This is a fundamental problem for the ODC API because it is designed to support deterministic operations on petabyte-scale data collections.

For a large compute job (involving massive ODC API reads) it is unacceptable if the API intermittently just omits a dataset from a time series (leading to incorrect or non-reproducible results), and it is also unacceptable if the job hangs or aborts unnecessarily (potentially wasting substantial compute resources). The likelihood of failures might also be exacerbated by collection management such as S3 Intelligent-Tiering.

The relevant upstream issue seems to be rasterio/rasterio#2119

We're still seeing user reports of intermittent failures during large reads, e.g. RasterioIOError: '/vsis3/dea-public-data/baseline/ga_s2bm_ard_3/....tif' not recognized as a supported file format.

Unfortunately this looks difficult to fix:

  • It's intermittent and difficult to replicate.
  • GDAL does not expose much of the HTTP interaction process (so we can't readily target a backoff-retry mechanism specifically at scaling-error responses).
  • ODC skip_broken_datasets will silently produce nondeterministic output.
  • Substantially increasing GDAL_HTTP_MAX_RETRY should reduce the incidence but will not eliminate it for large-scale jobs (potentially making the failures harder to test for and handle properly, and potentially making unrelated problems exhibit more wasteful failures).
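
A user-level workaround (a sketch only, not part of the datacube API) is to retry the whole load when one of these intermittent errors surfaces, rather than silently dropping datasets with skip_broken_datasets=True. This assumes the failure reaches the caller as rasterio's RasterioIOError and that repeating the query is acceptable:

    import time
    from rasterio.errors import RasterioIOError

    def load_with_retries(dc, max_attempts=3, delay=5, **query):
        # Retry the whole dc.load call on intermittent IO errors.
        for attempt in range(1, max_attempts + 1):
            try:
                return dc.load(**query)
            except RasterioIOError:
                if attempt == max_attempts:
                    raise
                time.sleep(delay * attempt)  # simple linear back-off between attempts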

@benjimin benjimin reopened this Sep 24, 2024
@benjimin benjimin removed the wontfix label Sep 24, 2024
@benjimin (Contributor)

benjimin commented Sep 24, 2024

AWS S3 documents that 503 Slow Down responses may occur if the application does not scale gradually enough. (500 Internal Error is also possible. If the request rate did scale up gradually, then S3 should be able to sustain more than 5000 requests per second to every "folder" or "prefix".)
https://repost.aws/knowledge-center/http-5xx-errors-s3
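
The standard mitigation AWS suggests for 503 Slow Down is exponential back-off with jitter between retries; a minimal sketch (the function name and defaults are illustrative only):

    import random
    import time

    def backoff_sleep(attempt, base=0.5, cap=30.0):
        # Sleep before retry number `attempt` (1, 2, 3, ...); the upper bound
        # doubles with each attempt and the actual wait is randomised ("full jitter").
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))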
