Support retries when reading from network resources #802

Open

Kirill888 opened this issue Oct 2, 2019 · 4 comments

Comments

@Kirill888 (Member)

Introduction

Datacube now supports network resources, in particular Cloud Optimized GeoTIFFs residing on S3 or other HTTP-based storage systems. Network resources can experience intermittent failures. For example, when working with S3 one can hit request limits, in which case the server responds with a 5XX error. As far as I can tell GDAL does not attempt to retry internally, and neither does rasterio currently. dc.load has an option to ignore files that fail to load (skip_broken_datasets=True), but that was meant to work around a slightly out-of-date DB index (files deleted from the file system but not from the DB), and it is not a good idea to use that option in a cloud environment.

Proposed Structure

Allow the user to specify what datacube should do when it fails to open or read a network resource, for example via an on_error callback supplied to dc.load|dc.load_data. The callback would be given information about the failure:

  • the operation that failed: open or read
  • the resource that failed, e.g. s3://bucket/path/file.tiff
  • the failure count, starting from 1 and counting up

The user-supplied callback could then indicate what to do next, with the options being:

  • Fail silently, skipping this data only (equivalent to the current skip_broken_datasets=True)
  • Fail with IOError (current default behaviour)
  • Try again (no current equivalent)

With a callback it is up to the user to implement a back-off strategy (the callback can sleep for a while before indicating that the read should be retried). We could then provide a number of common retry/error-handling strategies that users can choose from for their environment.
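
As a rough illustration, a minimal sketch of what such a callback might look like; the on_error keyword, the return values and the retry_then_skip name are all hypothetical, not part of the current datacube API:

    import time

    def retry_then_skip(operation, resource, failure_count):
        # operation:     'open' or 'read'
        # resource:      e.g. 's3://bucket/path/file.tiff'
        # failure_count: 1 for the first failure, counting up
        if failure_count <= 3:
            time.sleep(2 ** failure_count)  # caller-implemented back-off
            return 'retry'                  # try again (no current equivalent)
        return 'skip'                       # like skip_broken_datasets=True
        # returning 'fail' would raise IOError (the current default behaviour)

    # Hypothetical usage:
    # data = dc.load(product=..., on_error=retry_then_skip, **query)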

@Kirill888 (Member, Author)

Correction: GDAL does support retries, see commit 81bed71.

Nevertheless, having more control over what happens when a load failure occurs is still useful.
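
For reference, GDAL's retries are driven by the GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY configuration options. A minimal sketch of enabling them around a load, assuming datacube's reads inherit the ambient rasterio/GDAL environment (setting the same names as OS environment variables before the process starts is the more conservative route); the product name and query are placeholders:

    import datacube
    import rasterio

    dc = datacube.Datacube()

    # Ask GDAL to retry failed HTTP requests up to 5 times, with a 1 second retry delay.
    with rasterio.Env(GDAL_HTTP_MAX_RETRY=5, GDAL_HTTP_RETRY_DELAY=1):
        data = dc.load(product='my_product',               # placeholder product
                       x=(149.0, 149.1), y=(-35.3, -35.2),  # placeholder query
                       time='2020-01')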

@stale

stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 8, 2020
@stale stale bot closed this as completed Oct 7, 2020
@benjimin (Contributor)

benjimin commented Sep 24, 2024

This is a fundamental problem for the ODC API because it is designed to support deterministic operations on petabyte-scale data collections.

For a large compute job (involving massive ODC API reads) it is unacceptable if the API intermittently just omits a dataset from a time series (leading to incorrect or non-reproducible results), and it is also unacceptable if the job hangs or aborts unnecessarily (potentially wasting substantial compute resources). The likelihood of failures might also be exacerbated by collection management such as S3 Intelligent-Tiering.

The relevant upstream issue seems to be rasterio/rasterio#2119

We're still seeing user reports of intermittent failures during large reads, e.g. RasterioIOError: '/vsis3/dea-public-data/baseline/ga_s2bm_ard_3/....tif' not recognized as a supported file format.

Unfortunately this looks difficult to fix:

  • It's intermittent and difficult to replicate.
  • GDAL does not expose much of the HTTP interaction process (so we can't readily target a backoff-retry mechanism specifically at scaling-error responses).
  • ODC skip_broken_datasets will silently produce nondeterministic output.
  • Substantially increasing GDAL_HTTP_MAX_RETRY should reduce the incidence but will not eliminate it for large-scale jobs (potentially making the failures harder to test for and handle properly, and potentially making unrelated problems exhibit more wasteful failures).
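
A user-level workaround (a sketch only, not part of the datacube API) is to retry the whole load when one of these intermittent errors surfaces, rather than silently dropping datasets with skip_broken_datasets=True. This assumes the failure reaches the caller as rasterio's RasterioIOError and that repeating the query is acceptable:

    import time
    from rasterio.errors import RasterioIOError

    def load_with_retries(dc, max_attempts=3, delay=5, **query):
        # Retry the whole dc.load call on intermittent IO errors.
        for attempt in range(1, max_attempts + 1):
            try:
                return dc.load(**query)
            except RasterioIOError:
                if attempt == max_attempts:
                    raise
                time.sleep(delay * attempt)  # simple linear back-off between attempts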

@benjimin benjimin reopened this Sep 24, 2024
@benjimin benjimin removed the wontfix label Sep 24, 2024
@benjimin (Contributor)

benjimin commented Sep 24, 2024

AWS S3 documents that 503 Slow Down responses may occur if the application does not scale gradually enough. (500 Internal Error is also possible. If the request rate did scale up gradually, then S3 should be able to sustain more than 5000 requests per second to every "folder" or "prefix".)
https://repost.aws/knowledge-center/http-5xx-errors-s3
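
The standard mitigation AWS suggests for 503 Slow Down is exponential back-off with jitter between retries; a minimal sketch (the function name and defaults are illustrative only):

    import random
    import time

    def backoff_sleep(attempt, base=0.5, cap=30.0):
        # Sleep before retry number `attempt` (1, 2, 3, ...); the upper bound
        # doubles with each attempt and the actual wait is randomised ("full jitter").
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))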
