Support retries when reading from network resources #802
Correction: GDAL does support retries, see this commit 81bed71. Nevertheless, having more control over what happens when a load failure occurs is still useful.
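For context, GDAL's built-in HTTP retry behaviour is controlled through configuration options, which can be passed down via `rasterio.Env`. A minimal sketch, assuming a recent GDAL and a placeholder S3 URI:

```python
import rasterio

# GDAL_HTTP_MAX_RETRY / GDAL_HTTP_RETRY_DELAY tell GDAL's HTTP driver to
# retry failed requests (e.g. 429/5XX responses) with the given delay.
with rasterio.Env(GDAL_HTTP_MAX_RETRY=5, GDAL_HTTP_RETRY_DELAY=0.5):
    with rasterio.open("s3://bucket/path/file.tiff") as src:  # placeholder URI
        data = src.read(1)
```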
This is a fundamental problem for the ODC API because it is designed to support deterministic operations on petabyte-scale data collections. For a large compute job (involving massive ODC API reads) it is unacceptable if the API intermittently just omits a dataset from a time-series (leading to incorrect or non-reproducible results), and it is also unacceptable if the job hangs or aborts in failure unnecessarily (potentially wasting usage-time of substantial compute resources). The likelihood might also be exacerbated by collection management such as S3 intelligent tiering. The relevant upstream issue seems to be rasterio/rasterio#2119. We're still finding user reports of intermittent failures of large reads; unfortunately this looks difficult to fix.
AWS S3 documents that 503 Slow Down responses may occur if the application does not scale its request rate gradually enough. (500 Internal Error is also possible. If the request rate is scaled up gradually, then S3 should be able to sustain more than 5,000 requests per second to every "folder" or "prefix".)
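One common application-level mitigation is exponential back-off with jitter, so that a large job ramps up its request rate gradually instead of hammering a single prefix. A rough sketch, where `read_band` is a stand-in for whatever per-dataset read the job performs:

```python
import random
import time


def with_backoff(read_band, max_attempts=6, base_delay=0.5):
    """Call read_band(), retrying on IOError with exponential back-off and jitter."""
    for attempt in range(max_attempts):
        try:
            return read_band()
        except IOError:
            if attempt == max_attempts - 1:
                raise
            # Sleep 0.5s, 1s, 2s, ... plus jitter so concurrent workers desynchronise.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```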
Introduction
Datacube now supports network resources, particularly Cloud Optimized GeoTIFFs residing on S3 or other HTTP-based storage systems. Network resources might experience intermittent failures. For example, when working with S3 one can hit request limits, in which case the server will respond with a `5XX` response. GDAL does not attempt to retry internally as far as I can tell, and neither does `rasterio` currently. `dc.load` has an option to ignore files that failed to load (`skip_broken_datasets=True`), but this was meant to work around a slightly out-of-date DB index (when files were deleted from the file system but not from the DB); it is not a good idea to use that option in a cloud environment.
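For reference, a minimal sketch of how the existing flag is used today, assuming a configured Datacube instance; the product name and extents are hypothetical:

```python
import datacube

dc = datacube.Datacube(app="retry-example")

# With skip_broken_datasets=True, datasets that fail to open or read are
# silently dropped from the result instead of raising an IOError.
data = dc.load(
    product="ls8_sr",  # hypothetical product name
    x=(149.0, 149.2),
    y=(-35.4, -35.2),
    time=("2020-01-01", "2020-02-01"),
    skip_broken_datasets=True,
)
```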
Proposed Structure
Allow the user to specify what datacube should do in the event of a failure to open or read some network resource, for example via an `on_error` callback supplied to `dc.load|dc.load_data`. The callback will be given information about the failure:

- the kind of operation that failed: `open|read`
- the URI of the resource, e.g. `s3://bucket/path/file.tiff`

The user-supplied callback can then specify what to do next, with the options being:

- retry the operation
- skip the dataset (as with `skip_broken_datasets=True`)
- raise `IOError` (current default behaviour)

Using a callback, it is up to the user to implement a back-off strategy (the callback can sleep for a while before indicating that the operation should be retried). We can then implement a number of common retry/error-handling strategies that users can choose for their environment.
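A minimal sketch of what such a callback interface might look like; the `on_error` parameter, the `LoadFailure` structure, and the `Action` values are all hypothetical illustrations, not part of the current datacube API:

```python
import time
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"   # attempt the open/read again
    SKIP = "skip"     # drop the dataset, like skip_broken_datasets=True
    RAISE = "raise"   # propagate IOError (current default behaviour)


@dataclass
class LoadFailure:
    operation: str    # "open" or "read"
    uri: str          # e.g. "s3://bucket/path/file.tiff"
    attempt: int      # how many times this operation has failed so far


def retry_with_backoff(failure: LoadFailure) -> Action:
    """Retry up to 5 times with exponential back-off, then skip the dataset."""
    if failure.attempt >= 5:
        return Action.SKIP
    time.sleep(min(2 ** failure.attempt, 30))  # back off before retrying
    return Action.RETRY


# Hypothetical usage: dc.load(..., on_error=retry_with_backoff)
```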