Failure to retry a read for an s3-COG (GeoTIFF on AWS s3) #2119
Comments
This sounds related to #1877 and OSGeo/gdal#2294, but must be subtly different. |
Comment / question on those links:
|
This might not be an optimal pattern, but it's a first pass at a function to read s3-COG metadata with actual retries:

    import logging
    import random
    import time
    from typing import Dict, Optional

    import rasterio

    LOGGER = logging.getLogger(__name__)

    def raster_file_metadata(geotiff: str) -> Optional[Dict]:
        """
        Read raster metadata.
        :param geotiff: a GeoTIFF file
        :return: a dictionary of metadata, if the geotiff exists
        """
        LOGGER.info("Reading GeoTIFF metadata: %s", geotiff)
        retry_delay = random.uniform(0.1, 0.5)
        retry_jitter = retry_delay / 10
        max_tries = 3
        num_tries = 0
        while num_tries < max_tries:
            num_tries += 1
            try:
                gdal_env = {}  # TODO: add GDAL env-vars
                if geotiff.startswith("s3:"):
                    # "s3://bucket/key" becomes "/vsis3/bucket/key"
                    vsis3_path = geotiff.replace("s3:/", "/vsis3")
                    gdal_env["CPL_VSIL_CURL_NON_CACHED"] = vsis3_path
                with rasterio.Env(**gdal_env):
                    with rasterio.open(geotiff) as src:
                        # raster_metadata() is a helper that is not shown in this comment
                        return raster_metadata(src)
            except rasterio.errors.RasterioIOError as err:
                LOGGER.error(err)
                time.sleep(retry_delay + num_tries * retry_jitter)
        return None  # every attempt failed

Update: something like this function can run into a different kind of error:
This might be related to additional GDAL env-vars, e.g.
|
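For reference, the application-level retry loop in the function above can be factored into a small generic helper. This is only a sketch under stated assumptions: `retry_call` and its parameters are hypothetical names, not part of rasterio or GDAL, and the exponential backoff is a swapped-in technique rather than the linear delay used in the function above. It only helps if the underlying layer actually re-issues the request on the next attempt, which is exactly what this issue reports may not happen.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    func: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...] = (OSError,),
    max_tries: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on `exceptions` with exponential backoff plus jitter.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_tries:
                raise  # out of attempts: surface the last error to the caller
            # backoff of base, 2*base, 4*base, ... plus up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise ValueError("max_tries must be at least 1")
```

With rasterio, one would presumably pass `exceptions=(rasterio.errors.RasterioIOError,)` and a `func` that opens the dataset inside a `rasterio.Env(...)` block and returns the metadata.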
Below https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access I found this sentence.
It's my understanding that this can also be used for vsis3 requests. Could you give that a try before we see about patching GDAL or enabling workarounds in rasterio? |
If it does, how does rasterio expose any HTTP status codes that could allow consuming code to try to make smarter decisions about how to handle IO errors? If it does not yet, would it be possible for rasterio to expose HTTP status codes on any
The HTTP response might be buried too deep in the library stack (?).
|
Right, @dazza-codes, the responses aren't accessible from a GDAL API. You can see them in stderr if you set |
I'll report back when some requests hit 503s on s3-COG reads after I've run some load tests with some updated settings to explore the use of

    # Starting with GDAL 2.3, the GDAL_HTTP_MAX_RETRY (number of attempts) and
    # GDAL_HTTP_RETRY_DELAY (in seconds) configuration option can be set, so that request
    # retries are done in case of HTTP errors 429, 502, 503 or 504.
    gdal_env["GDAL_HTTP_MAX_RETRY"] = 3
    gdal_env["GDAL_HTTP_RETRY_DELAY"] = 0.5  # default is 30 sec

BTW,

    # Partial downloads (requires the HTTP server to support random reading) are done with a
    # 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured
    # with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the
    # driver detects sequential reading it will progressively increase the chunk size up to
    # 2 MB to improve download performance.
    if n_points > 100:
        gdal_env["CPL_VSIL_CURL_CHUNK_SIZE"] = 5 * 16384
    elif n_points > 1:
        gdal_env["CPL_VSIL_CURL_CHUNK_SIZE"] = 3 * 16384

When reading with
It might be tricky to manage this setting, depending on how many files are in an s3 prefix, and it could be a little counter-intuitive, but it seems like allowing GDAL to read the s3 prefix to list the objects results in fewer s3

    # By default (GDAL_DISABLE_READDIR_ON_OPEN=FALSE), GDAL establishes a list
    # of all the files in the directory of the file passed to GDALOpen(). This
    # can result in speed-ups in some use cases, but also to major slow downs
    # when the directory contains thousands of other files. When set to TRUE,
    # GDAL will not try to establish the list of files.
    gdal_env["GDAL_DISABLE_READDIR_ON_OPEN"] = False
|
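The chunk-size selection earlier in this comment can be isolated into a small function, which makes the thresholds easy to tune and test during load tests. This is a minimal sketch: `curl_chunk_size` and `DEFAULT_CHUNK` are hypothetical names, and the thresholds are simply the ones from the snippet above, not recommended values.

```python
from typing import Optional

DEFAULT_CHUNK = 16384  # 16 KB, the default granularity quoted from the GDAL docs above


def curl_chunk_size(n_points: int) -> Optional[int]:
    """Return a CPL_VSIL_CURL_CHUNK_SIZE value in bytes, or None to keep GDAL's default.

    Hypothetical helper; thresholds copied from the snippet in the comment above.
    """
    if n_points > 100:
        return 5 * DEFAULT_CHUNK
    if n_points > 1:
        return 3 * DEFAULT_CHUNK
    return None  # a single point: leave the option unset
```

The caller would set `gdal_env["CPL_VSIL_CURL_CHUNK_SIZE"]` only when the returned value is not `None`.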
Update - it seems to work with
Some relevant GDAL env-var settings used to load-test s3-COG reads:

    retry_delay = random.uniform(0.2, 0.6)
    retry_jitter = retry_delay / 10
    retry_delay += retry_jitter
    max_tries = 3
    gdal_env = {
        "GDAL_CACHEMAX": 1024_000_000,  # about 1 GB, in bytes
        "GDAL_DISABLE_READDIR_ON_OPEN": False,
        # # debug options for libcurl verbose outputs
        # "CPL_DEBUG": True,
        # "CPL_CURL_VERBOSE": True,
    }
    if geotiff.startswith("s3:"):
        # https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access
        # Note that CPL_VSIL_CURL_NON_CACHED is NOT set
        gdal_env["VSI_CACHE"] = True
        gdal_env["VSI_CACHE_SIZE"] = gdal_env["GDAL_CACHEMAX"]
        gdal_env["GDAL_HTTP_MAX_RETRY"] = max_tries
        gdal_env["GDAL_HTTP_RETRY_DELAY"] = retry_delay
    try:
        with rasterio.Env(**gdal_env) as rio_env:
            with rasterio.open(geotiff) as src:
                # do stuff
                return stuff
    except rasterio.errors.RasterioIOError as err:
        LOGGER.error("%s rasterio read failed", geotiff)
        LOGGER.error(err)
        raise

An example log under load-testing where s3 rate throttling is expected and observed as 503 errors:
One possible performance problem with this approach is that it could apply to any GDAL reads for any of the files that it searches for, such as
Additional GDAL options

    # narrow the allowed files to read
    gdal_env["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
    gdal_env["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif"
|
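To keep these settings in one place, the options discussed in this thread could be assembled by a small builder function. This is a sketch only: `build_gdal_env` and its parameters are hypothetical names, the option names are the ones quoted in the comments above, and the chosen values are assumptions that should be checked against load tests.

```python
from typing import Any, Dict


def build_gdal_env(
    geotiff: str, max_tries: int = 3, retry_delay: float = 0.5
) -> Dict[str, Any]:
    """Assemble GDAL config options for reading a COG (hypothetical helper)."""
    env: Dict[str, Any] = {
        # skip the directory listing and only allow .tif reads
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
        "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": ".tif",
    }
    if geotiff.startswith("s3://"):
        # let GDAL retry on HTTP 429, 502, 503 and 504 (GDAL 2.3 or later)
        env["GDAL_HTTP_MAX_RETRY"] = max_tries
        env["GDAL_HTTP_RETRY_DELAY"] = retry_delay
    return env
```

The result would be passed along as `with rasterio.Env(**build_gdal_env(path)):` before opening the dataset.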
I deleted some irrelevant comments. Please don't do this, people. Use the discussion group linked in the README under "support". |
When rasterio reads from S3 and hits an S3 throttle limit (503), code that wraps the reader in a retry block fails to retry (sample function code below). It appears as though rasterio/GDAL has registered the dataset as some kind of missing dataset (S3 object) and refuses to retry reading it.
Sample function to read/retry COG metadata:
Are there better ways to use rasterio to retry any failed reads for s3-COG metadata and data? Are there any values for the rasterio.Env() or other configuration details that will avoid a failure to retry reading an s3-COG when the first read hits an S3 throttle error (503)? Is this a known feature of rasterio/GDAL or is this a bug?
Versions are binary wheel installations (pip only, with the GDAL libs bundled in rasterio), and it runs on AWS Lambda runtime containers for Python 3.7.