Handling intermittent data retrieval errors (retries) #18
Agreed, adding some retry logic could be appropriate, both to dataset opens and reads. Though it might be hard to identify which errors are appropriate to retry: "not recognized as a supported file format" sure doesn't sound like something that you should retry. We'd need to look through the vsicurl -> GDAL -> rasterio logic a bit to understand how HTTP error codes map onto the Python error that's ultimately raised.

In the end, though, I imagine this will be a user-configurable set of error types to retry (and how much to retry them), where we just provide a reasonable default, so you'd always be free to put in whatever error types you need.

We should have a similar set of "nodata" errors, where we just return an array of NaNs instead of retrying. This would resolve #12 in a more extensible way.
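For illustration, here's a minimal sketch of what such a user-configurable policy could look like. Everything here is hypothetical, not stackstac API: `RETRYABLE_PATTERNS`, `NODATA_PATTERNS`, `classify`, and `read_with_policy` are invented names, and the error-message patterns are guesses based on the errors discussed in this thread.

```python
import re
import time

import numpy as np
from rasterio.errors import RasterioIOError

# Hypothetical defaults a user could override (not actual stackstac API).
RETRYABLE_PATTERNS = [r"HTTP response code: (429|5\d\d)", r"Could not resolve host"]
NODATA_PATTERNS = [r"HTTP response code: 404"]  # treat as missing data -> NaN array

def classify(err: Exception) -> str:
    """Sort an error into 'nodata', 'retry', or 'fatal' by matching its message."""
    msg = str(err)
    if any(re.search(p, msg) for p in NODATA_PATTERNS):
        return "nodata"
    if any(re.search(p, msg) for p in RETRYABLE_PATTERNS):
        return "retry"
    return "fatal"

def read_with_policy(do_read, shape, retries=3, delay=2.0):
    """Run `do_read`; return a NaN array for 'nodata' errors, retry 'retry' errors."""
    for attempt in range(retries):
        try:
            return do_read()
        except RasterioIOError as err:
            kind = classify(err)
            if kind == "nodata":
                return np.full(shape, np.nan)
            if kind == "fatal" or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # linear backoff between attempts
```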
Yes, I have seen that 'not recognised as a file format' error multiple times for data that is absolutely there; try again the next time and it's perfectly fine. I downloaded the whole file by hand and viewed it with no problem, just to double-check. So it would seem to be some sort of failed-read/network issue that prevents GDAL from identifying the format correctly, which matches your point about the GDAL/rasterio errors.
Doing some more testing... same behaviour: a 404 when it's my internet problem, and a VSI error when element84 data is missing.
@TomAugspurger pointed out in microsoft/PlanetaryComputer#11 (comment) that we could use GDAL_HTTP_MAX_RETRY, via the GDAL environment options defined in stackstac/rio_reader.py (lines 36 to 42 at d3a78c4).
@gjoseph92 unfortunately GDAL_HTTP_MAX_RETRY helps only with retrying HTTP errors 429, 502, 503, or 504. As said in other comments, I also think it would be nice to support retrying user-specified errors.
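Those GDAL-level retries can at least be enabled today by passing the config options through a LayeredEnv. A minimal sketch; the retry counts are arbitrary, and the `open`/`read` layers just mirror the defaults used later in this thread:

```python
import stackstac

# GDAL's built-in retries only cover HTTP 429/502/503/504, per the comment above.
gdal_env = stackstac.rio_env.LayeredEnv(
    always=dict(
        GDAL_HTTP_MAX_RETRY=5,     # number of retries for those HTTP codes
        GDAL_HTTP_RETRY_DELAY=10,  # seconds between attempts
    ),
    open=dict(GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR", VSI_CACHE=True),
    read=dict(VSI_CACHE=False),
)

# items = ...  # a STAC item collection (assumed)
# stack = stackstac.stack(items, gdal_env=gdal_env)
```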
@gjoseph92 I am trying to solve this with the approach mentioned above. For this, I separated the dataset-reader creation into its own method so that users can override it with retry logic. My changes in stackstac can be found here. I then overrode the method like:

```python
import stackstac
from rasterio.errors import RasterioIOError
from time import sleep

class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """Open the dataset, retrying DNS and format-identification errors."""
        retries = 10
        retries_delay = 10
        for _ in range(retries):
            try:
                return stackstac.rio_reader.SelfCleaningDatasetReader(
                    self.url, sharing=False
                )
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = "Could not resolve host" in str(ex)
                read_problem_condition = "not recognized as a supported dataset name" in str(ex)
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(
            f"Failed to open {self.url} after {retries} retries with error {exception}"
        )
```

However, I notice that the same readers fail again and again within the same compute run, although they succeed in different runs. I even had cases where the compute ran without any problem. Is there any place where the result of SelfCleaningDatasetReader is getting cached?
EDIT

The reader class with the time log:

```python
import stackstac
from rasterio.errors import RasterioIOError
from time import sleep, time

class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """Open the dataset, retrying DNS and format-identification errors; time each attempt."""
        retries = 10
        retries_delay = 10
        for i in range(retries):
            try:
                time_start = time()
                return stackstac.rio_reader.SelfCleaningDatasetReader(
                    self.url, sharing=False
                )
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = "Could not resolve host" in str(ex)
                read_problem_condition = "not recognized as a supported dataset name" in str(ex)
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    print(f"Time for attempt {i + 1}: {time() - time_start}")
                    sleep(retries_delay)
                    continue
                print(f"Failed to open {self.url} with exception {ex}")
                raise ex
        raise Exception(
            f"Failed to open {self.url} after {retries} retries with error {exception}"
        )
```

EDIT 2

Timing the bare dataset open in a loop, to check whether anything is cached between opens:

```python
import rasterio as rio
import gc
import time
with rio.Env():
    # Time the process 10 times
    for i in range(10):
        start = time.time()
        a = rio.DatasetReader("<path to remote file>", sharing=False)
        print("DS", time.time() - start)
        a.close()
        del a
        gc.collect()
```

Result
SOLUTION

```python
import stackstac

gdal_env = stackstac.rio_env.LayeredEnv(
    always=dict(
        GDAL_HTTP_MULTIRANGE="YES",  # unclear if this actually works
        GDAL_HTTP_MERGE_CONSECUTIVE_RANGES="YES",
        # ^ unclear if this works either. Won't do much when our dask chunks
        # are aligned to the dataset's chunks.
        CPL_VSIL_CURL_USE_HEAD="NO",
        CPL_VSIL_CURL_NON_CACHED="/vsicurl",
    ),
    open=dict(
        GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
        # ^ stop GDAL from requesting `.aux` and `.msk` files from the bucket
        # (speeds up `open` time a lot)
        VSI_CACHE=True,
        # ^ cache HTTP requests for opening datasets. This is critical for
        # `ThreadLocalRioDataset`, which re-opens the same URL many times---having
        # the request cached makes subsequent `open`s in different threads snappy.
    ),
    read=dict(
        VSI_CACHE=False,
        # ^ *don't* cache HTTP requests for actual data. We don't expect to
        # re-request data, so this would just blow out the HTTP cache that we
        # rely on to make repeated `open`s fast (see above)
    ),
)
```
```python
from rasterio.errors import RasterioIOError
from time import sleep, time

class AutoParallelRioReaderWithRetry(stackstac.rio_reader.AutoParallelRioReader):
    # `_get_ds` and `_reader_read` are the override points introduced by my
    # stackstac changes linked above.
    def _safe_rasterio_operation(self, fn, *args, **kwargs):
        """Run a rasterio operation, retrying DNS and format-identification errors."""
        retries = 10
        retries_delay = 10
        for i in range(retries):
            try:
                time_start = time()
                return fn(*args, **kwargs)
            except RasterioIOError as ex:
                exception = ex
                dns_problem_condition = "Could not resolve host" in str(ex)
                read_problem_condition = "not recognized as a supported" in str(ex)
                if dns_problem_condition or read_problem_condition:
                    print("retrying")
                    print(f"Time for attempt {i + 1}: {time() - time_start}")
                    sleep(retries_delay)
                    continue
                print(f"Failed on {self.url} with exception {ex}")
                raise ex
        raise Exception(
            f"Failed on {self.url} after {retries} retries with error {exception}"
        )

    def _get_ds(self) -> stackstac.rio_reader.SelfCleaningDatasetReader:
        """Open the dataset through the retry wrapper."""
        return self._safe_rasterio_operation(
            stackstac.rio_reader.SelfCleaningDatasetReader, self.url, sharing=False
        )

    def _reader_read(self, reader, window, **kwargs):
        """Read a window through the retry wrapper."""
        return self._safe_rasterio_operation(
            reader.read,
            window=window,
            out_dtype=self.dtype,
            masked=True,
            # ^ NOTE: we always do a masked array, so we can safely apply scales
            # and offsets without potentially altering pixels that should have
            # been the ``fill_value``
            **kwargs,
        )
```
```python
# ...
items = ...
stack = stackstac.stack(items, gdal_env=gdal_env, reader=AutoParallelRioReaderWithRetry)
```
Now and then it seems a download fails, not because the data doesn't exist, just one of those internet things, even with excellent AWS data going to AWS S3. This is the usual error:

```
CPLE_OpenFailedError: '/vsicurl/https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/22/M/EV/2020/7/S2A_22MEV_20200704_0_L2A/B02.tif' not recognized as a supported file format.
```

So, something where it can retry when the error is not a 404?
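A hedged sketch of that predicate, keying off the text of the RasterioIOError. The message patterns are guesses based on the errors reported in this thread, and `should_retry` is an invented helper, not stackstac API:

```python
from rasterio.errors import RasterioIOError

def should_retry(err: RasterioIOError) -> bool:
    """Retry transient failures, but not genuine 404s."""
    msg = str(err)
    if "HTTP response code: 404" in msg:
        return False  # the object really isn't there; retrying won't help
    transient_markers = (
        "not recognized as a supported file format",  # seen here on flaky reads
        "Could not resolve host",                     # DNS/network hiccup
    )
    return any(marker in msg for marker in transient_markers)
```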