Retries for rare failures #4704
Comments
As far as I can tell, this has only been happening in gcsfs, so my suggestion to collect the set of conditions that should be considered "retryable" but currently aren't still holds. However, it is also worth discussing where else in the stack retries might be applied, since that would affect multiple storage backends.
This does happen with some other backends, specifically netCDF and pydap when accessing remote datasets via HTTP/OPeNDAP. We have a helper for these cases at xarray/xarray/backends/common.py, line 41 in 20d51cc.
I think exponential backoff with fuzzing is the right strategy for rare network failures, but I would suggest pushing this to as low a level as possible, ideally inside gcsfs. Retrying the whole dask computation seems quite wasteful.
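To make the strategy concrete, here is a minimal sketch of exponential backoff with fuzzing (random jitter). It is not the xarray or gcsfs implementation; the name `retry_with_backoff`, its parameters, and the default choice of `OSError` as the retryable exception are assumptions made for illustration only.

```python
import random
import time


def retry_with_backoff(func, *, retryable=(OSError,), max_retries=6,
                       initial_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on `retryable` errors with jittered backoff.

    Hypothetical helper: the name, signature, and defaults are illustrative
    assumptions, not an existing xarray or gcsfs API.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            # The base delay doubles on every failed attempt.
            base = initial_delay * 2 ** attempt
            # Fuzzing: add a random extra delay so that many workers
            # hitting the same failure do not all retry in lockstep.
            sleep(base + random.uniform(0, base))
```

Wrapping only the individual read, for example `retry_with_backoff(lambda: array[key])` with `array` and `key` standing in for whatever the backend is fetching, keeps the retry at a low level, so a single failed request does not force the whole computation to be redone.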
I recently ran into several issues with gcsfs (fsspec/gcsfs#316, fsspec/gcsfs#315, and fsspec/gcsfs#318) where errors are occasionally thrown, but only in large workflows where enough HTTP calls are made for them to become probable.
@martindurant suggested forcing dask to retry tasks that may fail like this with
.compute(... retries=N)
in fsspec/gcsfs#316, which has worked well. However, I also see this in Xarray/Zarr code interacting with gcsfs directly:
(Example traceback, collapsed in the original issue.)
Has there already been a discussion about how to address rare errors like this? Arguably, I could file the same issue with Zarr, but it seemed more productive to start here at a higher level of abstraction.
To be clear, the code for the example failure above typically succeeds, and reproducing this failure is difficult. I have only seen it a couple of times like this, where the calling code does not include dask, but it did make me want to know whether there are any plans to tolerate rare failures in Xarray as Dask does.