HttpError with no message halfway through large GS write workload #316
The broken-pipe certainly sounds like it should be retried, and it's odd that for this case we don't know. Did you not even get an HTTP status code? There's probably a longer list of specific errors that should be considered retriable. Note that dask allows you to retry whole failed tasks, and that might be a good safeguard against weird intermittent problems on a small number of tasks in a large graph. Of course, we'd like to get it fixed anyway. |
I didn't, and here's the full log for reference, where snakemake is running a script that uses local dask:

```
Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/0a479a2e)
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	bgen_to_zarr
	1
Select jobs to execute...

[Fri Dec 4 12:37:26 2020]
Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
Shutting down, this might take some time.
```
|
Is there a way to do that when working through Xarray? Or is there some global dask distributed property that would control that? I looked at one point and was a little confused as to how that's supposed to work and whether or not it's safe with IO operations like this. |
I'm not sure - it's an optional argument to |
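(Editor's aside: a minimal sketch of one way to get per-task retries when writing zarr through xarray and dask. This assumes a `dask.distributed` Client, since `retries=` is a distributed-scheduler feature; the GCS path and dataset are placeholders, not anything from this thread.)

```python
# Sketch only: per-task retries for a zarr write via xarray + dask.
# Assumes a dask.distributed Client; the gs:// path is a placeholder.
import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()  # local "distributed" cluster, for illustration

ds = xr.Dataset({"x": (("t",), np.arange(10_000))}).chunk({"t": 1_000})

# Build the store step lazily instead of writing immediately...
delayed = ds.to_zarr("gs://my-bucket/example.zarr", compute=False)

# ...then compute it, re-running any failed task up to 3 times.
client.compute(delayed, retries=3).result()
```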
I saw this twice today outside of the context of Dask, once in Xarray as logged in pydata/xarray#4704 and again when called from Pandas:
I mention it because, of the 3 errors I linked to in pydata/xarray#4704, this one appears to be the most prevalent (and difficult to work around). |
Can you please apply the following and see if you get extra information?

```diff
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1299,7 +1299,7 @@ class GCSFileSystem(AsyncFileSystem):
         elif "invalid" in str(msg):
             raise ValueError("Bad Request: %s\n%s" % (path, msg))
         elif error:
-            raise HttpError(error)
+            raise HttpError({"code": status, "message": error})
         elif status:
             raise HttpError({"code": status})
```

(I can push this to a branch, if that helps with installation) |
The only mention I can find of something similar is this, where
which is not very descriptive. |
Indeed, why not add the following for completeness?

```diff
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1290,6 +1290,7 @@ class GCSFileSystem(AsyncFileSystem):
         # TODO: limit to appropriate exceptions
         msg = content
+        logger.debug("Error condition: %s" % ((status, content, json, path, headers), ))
         if status == 404:
             raise FileNotFoundError
         elif status == 403:
```
|
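(Editor's aside for anyone applying the patch above: the extra `logger.debug` call only appears if Python logging is configured to emit DEBUG records. A minimal sketch follows, assuming gcsfs logs under a logger named "gcsfs"; adjust the name if the module uses a different one, e.g. "gcsfs.core".)

```python
# Sketch: surface DEBUG output from the patched error handler.
# Assumes the gcsfs module logger is named "gcsfs".
import logging

logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
logging.getLogger("gcsfs").setLevel(logging.DEBUG)
```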
Thanks @martindurant, I patched my client environment and will post anything that gets caught here. |
I'm getting this same `gcsfs.utils.HttpError: Required` error with long-running zarr writes to GCS. The errors are common enough that jobs with ~100k chunks usually fail. I'll add these debug lines tomorrow and try running again to see if there is any more detail on the HttpError. |
For Dask retries you may want to try the dask.annotate function with the retries= keyword. This will require the latest release, I think.
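(Editor's aside, a hedged sketch of what that suggestion might look like, untested against any particular dask release: `dask.annotate` attaches annotations to graph layers as they are created, so the context manager should wrap the graph-building step, and the `retries` annotation is only acted on by the distributed scheduler. The store path is a placeholder.)

```python
# Sketch only: asking the distributed scheduler to retry failed tasks
# via dask.annotate. Requires a dask version with annotation support.
import dask
import dask.array as da
from dask.distributed import Client

client = Client()

with dask.annotate(retries=3):
    arr = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    # Placeholder GCS path; compute=False returns a lazy write task.
    stored = arr.to_zarr("gs://my-bucket/example.zarr", compute=False)

client.compute(stored).result()
```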
|
Actually, it may still be in a PR.
|
Here's the traceback:
The docs mentioned above are pretty cryptic, so it's hard to know what is going wrong, but for now could this 400 be added to |
Yes, I suppose we can allow 400 in general or specifically look for this weird response. I hope it is indeed intermittent. |
We have been running into much the same error (#323). After running our script for the last few hours, we were able to extract some more information about the error we were seeing. This code ran on a GCE VM. The following errors happened consecutively:
First, a gateway timeout happened (status: 504), after which the subsequent retries failed with a 401. A possible solution would be to make the number of retries configurable (in order to mitigate the risk of reaching API call limits) and to retry on 401 as well? An exponential backoff would also be preferable. |
How about: if the initial error (504) seems to be retriable, then we continue retrying regardless of the subsequent errors? Could you give that a try to see if it does the business - or maybe the first error changes something fundamental and there needs to be a deeper kind of reset. |
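(Editor's aside, to make that strategy concrete: a rough, generic sketch, not gcsfs's actual code, with all names illustrative. The first failure decides whether the operation is treated as retriable; once it is, subsequent failures of any kind keep being retried with exponential backoff up to a hard cap on attempts.)

```python
import random
import time


class HttpError(Exception):
    """Illustrative stand-in for an HTTP error carrying a status code."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


RETRIABLE_STATUSES = {408, 429, 500, 502, 503, 504}  # illustrative set


def call_with_retries(make_request, max_attempts=6, base_delay=1.0):
    """If the first failure looks retriable, keep retrying subsequent
    failures too (e.g. a 401 after a 504), with exponential backoff."""
    first_was_retriable = False
    for attempt in range(max_attempts):
        try:
            return make_request()
        except HttpError as exc:
            retriable = exc.code in RETRIABLE_STATUSES
            if attempt == 0:
                first_was_retriable = retriable
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not (retriable or first_was_retriable):
                raise
            # Exponential backoff with a little jitter.
            time.sleep(base_delay * 2 ** attempt + random.random())
```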
I am seeing similar issues, but also see a KeyError for a missing DataArray (which I verified is on disk as part of a dataset) bubbling up through xarray/zarr. I'm leaving this info here in case it helps or is related. I grabbed HEAD, which formats the HttpErrors a little better and shows the status code in the traceback. Some of my jobs hit the 401, while others hit the KeyError. I don't see a 504. Both can be remedied by enough retries of the workload.

HttpError

KeyError
|
@chrisroat, was there again no message with the HTTP errors? The KeyError is not a surprise; that is a valid outcome from zarr when it seems like the file in question can't be reached. Could one of the people on this thread please aggregate what we have learned into a PR, which either recognises the errors encountered here so that they become retriable, or else retries all errors by default except for a specific list that we know are not retriable? |
@martindurant The HttpErrors only have the status code passed in. My errors are 401 or KeyError. In general, I don't think a 401 should be retried if there is a chance it's real and a token has expired? In @DPGrev's case, it seems like the gateway error has triggered something, and it's possible the retries will all be 401s. @DPGrev, is it possible for you to check by using HEAD and adding 401 to the list in this function? Also, note that the back-off is exponential already. For the KeyError case, it means there was no underlying error that was caught -- it seems tricky to remedy that if it is at the GCS layer. |
I am getting the same error as was posted on Dec 31 in this thread (#316 (comment)), with a similar traceback and the 'Required: 400'. I understand this might have been solved by #335, so I'll try using gcsfs HEAD now instead of 0.7.2. Though not getting the error is no proof that it is solved (it is quite intermittent and random), I'll post back here with my results. |
Just noticed the fix in #380, so I tried out a test run with the latest changes. The good news is that it did seem to fix the 400 errors. I made it through 1.7TB of a 2+TB dataset write (much further than before), but the bad news is that I ran into another failure:
|
Would you say that this is another error that we should retry for? There are quite a few exceptions in aiohttp:

```
aiohttp.client_exceptions.ClientConnectionError
aiohttp.client_exceptions.ClientConnectorCertificateError
aiohttp.client_exceptions.ClientConnectorError
aiohttp.client_exceptions.ClientConnectorSSLError
aiohttp.client_exceptions.ClientError
aiohttp.client_exceptions.ClientHttpProxyError
aiohttp.client_exceptions.ClientOSError
aiohttp.client_exceptions.ClientPayloadError
aiohttp.client_exceptions.ClientProxyConnectionError
aiohttp.client_exceptions.ClientResponse
aiohttp.client_exceptions.ClientResponseError
aiohttp.client_exceptions.ClientSSLError
aiohttp.client_exceptions.ContentTypeError
aiohttp.client_exceptions.ServerConnectionError
aiohttp.client_exceptions.ServerDisconnectedError
aiohttp.client_exceptions.ServerTimeoutError
aiohttp.client_exceptions.WSServerHandshakeError
```
|
This was definitely intermittent, since the job had already been running a couple of hours, so I think retrying is appropriate. It seems hard to pick through these and figure out which actually should be retried, so another approach would be to just retry all of them, with a reasonable limit on total retries? |
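(Editor's aside, a hedged illustration of that "retry broadly, cap total attempts" idea: `aiohttp.ClientError` is the common base class of most of the exceptions listed above, so one except clause covers them, and timeouts can be handled alongside. This is a generic caller-side sketch, not gcsfs's implementation; the URL and names are placeholders.)

```python
import asyncio
import random

import aiohttp


async def fetch_with_retries(session, url, max_attempts=6, base_delay=1.0):
    """Retry any client/connection-level aiohttp failure, plus timeouts,
    with exponential backoff and a hard cap on total attempts."""
    for attempt in range(max_attempts):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.read()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())


async def main():
    async with aiohttp.ClientSession() as session:
        data = await fetch_with_retries(session, "https://example.com/")
        print(len(data))


if __name__ == "__main__":
    asyncio.run(main())
```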
Yes, I tend to agree with you, after having previously resisted the idea. There are just too many fail cases! The list is here, so this could be edited to add |
Please feel free to put that in a PR :) |
Can do. I'll test it out again with this workload and then make a PR. |
Similar to #315, I saw this today after 10s of GB of data had already been written to a Zarr archive (i.e. this isn't a problem with initial writes, it appears to be something spurious in long-running jobs):
Any ideas what this could be or if it should be caught/retried somewhere?
gcsfs version: 0.7.1