aiohttp.client_exceptions.ServerDisconnectedError does not seem to be handled correctly #537
Comments
Does this happen with a smaller number or size of files, so we have a better chance to investigate? I can't tell whether the problem happens within the file listing stage or the downloads - do any files appear? If yes, you might try passing
Ok, I see the issue. We'll set a default in our library, but I don't think that the
Also it seems that
It looks like it should be passed if we whitelist it here: https://github.com/dask/s3fs/blob/main/s3fs/core.py#L203
@isidentical, I think you'll agree that the above is sensible?
Oh, it is indeed very large. Maybe we should change this:

```python
else:
    return soft_limit // 8
```

to something like this:

```python
else:
    return max(soft_limit // 8, _DEFAULT_BATCH_SIZE)
```

128 is a reasonable number, and even if it feels small people can override it, since this function only gets called when there is no default specified as an argument and the default is not present in the config.
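For context on where soft_limit comes from: on POSIX systems it can be read with Python's resource module. A minimal sketch of the kind of calculation being discussed, not the exact fsspec implementation, with _DEFAULT_BATCH_SIZE assumed to be the 128 mentioned above:

```python
import resource

_DEFAULT_BATCH_SIZE = 128  # assumed value, matching the default discussed above


def infer_batch_size() -> int:
    # Soft limit on open file descriptors for this process (POSIX only)
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Use an eighth of the limit, but never go below the default, so a very
    # low ulimit does not throttle concurrency down to a handful of tasks
    return max(soft_limit // 8, _DEFAULT_BATCH_SIZE)
```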
Indeed. I don't see why we need such a filter (CC: @martindurant) since on other filesystems I simply pass all the kwargs to the base class, but if we have a reason we should just add it there as well. Thanks for noticing this and doing the research!
No problem! I honestly think 128 is too small as a default - S3 is pretty performant; we've changed the default to 5,000 and it's scaling fine. To put it another way, we're constrained here by the number of tasks a Python event loop can handle rather than the number of requests S3 can accept. 128 is pretty low for both; maybe 1,024 would be a more reasonable default?
I have certainly never run into any limiting behaviour when making >1000 requests simultaneously. I would tend towards that kind of number.
Interesting. Is it possible to test this with, for example, 500 concurrent tasks? It would provide useful data for determining the limit.
Unfortunately not. This batching system was initially implemented due to open-file-limit errors. The current way of detection is simply the max number of files a process can open, divided by 8. Maybe we could change both the normal calculation and the upper limit. How about
S3 doesn't really have a request limit. At least not one we can determine. If you've got a cold bucket (1 rps) and you suddenly hit it with 5,000 requests per second it will tell you to back off and begin scaling up. But if you've got a hot bucket (~1,000 rps) you can burst up to ~30k rps without issue. The best way would be to dynamically adjust the batch size until you start hitting "go away" errors, and adjust from there. But that's going to be pretty complex. Currently we have a very large upper bound and this issue hasn't been raised before. I think
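The dynamic approach described (and set aside as too complex) would amount to an AIMD-style loop; a minimal sketch with made-up floor and ceiling values, not anything s3fs actually implements:

```python
def adjust_batch_size(current: int, throttled: bool, floor: int = 128, ceiling: int = 30_000) -> int:
    """AIMD-style adjustment: back off hard on throttling, grow slowly otherwise."""
    if throttled:
        # S3 answered with SlowDown / "go away" style errors: halve the batch
        return max(current // 2, floor)
    # No throttling seen: grow additively up to the ceiling
    return min(current + floor, ceiling)
```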
Can someone please propose that the general throttle limit be ~>1000, but that the limit for actions requiring local files (get, put) be the current value, which should account for open-file limits. |
@isidentical, actually, I'd like you to do this, if you have some time:

@orf, does increasing
Then it makes sense to expose this publicly from fsspec.asyn, with a better API (instead of automatically inferring the batch size from the open file limit, just use
Agreed. The default could be different for file and non-file cases, and should be configurable (we currently have the "gather_batch_size" key).
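As a practical note, the batch size can already be overridden; a hedged sketch, assuming the "gather_batch_size" config key mentioned above and, in recent fsspec versions, a batch_size argument on the bulk methods (bucket and paths here are made up):

```python
from fsspec.config import conf
import s3fs

# Raise the global default batch size via fsspec's config
# (the "gather_batch_size" key referred to above)
conf["gather_batch_size"] = 1024

fs = s3fs.S3FileSystem()
# Bulk transfer methods should also accept a per-call batch_size;
# the bucket, prefix and local directory below are hypothetical
fs.get("s3://example-bucket/prefix/", "local-dir/", recursive=True, batch_size=512)
```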
I might be overlooking something in the discussion above, but I'm not seeing a clear workaround I can apply for this ServerDisconnectedError. Is there one?
@orf, are you thinking that this should be a case that is retried, but currently is not?
@martindurant I think this is the issue. Launching 131,072 concurrent tasks, each rushing to send an individual HTTPS request to S3, is a bit optimistic. I would guess that the event loop stalls and this results in a cascade of these ServerDisconnectedErrors. But yes, it probably should be added to the set of retriable exceptions here: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L52, but I doubt this will fix the underlying issue?
OK, so may I suggest that you make a PR here to add the exception type to those which get retried.
Can you figure out how this happened? We have the reasonable upper limit of
If I'm understanding correctly, there are 2 issues at play here:
I note gcsfs tackled the retries issue for this error last year: fsspec/gcsfs#385
I don't believe 2. is an issue: zarr does not produce the HTTP calls; fsspec/s3fs does. I see the following two issues:
One thing to note here (lines 1019 to 1041 in 6f844d4):

Which means both
Hm, I see - _get_file has the same retry logic as everything else when starting the call via _call_s3, but a disconnect can happen later, while reading the stream. Ideally, we'd like to be able to restart the comm wherever it left off, so this is a little tricky to code. @orf, would you like to try implementing an outer loop that restarts the download should the local file not reach the required size?
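A rough sketch of the kind of outer loop being suggested - a hypothetical helper, not s3fs's actual implementation; it assumes the expected object size is known up front from fs.info() and simply re-downloads from scratch rather than resuming:

```python
import os
import time

import aiohttp
import s3fs


def get_file_with_restart(fs: s3fs.S3FileSystem, rpath: str, lpath: str, retries: int = 5) -> None:
    """Download rpath to lpath, restarting from scratch if the stream is cut off."""
    expected = fs.info(rpath)["size"]  # size of the remote object in bytes
    for attempt in range(retries):
        try:
            fs.get_file(rpath, lpath)
        except (aiohttp.ServerDisconnectedError, OSError):
            # Assumes the disconnect propagates as-is through the sync wrapper;
            # fall through to the size check below and retry if the file is short
            pass
        if os.path.exists(lpath) and os.path.getsize(lpath) == expected:
            return  # the local file reached the required size
        time.sleep(min(2 ** attempt, 30))  # back off before restarting the download
    raise IOError(f"failed to fully download {rpath} after {retries} attempts")
```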
Sure! I can whip up a PR for that.
I thought it might help if I produced an example that generally triggers this error when I'm using zarr. Obviously a silly example, but it seems to mostly fail. I run this with a 15-worker dask cluster on EC2.
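The snippet itself isn't shown above; a hypothetical sketch of the kind of zarr-over-S3 workload described - many small chunks written through s3fs, with a made-up bucket name and array sizes:

```python
import dask.array as da
import s3fs

fs = s3fs.S3FileSystem()
# Hypothetical bucket/key; the point is a zarr store backed by s3fs
store = s3fs.S3Map("example-bucket/repro.zarr", s3=fs)

# Many small chunks means a very large number of concurrent S3 requests per write
arr = da.random.random((20_000, 20_000), chunks=(100, 100))  # 40,000 chunks
arr.to_zarr(store, overwrite=True)
```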
I added a fix here: #601. I've also seen FSTimeoutError be thrown, which I also added to the retryable exceptions:
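The snippet from that comment isn't shown above; a sketch of the shape of the change, assuming the retryable exceptions live in the S3_RETRYABLE_ERRORS tuple in s3fs/core.py (its other members vary by version), not the exact diff in #601:

```python
import socket

import aiohttp
from fsspec.exceptions import FSTimeoutError

# Sketch of the tuple of exceptions that s3fs retries on when calling S3;
# the additions discussed in this thread are the last two entries
S3_RETRYABLE_ERRORS = (
    socket.timeout,
    aiohttp.ServerDisconnectedError,
    FSTimeoutError,
)
```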
Just wanted to comment that I've been using @orf's fix and it is working well for me.
What happened:
A user reports that running:
On a prefix with ~120,000 files (~4 GB) results in an aiohttp.client_exceptions.ServerDisconnectedError.

What you expected to happen:
The copy should succeed
Minimal Complete Verifiable Example:
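A hypothetical sketch of the kind of call described above (bucket and local path are made up; the real prefix held ~120,000 objects totalling ~4 GB):

```python
import s3fs

fs = s3fs.S3FileSystem()
# Recursive copy of a large prefix to local disk; names are hypothetical
fs.get("s3://example-bucket/big-prefix/", "/tmp/big-prefix/", recursive=True)
```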
Then:
Traceback:
Anything else we need to know?:
Possibly related to aio-libs/aiohttp#4549?
Environment: