[Streaming] retry on requests errors #6963
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
src/datasets/utils/file_utils.py
Outdated
ClientError,
TimeoutError,
(nit) Those two correspond to aiohttp.client_exceptions.ClientError and asyncio.TimeoutError. I think it's best to name them explicitly, as you did with requests.exceptions.ConnectTimeout. It was not obvious to me at first (I wondered what the difference between those 4 was, only to figure out they were not all requests errors).
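The reviewer's suggestion amounts to spelling out each retried exception with its full module path, so it is obvious which library each one comes from. A minimal sketch (the tuple name `RETRY_ERRORS` is illustrative, not the PR's actual identifier):

```python
import asyncio

import aiohttp.client_exceptions
import requests.exceptions

# Naming each exception with its full module path makes clear that
# not all of them are `requests` errors: two come from aiohttp/asyncio.
RETRY_ERRORS = (
    aiohttp.client_exceptions.ClientError,
    asyncio.TimeoutError,
    requests.exceptions.ConnectTimeout,
    requests.exceptions.ConnectionError,
)
```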
Looks good yes 👍 Added a minor comment for clarity but that's all. huggingface_hub doesn't really have the concept of retrying on each and every HTTP request (only download requests), so I'm not surprised it can fail.
CI failures are unrelated to this PR, merging.
* [Streaming] retry on requests errors
* lucain's comment
Reported in https://discuss.huggingface.co/t/speeding-up-streaming-of-large-datasets-fineweb/90714/6 when training with a streaming dataloader.

cc @Wauplin it looks like the retries from `hfh` are not always enough. In this PR I let `datasets` do additional retries (that users can configure in `datasets.config`), since I couldn't find an easy way to increase the max_retries for `hfh` users in general.
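The extra retry layer described above can be sketched as a small wrapper around a read call. This is a minimal illustration, not the PR's actual implementation: `MAX_RETRIES`, `RETRY_INTERVAL`, and `read_with_retry` are hypothetical names standing in for the values that the PR makes configurable via `datasets.config`.

```python
import time

import requests

# Hypothetical stand-ins for the configurable values in `datasets.config`.
MAX_RETRIES = 3
RETRY_INTERVAL = 1  # seconds between attempts

# Transient request errors worth retrying on.
RETRYABLE_ERRORS = (
    requests.exceptions.ConnectTimeout,
    requests.exceptions.ConnectionError,
)


def read_with_retry(read, *args, **kwargs):
    """Call `read`, retrying on transient request errors.

    Re-raises the last error once MAX_RETRIES attempts are exhausted.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return read(*args, **kwargs)
        except RETRYABLE_ERRORS:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(RETRY_INTERVAL)
```

Wrapping the streamed read this way keeps the dataloader alive across brief network hiccups instead of crashing mid-training.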