[Parquet][Python] Variable download speed from threads when reading S3 #38664
Comments
Currently default impl using
Tried this now, setting (
Sigh, the Fragment read code can take multiple paths, and they're all a bit hacky, including the scanner, CacheOptions, and the Parquet file stats... It's hard to debug without local logs... Maybe we should add more documentation for that...
After some research, based on how the curves look, I think the most likely cause is network congestion control. It's odd to see such dramatic differences between connections, and it doesn't feel optimal, but I guess this is related to my network quality and not something that can be improved in Arrow?
For what it's worth, this same
I suspect this might be the issue. All of the connections are made to the same IP address (except the first one, I believe that's from the initial
Did some benchmarking against AWS CLI (
Some AWS EC2 env settings are able to control this using
Hmm, if you mean AWS_EC2_METADATA_DISABLED, I think that's related to determining the region? I understood that the DNS resolution to one IP would be something arising from libcurl (as setting the shuffle flag helped). I think I will just close this issue and open a new one if I have something more concrete...
I mean that maybe some of the flags in https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html help, but I didn't test them. The S3 filesystem in Arrow just wraps the AWS S3 SDK; maybe we can find some existing solutions around S3 and port them over?
Right, I couldn't find any AWS flag that would help here. There is also this issue about switching to the S3CrtClient in the AWS SDK: #18923 I actually tried the CRT client with Arrow; it's a drop-in replacement, and in my limited test cases it does give higher throughput than the old S3Client. I'm not sure what the main sources of the performance boost are, but it does handle the IP address discovery in some clever way, and it automatically splits GET requests into smaller ranges and parallelizes them. The main problem I've noticed is that it actually does a HEAD request for each byte-range GET in order to validate the range, so this would be an issue for low-latency use cases (I've been hoping that HEAD requests could be eliminated 😅).
Let's move #37868 forward...
@eeroel We want to performance-test Arrow with the AWS CRT client as you did. You said "I actually tried the CRT client with Arrow, it's a drop-in replacement, and in my limited test cases it does give higher throughput than the old S3Client." Can you provide us with a patch or code fragment for Arrow to use the CRT? Thanks!
@drjohnrbusch Sure! It's here; it's not up to date, but I think it should compile: https://github.com/eeroel/arrow/tree/feat/use_crt
Describe the enhancement requested
Hi,
When reading Parquet concurrently from S3 using Arrow's S3 filesystem implementation with FileSystemDataset, I observe that some threads often download data much slower than others. The downloads also don't seem to get any faster when other threads complete, which is what I would expect if this were just about network saturation (but networking is not my strong suit). As a result, reading a table is often slow with high variability, and using `pre_buffer=True` can sometimes hamper performance because individual threads download larger chunks (a sketch of how it is set is included below).

Here's an example of cumulative data fetched versus time, using nyc-taxi data, with four equal-sized chunks being downloaded from four threads. Two of the downloads take more than 2x the time of the two faster ones. The data is extracted from lines containing "bytes written" in the S3 debug output.
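For reference, here is a minimal sketch of how `pre_buffer` can be passed to a Parquet read; the bucket, key, and region are placeholders, not from the actual setup:

```python
# Minimal sketch of toggling pre_buffer; bucket/key/region are placeholders.
import pyarrow.parquet as pq
import pyarrow.fs as fs

s3 = fs.S3FileSystem(region="us-east-2")
table = pq.read_table(
    "some-bucket/some-key.parquet",  # hypothetical object
    filesystem=s3,
    pre_buffer=True,  # coalesce column-chunk ranges into larger reads
)
```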
I'm running on Mac OS 14.1, Apple Silicon.
Here's the code that reproduces this issue on my system, and writes the logs:
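(The original script is not preserved in this copy of the thread; below is a rough sketch of that kind of reproduction. The bucket, key, region, and options are assumptions, not the exact original code.)

```python
# Hedged reconstruction of the reproduction script; bucket, key, and region
# are assumptions.
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Enable verbose S3 SDK logging so the "bytes written" lines show up.
fs.initialize_s3(fs.S3LogLevel.Debug)

s3 = fs.S3FileSystem(region="us-east-2", anonymous=True)

# Public nyc-taxi data; the exact file is a placeholder.
dataset = ds.dataset(
    "ursa-labs-taxi-data/2019/06/data.parquet",
    format="parquet",
    filesystem=s3,
)

# Reading with threads issues concurrent range requests against S3.
table = dataset.to_table(use_threads=True)
print(table.num_rows)
```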
With that script in `perf_testing.py`, I filter the logs like so: `python perf_testing.py > out.log && cat out.log | grep "bytes written" > out_filtered.log`
And here is the code to produce the plot from the filtered log:
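(This snippet is likewise not preserved here; the following is a rough sketch of the approach, where the log-line regular expression is an assumption about the SDK's output format and will likely need adjusting.)

```python
# Hedged sketch of the plotting step; the regex is an assumption about what
# the filtered AWS SDK debug lines look like.
import re
from collections import defaultdict

import matplotlib.pyplot as plt

# Assumed shape: "... [thread-id] ... <n> bytes written ..."
pattern = re.compile(r"\[(?P<thread>[0-9a-fx]+)\].*?(?P<nbytes>\d+) bytes written")

cumulative = defaultdict(list)
totals = defaultdict(int)

with open("out_filtered.log") as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        thread = m.group("thread")
        totals[thread] += int(m.group("nbytes"))
        cumulative[thread].append(totals[thread])

# Plot cumulative bytes per thread; the log line index stands in for time.
for thread, values in cumulative.items():
    plt.plot(range(len(values)), values, label=f"thread {thread}")

plt.xlabel("log line index (proxy for time)")
plt.ylabel("cumulative bytes downloaded")
plt.legend()
plt.show()
```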
Component(s)
Parquet, Python