s3.read_csv slow with chunksize #324
Update: very interesting, I was able to pin-point the problem to these lines:

```python
with fs.open(path, mode) as f:
    reader: pandas.io.parsers.TextFileReader = parser_func(f, chunksize=chunksize, **pandas_kwargs)
    for df in reader:
        if dataset is True:
            for column_name, value in partitions.items():
                df[column_name] = value
        yield df
```

Apparently making the reader is the problem: it takes too much time. Looking at my network activity, it looks like it triggers the download of "a lot" of data: probably the entire file is downloaded at this point.
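To confirm where the time goes, a quick sketch (not the library's code; the path is a hypothetical object, mirroring the snippet above):

```python
import time

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
path = "my-bucket/big-file.csv"  # hypothetical object

t0 = time.perf_counter()
f = fs.open(path, "r")
reader = pd.read_csv(f, chunksize=100)  # constructing the reader already pulls data
t1 = time.perf_counter()
first_chunk = next(iter(reader))
t2 = time.perf_counter()
f.close()
print(f"make reader: {t1 - t0:.1f} s, first chunk: {t2 - t1:.1f} s")
```

If the first timing dominates, the cost is in opening and buffering the file, not in parsing the chunk.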
@JPFrancoia awesome troubleshooting! Thanks. Your analysis leads me to believe that the culprit is the default block size we configured for s3fs. Basically, I think that internally we should use a smaller block size whenever `chunksize` is passed. What do you think?
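For illustration, s3fs lets the caller pick the block size per file handle; a minimal sketch (the bucket/key are hypothetical, and this is not awswrangler's internal code):

```python
import s3fs

fs = s3fs.S3FileSystem()

# With a smaller block_size, the first read pulls one ~5 MB block
# from S3 instead of a ~1 GB one.
with fs.open("my-bucket/big-file.csv", "rb", block_size=5 * 2 ** 20) as f:
    head = f.read(64 * 1024)  # downloads roughly one block, not the whole file
```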
Ahhh, I see. The current block size is ~1 GB and my file is 1.2 GB. That's why I thought the whole file was being downloaded; it's actually most of the file. Yes, what you suggest makes sense. However it might cause other problems, like: how can you make sure that the smaller block size used for chunked reads won't slow things down when the caller ends up consuming the whole file anyway?
Yep, it could happen, but s3fs only fetches blocks on demand and caches them, so the overhead should stay small.
That makes sense indeed; I didn't know s3fs was "that smart".
Hey @JPFrancoia, I've just merged a huge refactoring (#328) of the S3 read/write path where your issue was addressed. Could you test our dev branch and check whether it fixes your use case before the official release?
Hi @igorborgest, thanks, it looks like a lot of work! I tested your changes under the same conditions as before and didn't get any errors. Here are the timings I got, for the exact same code:

```
In [2]: %time manual_chunking(uri)
CPU times: user 103 ms, sys: 29.7 ms, total: 133 ms
Wall time: 540 ms

In [3]: %time s3_chunking(uri)
CPU times: user 425 ms, sys: 370 ms, total: 795 ms
Wall time: 30.1 s

In [11]: %timeit -n 3 s3_chunking(uri)
5.29 s ± 1.04 s per loop (mean ± std. dev. of 7 runs, 3 loops each)
```

Somehow the first time I pulled with `s3_chunking` was much slower (30.1 s) than the following runs (~5 s). Thanks for your hard work.
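The two benchmark functions are not shown in the thread; here is a minimal sketch of what they might look like (the names come from the timings above, the bodies are assumptions: `manual_chunking` streams the object with smart_open and parses only the first lines, while `s3_chunking` takes the first chunk from awswrangler's chunked reader):

```python
import io
import itertools

import awswrangler as wr
import pandas as pd
from smart_open import open as smart_open_file


def manual_chunking(uri: str) -> pd.DataFrame:
    # Stream the object lazily and keep only the header plus 100 data rows.
    with smart_open_file(uri, "r") as f:
        head = "".join(itertools.islice(f, 101))
    return pd.read_csv(io.StringIO(head))


def s3_chunking(uri: str) -> pd.DataFrame:
    # Ask awswrangler for 100-row chunks and take the first one.
    reader = wr.s3.read_csv(uri, chunksize=100)
    return next(iter(reader))
```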
Thanks for testing! I've now decreased the buffer size from 32 to 8 MB; I think it will result in a better experience. Btw, the smart_open default buffer size is 128 KB. That's why the manual smart_open approach is still faster.
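For comparison, smart_open exposes its S3 read buffer through `transport_params`; a small sketch (the URI is hypothetical):

```python
from smart_open import open as smart_open_file

# buffer_size is in bytes; 128 KB is smart_open's default for S3 reads.
with smart_open_file(
    "s3://my-bucket/big-file.csv",
    "r",
    transport_params={"buffer_size": 128 * 1024},
) as f:
    first_line = f.readline()  # fetches a small buffer, not the whole object
```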
Sounds great, thanks again. Out of curiosity, how would I change the block size of s3fs? It's probably not so great to pass it as a parameter to every `read_csv` call.
Oh, actually this is a good opportunity to use our other new feature that will be released in version 1.7.0. I've done it in the commit above. Could you also test it? You just configure it using:

```python
wr.config.s3fs_block_size = 5 * 2 ** 20  # 5 MB
```

Or through an environment variable:

```bash
export WR_S3FS_BLOCK_SIZE=5242880
```

Check the related tutorial for more details about this new configuration strategy.
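Putting the new config together with the chunked read from this issue, usage would look roughly like this (the URI is hypothetical):

```python
import awswrangler as wr

wr.config.s3fs_block_size = 5 * 2 ** 20  # 5 MB blocks for s3fs

# With a small block size, fetching the first 100-row chunk no longer
# downloads ~1 GB up front.
for df in wr.s3.read_csv("s3://my-bucket/big-file.csv", chunksize=100):
    print(df.shape)
    break
```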
Perfect, that's exactly what we needed to match smart_open's speed:

```
In [3]: %time manual_chunking(uri)
CPU times: user 96.9 ms, sys: 27.6 ms, total: 124 ms
Wall time: 436 ms

# set block size to more or less the same size as smart_open
In [4]: wr.config.s3fs_block_size = 128000

In [5]: %time s3_chunking(uri)
CPU times: user 101 ms, sys: 25.6 ms, total: 127 ms
Wall time: 616 ms
```

This config strategy looks nice. I read that it "will override the regular default arguments configured in the function signature", though?
Yes, that was the original behavior, but I will now update the docs to mention that it can also set internal, not exposed, configurations. Does that make sense?
It does. I think the issue is resolved on my side, so feel free to close it when you're ready :) Thanks again for your support.
Released in 1.7.0!
@JPFrancoia FYI the …
**Describe the bug**

I'm not sure the `s3.read_csv` function really reads a CSV in chunks. I noticed that for relatively big dataframes, reading just the first chunk takes an abnormally large amount of time: I think the `chunksize` parameter is ignored.

**To Reproduce**

I'm running awswrangler==1.1.2 (installed with poetry), but I quickly tested 1.6.3 and the issue seems to be there too.

I compared two different ways to load the first 100 lines of a "big" (1.2 GB) dataframe from S3:

1. `open(file, "r")` and then lazily parsing the lines as a CSV string;
2. `s3.read_csv` with `chunksize=100`.

**Results**

The timings are more or less reproducible. After comparing the last two timings, I suspect that the `chunksize` parameter is ignored: it takes more or less the same amount of time to load 100 lines of the file as to read the full file. Is this expected?