s3.read_csv slow with chunksize #324

Closed
JPFrancoia opened this issue Jul 16, 2020 · 15 comments

@JPFrancoia
Contributor

Describe the bug

I'm not sure the s3.read_csv function really reads a CSV in chunks. I noticed that for relatively big files, running the following instruction takes an abnormally long time:

it = wr.s3.read_csv(uri, chunksize=chunksize)

I think the chunksize parameter is ignored.

To Reproduce

I'm running awswrangler==1.1.2 (installed with poetry), but I quickly tested 1.6.3 and the issue seems to be present there too.

from itertools import islice
from smart_open import open as sopen
import awswrangler as wr
import pandas as pd
from io import StringIO

uri = ""

CHUNKSIZE = 100

def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Stream the object line by line and parse only the first `chunksize` lines.
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
        df = pd.read_csv(StringIO(chunk))

    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Ask awswrangler for an iterator of DataFrames and take only the first chunk.
    it = wr.s3.read_csv(uri, chunksize=chunksize)

    df = next(it)

    return df

I compared two different ways to load the first 100 lines of a "big" (1.2 GB) CSV file from S3:

  • with the equivalent of open(file, "r") and then lazily parsing the lines as a CSV string
  • using s3.read_csv with chunksize=100.

Results:

In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms

In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s

In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s

The timings are more or less reproducible. After comparing the last two timings, I suspect that the chunksize parameter is ignored: it takes more or less the same amount of time to load 100 lines of the file as to read the full file.

Is it expected?

@JPFrancoia JPFrancoia added the bug Something isn't working label Jul 16, 2020
@JPFrancoia
Contributor Author

Update:

Very interesting: I was able to pinpoint the problem to this line:

https://github.com/awslabs/aws-data-wrangler/blob/6f31e1bcd85769d49fb796d0194964d0806ae4d6/awswrangler/s3/_read.py#L174

with fs.open(path, mode) as f:
    reader: pandas.io.parsers.TextFileReader = parser_func(f, chunksize=chunksize, **pandas_kwargs)
    for df in reader:
        if dataset is True:
            for column_name, value in partitions.items():
                df[column_name] = value
        yield df

Apparently creating the reader is the problem: it takes too much time. Looking at my network activity, it seems to trigger the download of "a lot" of data; probably the entire file is downloaded at this point.

@igorborgest igorborgest self-assigned this Jul 16, 2020
@igorborgest igorborgest added this to the 1.7.0 milestone Jul 16, 2020
@igorborgest
Contributor

@JPFrancoia awesome troubleshooting! Thanks.

Your analysis leads me to believe that the culprit is the default block size we configured for s3fs.

Basically, s3fs fetches the object in blocks during the parsing, and there is an intrinsic trade-off there. Small blocks are memory-friendly because they let you parse your object without loading big chunks of raw data; however, they might introduce too much overhead and I/O usage (imagine downloading a 1 GB object with 200 requests fetching only 5 MB each).

I think that internally we should make block_size an argument of the get_fs() function and then start to use different block sizes inside the code base according to each scenario. E.g. inside read_csv(), when chunksize is used, we should instantiate a file system with a smaller block size.

What do you think?
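
For illustration, a rough sketch of that idea using s3fs directly (the function name, the 8 MB default, and the exact wiring are illustrative, not the library's actual internals):

import pandas as pd
import s3fs

def read_csv_in_chunks(path: str, chunksize: int, block_size: int = 8 * 2 ** 20):
    """Yield DataFrames of `chunksize` rows, fetching at most `block_size` bytes per S3 request."""
    fs = s3fs.S3FileSystem()
    # Open with a small block size instead of the big default, so only a few MB
    # are downloaded before the first chunk can be parsed and yielded.
    with fs.open(path, "rb", block_size=block_size) as f:
        for df in pd.read_csv(f, chunksize=chunksize):
            yield df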

@JPFrancoia
Contributor Author

Ahhh, I see. The current block size is ~1 GB and my file is 1.2 GB. That's why I thought the whole file was being downloaded; it's actually most of the file.

Yes, what you suggest makes sense. However, it might cause other problems, e.g.: how can you make sure that the smaller block size used for read_csv will be enough to accommodate the chunksize? But this can probably be solved.

@igorborgest
Contributor

Yes, what you suggest makes sense. However, it might cause other problems, e.g.: how can you make sure that the smaller block size used for read_csv will be enough to accommodate the chunksize? But this can probably be solved.

Yep, it could happen, but s3fs already handles this "pagination" mechanism. So in the worst case Wrangler will need to make more than one request to fetch a chunk, which is acceptable and better than what we have today. For now I prefer to keep it simple and accept the trade-off instead of implementing some kind of "block size predictor". Don't you agree?
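
Roughly what that looks like with s3fs alone (bucket, key and sizes below are placeholders): reading past the first block simply triggers additional range requests behind the scenes.

import s3fs

fs = s3fs.S3FileSystem()
with fs.open("s3://my-bucket/big.csv", "rb", block_size=1 * 2 ** 20) as f:  # 1 MB blocks
    head = f.read(5 * 2 ** 20)  # reading 5 MB transparently fetches a few more blocks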

@JPFrancoia
Contributor Author

That makes sense indeed; I didn't know s3fs was "that smart".

@igorborgest
Contributor

Hey @JPFrancoia,

I've just merged a huge refactoring (#328) of the S3 read/write code, which addresses your issue.

Could you test our dev branch and check that it addresses your use case before the official release?
The intention is to release it in version 1.7.0 later this week.

pip install git+https://github.com/awslabs/aws-data-wrangler.git@dev

@igorborgest igorborgest added minor release Will be addressed in the next minor release ready to release labels Jul 20, 2020
@JPFrancoia
Contributor Author

JPFrancoia commented Jul 20, 2020

Hi @igorborgest ,

Thanks, it looks like a lot of work!

I tested your changes under the same conditions as before and didn't get any errors. Here are the timings I got, for the exact same code:

In [2]: %time manual_chunking(uri)
CPU times: user 103 ms, sys: 29.7 ms, total: 133 ms
Wall time: 540 ms

In [3]: %time s3_chunking(uri)
CPU times: user 425 ms, sys: 370 ms, total: 795 ms
Wall time: 30.1 s

In [11]: %timeit -n 3 s3_chunking(uri)
5.29 s ± 1.04 s per loop (mean ± std. dev. of 7 runs, 3 loops each)

Somehow the first time I pulled with wr.s3.read_csv I got a "large" timing, but I couldn't reproduce it. When run 3 times, the new s3.read_csv function takes ~5 s. It's not as fast as the lazy line-by-line loading from the open file, but it's definitely usable now. I guess the block size is still slightly too large for my particular use case, but you had to compromise.

Thanks for your hard work.
So far from what I can tell, this code should be good to go.

@igorborgest
Contributor

Thanks for testing!

Now I've decreased the buffer size from 32 to 8 MB. I think it will result in a better experience.

Btw, the smart_open default buffer size is 128 KB. That's why manual_chunking is so fast for this tiny chunk. But it's always a matter of trade-offs, as we discussed above, so I'm happy with the current implementation too.

@JPFrancoia
Contributor Author

Sounds great, thanks again.

Out of curiosity, how would I change the block size of s3fs? It's probably not so great to pass it as a parameter to read_csv, but is there some sort of setting that I could set for an entire script?

@igorborgest
Contributor

Oh, actually this is a good opportunity to use our other new feature that will be released in version 1.7.0.

I did it in the commit above.

Could you also test it? You can configure it using:

wr.config.s3fs_block_size = 5 * 2 ** 20  # 5 MB

Or through an environment variable:

export WR_S3FS_BLOCK_SIZE=5242880

Check the related tutorial for more details about this new configuration strategy.
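
For context, a combined usage sketch (the URI is a placeholder and the 5 MB value is just the example above):

import awswrangler as wr

wr.config.s3fs_block_size = 5 * 2 ** 20  # 5 MB blocks for subsequent S3 reads

# Only about one block should be fetched before the first chunk is yielded.
chunks = wr.s3.read_csv("s3://my-bucket/big.csv", chunksize=100)
first_rows = next(chunks)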

@JPFrancoia
Contributor Author

Perfect, that's exactly what we needed to match smart_open's speed:

In [3]: %time manual_chunking(uri)
CPU times: user 96.9 ms, sys: 27.6 ms, total: 124 ms
Wall time: 436 ms

# set block size to more or less the same size as smart_open
In [4]: wr.config.s3fs_block_size = 128000

In [5]: %time s3_chunking(uri)
CPU times: user 101 ms, sys: 25.6 ms, total: 127 ms
Wall time: 616 ms

This config strategy looks nice. I read that it "will override the regular default arguments configured in the function signature", though?

@igorborgest
Contributor

I read that it "will override the regular default arguments configured in the function signature", though?

Yes, that was the only behavior originally, but now I will update the docs to mention that it can also set internal/not-exposed configurations. Does that make sense?

@JPFrancoia
Contributor Author

It does. I think the issue is resolved on my side, so feel free to close it when you're ready :)

Thanks again for your support.

@igorborgest
Contributor

Released in 1.7.0!

@igorborgest
Contributor

igorborgest commented Sep 1, 2020

@JPFrancoia FYI the s3fs_block_size config was replaced by s3_block_size in version 1.9.0.
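
For readers on 1.9.0 or later, the equivalent configuration would look like this (same 5 MB example value as above):

import awswrangler as wr

wr.config.s3_block_size = 5 * 2 ** 20  # 5 MB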
