s3.read_csv slow with chunksize #324

Closed
JPFrancoia opened this issue Jul 16, 2020 · 15 comments

@JPFrancoia
Contributor

Describe the bug

I'm not sure the s3.read_csv function really reads a CSV in chunks. I noticed that for relatively big files, running the following instruction takes an abnormally long time:

it = wr.s3.read_csv(uri, chunksize=chunksize)

I think the chunksize parameter is ignored.

To Reproduce

I'm running awswrangler==1.1.2 (installed with poetry), but I quickly tested 1.6.3 and the issue seems to be present there too.

from itertools import islice
from smart_open import open as sopen
import awswrangler as wr
import pandas as pd
from io import StringIO

uri = ""

CHUNKSIZE = 100

def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Stream the object line by line and parse only the first `chunksize` lines.
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
        df = pd.read_csv(StringIO(chunk))

    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Ask awswrangler for an iterator of DataFrames and take only the first chunk.
    it = wr.s3.read_csv(uri, chunksize=chunksize)

    df = next(it)

    return df

I compared two different ways to load the first 100 lines of a "big" (1.2 GB) CSV file from S3:

  • with the equivalent of open(file, "r") and then lazily parsing the lines as a CSV string
  • using s3.read_csv with chunksize=100.

Results:

In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms

In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s

In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s

The timings are more or less reproducible. After comparing the last two timings, I suspect that the chunksize parameter is ignored: it takes more or less the same amount of time to load 100 lines of the file as to read the full file.

Is it expected?

@JPFrancoia JPFrancoia added the bug Something isn't working label Jul 16, 2020
@JPFrancoia
Contributor Author

Update:

Very interesting: I was able to pinpoint the problem to this line:

https://github.com/awslabs/aws-data-wrangler/blob/6f31e1bcd85769d49fb796d0194964d0806ae4d6/awswrangler/s3/_read.py#L174

with fs.open(path, mode) as f:
    reader: pandas.io.parsers.TextFileReader = parser_func(f, chunksize=chunksize, **pandas_kwargs)
    for df in reader:
        if dataset is True:
            for column_name, value in partitions.items():
                df[column_name] = value
        yield df

Apparently creating the reader is the problem: it takes too much time. Looking at my network activity, it seems to trigger the download of "a lot" of data; probably the entire file is downloaded at this point.

@igorborgest igorborgest self-assigned this Jul 16, 2020
@igorborgest igorborgest added this to the 1.7.0 milestone Jul 16, 2020
@igorborgest
Contributor

@JPFrancoia awesome troubleshooting! Thanks.

Your analysis leads me to believe that the culprit is the default block size we configured for s3fs.

Basically, s3fs fetches the object in blocks during the parsing, and there is an intrinsic trade-off there. Small blocks are memory-friendly because they let you parse your object without loading big chunks of raw data; however, they might introduce too much overhead and I/O usage (imagine downloading a 1 GB object with 200 requests fetching only 5 MB each).

I think that internally we should make block_size an argument of the get_fs() function and then start to use different block sizes inside the code base according to each scenario. E.g. inside read_csv(), when chunksize is used, we should instantiate a file system with a smaller block size.

What do you think?
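
For illustration, a rough sketch of that idea using s3fs directly (the function name, the 8 MB default, and the exact wiring are illustrative, not the library's actual internals):

import pandas as pd
import s3fs

def read_csv_in_chunks(path: str, chunksize: int, block_size: int = 8 * 2 ** 20):
    """Yield DataFrames of `chunksize` rows, fetching at most `block_size` bytes per S3 request."""
    fs = s3fs.S3FileSystem()
    # Open with a small block size instead of the big default, so only a few MB
    # are downloaded before the first chunk can be parsed and yielded.
    with fs.open(path, "rb", block_size=block_size) as f:
        for df in pd.read_csv(f, chunksize=chunksize):
            yield df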

@JPFrancoia
Contributor Author

Ahhh, I see. The current block size is ~1 GB and my file is 1.2 GB. That's why I thought the whole file was being downloaded; it's actually most of the file.

Yes, what you suggest makes sense. However, it might cause other problems, e.g.: how can you make sure that the smaller block size used for read_csv will be enough to accommodate the chunksize? But this can probably be solved.

@igorborgest
Contributor

Yes, what you suggest makes sense. However, it might cause other problems, e.g.: how can you make sure that the smaller block size used for read_csv will be enough to accommodate the chunksize? But this can probably be solved.

Yep, it could happen, but s3fs already handles this "pagination" mechanism. So in the worst case Wrangler will need to make more than one request to fetch a chunk, which is acceptable and better than what we have today. For now I prefer to keep it simple and accept the trade-off instead of implementing some kind of "block size predictor". Don't you agree?
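
Roughly what that looks like with s3fs alone (bucket, key and sizes below are placeholders): reading past the first block simply triggers additional range requests behind the scenes.

import s3fs

fs = s3fs.S3FileSystem()
with fs.open("s3://my-bucket/big.csv", "rb", block_size=1 * 2 ** 20) as f:  # 1 MB blocks
    head = f.read(5 * 2 ** 20)  # reading 5 MB transparently fetches a few more blocks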

@JPFrancoia
Contributor Author

That makes sense indeed; I didn't know s3fs was "that smart".

@igorborgest
Contributor

Hey @JPFrancoia,

I've just merged a huge refactoring (#328) of the S3 read/write code, which addresses your issue.

Could you test our dev branch and check that it addresses your use case before the official release?
The intention is to release it in version 1.7.0 later this week.

pip install git+https://github.com/awslabs/aws-data-wrangler.git@dev

@igorborgest igorborgest added minor release Will be addressed in the next minor release ready to release labels Jul 20, 2020
@JPFrancoia
Contributor Author

JPFrancoia commented Jul 20, 2020

Hi @igorborgest ,

Thanks, it looks like a lot of work!

I tested your changes under the same conditions as before and didn't get any errors. Here are the timings I got, for the exact same code:

In [2]: %time manual_chunking(uri)
CPU times: user 103 ms, sys: 29.7 ms, total: 133 ms
Wall time: 540 ms

In [3]: %time s3_chunking(uri)
CPU times: user 425 ms, sys: 370 ms, total: 795 ms
Wall time: 30.1 s

In [11]: %timeit -n 3 s3_chunking(uri)
5.29 s ± 1.04 s per loop (mean ± std. dev. of 7 runs, 3 loops each)

Somehow the first time I pulled with wr.s3.read_csv I got a "large" timing, but I couldn't reproduce it. When run 3 times, the new s3.read_csv function takes ~5 s. It's not as fast as the lazy line-by-line loading from the open file, but it's definitely usable now. I guess the block size is still slightly too large for my particular use case, but you had to compromise.

Thanks for your hard work.
So far from what I can tell, this code should be good to go.

@igorborgest
Contributor

Thanks for testing!

Now I've decreased the buffer size from 32 to 8 MB. I think it will result in a better experience.

Btw, the smart_open default buffer size is 128 KB. That's why manual_chunking is so fast for this tiny chunk. But it's always a matter of trade-offs, as we discussed above, so I'm happy with the current implementation too.

@JPFrancoia
Contributor Author

Sounds great, thanks again.

Out of curiosity, how would I change the block size of s3fs? It's probably not so great to pass it as a parameter to read_csv, but is there some sort of setting that I could set for an entire script?

@igorborgest
Contributor

Oh, actually this is a good opportunity to use our other new feature that will be released in version 1.7.0.

I did it in the commit above.

Could you also test it? You can configure it using:

wr.config.s3fs_block_size = 5 * 2 ** 20  # 5 MB

Or through an environment variable:

export WR_S3FS_BLOCK_SIZE=5242880

Check the related tutorial for more details about this new configuration strategy.
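
For context, a combined usage sketch (the URI is a placeholder and the 5 MB value is just the example above):

import awswrangler as wr

wr.config.s3fs_block_size = 5 * 2 ** 20  # 5 MB blocks for subsequent S3 reads

# Only about one block should be fetched before the first chunk is yielded.
chunks = wr.s3.read_csv("s3://my-bucket/big.csv", chunksize=100)
first_rows = next(chunks)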

@JPFrancoia
Contributor Author

Perfect, that's exactly what we needed to match smart_open's speed:

In [3]: %time manual_chunking(uri)
CPU times: user 96.9 ms, sys: 27.6 ms, total: 124 ms
Wall time: 436 ms

# set block size to more or less the same size as smart_open
In [4]: wr.config.s3fs_block_size = 128000

In [5]: %time s3_chunking(uri)
CPU times: user 101 ms, sys: 25.6 ms, total: 127 ms
Wall time: 616 ms

This config strategy looks nice. I read that it "will override the regular default arguments configured in the function signature", though?

@igorborgest
Contributor

I read that it "will override the regular default arguments configured in the function signature", though?

Yes, that was the only behavior originally, but now I will update the docs to mention that it can also set internal/not-exposed configurations. Does that make sense?

@JPFrancoia
Contributor Author

It does. I think the issue is resolved on my side, so feel free to close it when you're ready :)

Thanks again for your support.

@igorborgest
Contributor

Released in 1.7.0!

@igorborgest
Contributor

igorborgest commented Sep 1, 2020

@JPFrancoia FYI the s3fs_block_size config was replaced by s3_block_size in version 1.9.0.
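
For readers on 1.9.0 or later, the equivalent configuration would look like this (same 5 MB example value as above):

import awswrangler as wr

wr.config.s3_block_size = 5 * 2 ** 20  # 5 MB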
