slow .zip uncompression between s3 #742

Closed
tooptoop4 opened this issue Nov 23, 2022 · 6 comments

@tooptoop4
Contributor

tooptoop4 commented Nov 23, 2022

This code is writing at around 275k records per minute. Any idea how to speed it up?
My file is 6 GB compressed and 180 GB uncompressed, so it won't fit on local disk or in memory. I notice only about 200 MB of memory is being consumed.

import zipfile

import smart_open

# Larger multipart upload parts and read buffer for S3.
smart_open.s3.DEFAULT_MIN_PART_SIZE = 250 * 1024**2
smart_open.s3.DEFAULT_BUFFER_SIZE = 50 * 1024**2

# Stream the archive straight from S3; it is too large for local disk or memory.
with smart_open.open('s3://redacted/biz.csv.zip', mode='rb') as fin:
    with zipfile.ZipFile(fin, mode='r') as zipf:
        for subfile in zipf.namelist():
            with smart_open.open('s3://redacted/biz.csv', mode='wb') as fout:
                printcounter = 0
                with zipf.open(subfile) as member:
                    for line in member:
                        printcounter += 1
                        fout.write(line)
                        if printcounter % 500000 == 0:
                            print(str(printcounter) + ' rows')
@mpenkov
Collaborator

mpenkov commented Nov 23, 2022

Can't see anything obviously wrong with your source.

Try to work out where the bottleneck is first. Is it:

  1. Reading from S3
  2. Writing to S3
  3. Decompressing the zip file
  4. Something else?

You can reduce the effects of the first two by running on EC2. For 3, there's probably not much you can do.

You can also consider buying a larger hard drive and working on this locally; 180 GB is well within range even for a laptop these days ;)
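
One way to separate points 1–3 above is to time each stage on its own. A rough sketch, reusing the source key from the snippet above and discarding all output so that only read and decompression throughput are measured:

import time
import zipfile

import smart_open


def time_raw_read(url, chunk_size=50 * 1024**2):
    # Measures raw S3 download throughput, with no decompression at all.
    start, total = time.perf_counter(), 0
    with smart_open.open(url, 'rb') as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    print('raw read: %.1f MiB/s' % (total / 1024**2 / elapsed))


def time_read_and_unzip(url, chunk_size=50 * 1024**2):
    # Measures download plus zip decompression, still writing nothing back to S3.
    start, total = time.perf_counter(), 0
    with smart_open.open(url, 'rb') as fin, zipfile.ZipFile(fin) as zipf:
        for name in zipf.namelist():
            with zipf.open(name) as member:
                while True:
                    chunk = member.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    elapsed = time.perf_counter() - start
    print('read + unzip: %.1f MiB/s uncompressed' % (total / 1024**2 / elapsed))


time_raw_read('s3://redacted/biz.csv.zip')
time_read_and_unzip('s3://redacted/biz.csv.zip')

If either of these alone is close to the overall rate you observed, that stage is the bottleneck; if both are fast, the slowdown is on the S3 upload side.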

@rustyconover
Contributor

I know why this is slow.

The ZipFile module calls seek() on every read, even when the position it wants to seek to is the current position.

When smart_open handles a seek() call, it explicitly drops the current data buffer and sends a new request to S3, even if the destination position matches the current position.

Performance can be improved by changing smart_open's seek() implementation to compare the destination position against the current position and skip the request when they match.

I'll open a PR to improve this.
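
In other words, the change amounts to an early-return guard in the reader's seek path. A minimal illustrative sketch of the idea (not smart_open's actual internals; the class and attribute names here are made up):

import io


class BufferedRemoteReader(io.RawIOBase):
    # Illustrative only: a reader that tracks a position and a local buffer
    # over a remote object, the way an S3-backed file object does.

    def __init__(self, size):
        self._size = size        # total object size in bytes
        self._position = 0       # current absolute position in the object
        self._buffer = b''       # locally buffered, not-yet-consumed bytes

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            destination = offset
        elif whence == io.SEEK_CUR:
            destination = self._position + offset
        else:  # io.SEEK_END
            destination = self._size + offset

        # The guard: a seek to the current position is a no-op, so keep the
        # buffer and avoid issuing a fresh request against S3.
        if destination == self._position:
            return self._position

        self._position = destination
        self._buffer = b''       # dropping the buffer forces a new remote request
        return self._position

With a check like this in place, ZipFile's redundant seek() calls no longer invalidate the read buffer, which is what was triggering a new S3 request per read.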

@lociko

lociko commented Jan 10, 2023

Any update here?

@mpenkov
Collaborator

mpenkov commented Jan 10, 2023

The PR is stuck because some of the tests are not passing. We are awaiting feedback from the author.

@rustyconover
Contributor

rustyconover commented Jan 11, 2023 via email

@JohnHBrock
Contributor

I think this can be closed now that #782 is merged.

@mpenkov mpenkov closed this as completed Feb 8, 2024