slow .zip uncompression between s3 #742

Closed
tooptoop4 opened this issue Nov 23, 2022 · 6 comments

@tooptoop4
Contributor

tooptoop4 commented Nov 23, 2022

This code is writing at around 275k records per minute. Any idea how to speed it up?
My file is 6 GB compressed and 180 GB uncompressed, so it won't fit on local disk or in memory. I notice only about 200 MB of memory is being consumed.

import zipfile

import smart_open

# Larger multipart upload parts and read buffer for S3.
smart_open.s3.DEFAULT_MIN_PART_SIZE = 250 * 1024**2
smart_open.s3.DEFAULT_BUFFER_SIZE = 50 * 1024**2

# Stream the archive straight from S3; it is too large for local disk or memory.
with smart_open.open('s3://redacted/biz.csv.zip', mode='rb') as fin:
    with zipfile.ZipFile(fin, mode='r') as zipf:
        for subfile in zipf.namelist():
            with smart_open.open('s3://redacted/biz.csv', mode='wb') as fout:
                printcounter = 0
                with zipf.open(subfile) as member:
                    for line in member:
                        printcounter += 1
                        fout.write(line)
                        if printcounter % 500000 == 0:
                            print(str(printcounter) + ' rows')
@mpenkov
Collaborator

mpenkov commented Nov 23, 2022

Can't see anything obviously wrong with your source.

Try to work out where the bottleneck is first. Is it:

  1. Reading from S3
  2. Writing to S3
  3. Decompressing the zip file
  4. Something else?

You can reduce the effects of the first two by running on EC2. For 3, there's probably not much you can do.

You can also consider buying a larger hard drive and working on this locally; 180 GB is well within range even for a laptop these days ;)
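
One way to separate points 1–3 above is to time each stage on its own. A rough sketch, reusing the source key from the snippet above and discarding all output so that only read and decompression throughput are measured:

import time
import zipfile

import smart_open


def time_raw_read(url, chunk_size=50 * 1024**2):
    # Measures raw S3 download throughput, with no decompression at all.
    start, total = time.perf_counter(), 0
    with smart_open.open(url, 'rb') as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    print('raw read: %.1f MiB/s' % (total / 1024**2 / elapsed))


def time_read_and_unzip(url, chunk_size=50 * 1024**2):
    # Measures download plus zip decompression, still writing nothing back to S3.
    start, total = time.perf_counter(), 0
    with smart_open.open(url, 'rb') as fin, zipfile.ZipFile(fin) as zipf:
        for name in zipf.namelist():
            with zipf.open(name) as member:
                while True:
                    chunk = member.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    elapsed = time.perf_counter() - start
    print('read + unzip: %.1f MiB/s uncompressed' % (total / 1024**2 / elapsed))


time_raw_read('s3://redacted/biz.csv.zip')
time_read_and_unzip('s3://redacted/biz.csv.zip')

If either of these alone is close to the overall rate you observed, that stage is the bottleneck; if both are fast, the slowdown is on the S3 upload side.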

@rustyconover
Contributor

I know why this is slow.

The ZipFile module calls seek() on every read, even when the position it wants to seek to is the current position.

When smart_open handles a seek() call, it explicitly drops the current data buffer and sends a new request to S3, even if the destination position matches the current position.

Performance can be improved by changing smart_open's seek() implementation to compare the destination position against the current position and skip the request when they match.

I'll open a PR to improve this.
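
In other words, the change amounts to an early-return guard in the reader's seek path. A minimal illustrative sketch of the idea (not smart_open's actual internals; the class and attribute names here are made up):

import io


class BufferedRemoteReader(io.RawIOBase):
    # Illustrative only: a reader that tracks a position and a local buffer
    # over a remote object, the way an S3-backed file object does.

    def __init__(self, size):
        self._size = size        # total object size in bytes
        self._position = 0       # current absolute position in the object
        self._buffer = b''       # locally buffered, not-yet-consumed bytes

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            destination = offset
        elif whence == io.SEEK_CUR:
            destination = self._position + offset
        else:  # io.SEEK_END
            destination = self._size + offset

        # The guard: a seek to the current position is a no-op, so keep the
        # buffer and avoid issuing a fresh request against S3.
        if destination == self._position:
            return self._position

        self._position = destination
        self._buffer = b''       # dropping the buffer forces a new remote request
        return self._position

With a check like this in place, ZipFile's redundant seek() calls no longer invalidate the read buffer, which is what was triggering a new S3 request per read.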

@lociko

lociko commented Jan 10, 2023

Any update here?

@mpenkov
Collaborator

mpenkov commented Jan 10, 2023

The PR is stuck because some of the tests are not passing. We are awaiting feedback from the author.

@rustyconover
Contributor

rustyconover commented Jan 11, 2023 via email

@JohnHBrock
Contributor

I think this can be closed now that #782 is merged.

@mpenkov mpenkov closed this as completed Feb 8, 2024