File-like object returned from open_url is extremely slow with S3 #241

Closed
belltailjp opened this issue Dec 21, 2021 · 0 comments · Fixed by #247

belltailjp commented Dec 21, 2021

open_url returns a file-like object, which is handy because it can be passed directly to libraries that accept file-like objects, such as PIL.Image.open, np.load, and pickle.load.
However, I noticed that it is much slower than expected.
Reading the whole content with read() and wrapping the resulting bytes in a BytesIO yields much better performance.

I guess this future work comment (lack of buffered IO) has something to do with it.
https://github.com/pfnet/pfio/blob/master/pfio/v2/s3.py#L360-L361
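
To illustrate why missing buffering could matter, here is a minimal sketch that does not use pfio at all. `CountingRaw` is a hypothetical stand-in for a remote object in which every `readinto()` call models one round trip; whether the pfio S3 file object behaves this way, or can be wrapped in `io.BufferedReader` as-is, is an assumption and not something confirmed here.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream: each readinto() call models one remote request."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        self.calls += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def read_in_small_pieces(f, piece=16):
    """Mimic a parser that issues many tiny reads; return total bytes read."""
    total = 0
    while True:
        chunk = f.read(piece)
        if not chunk:
            return total
        total += len(chunk)


payload = bytes(64 * 1024)

# Unbuffered: every small read() reaches the raw stream.
raw = CountingRaw(payload)
unbuffered_total = read_in_small_pieces(raw)
unbuffered_calls = raw.calls

# Buffered: io.BufferedReader fetches 16 KiB at a time and serves small reads from memory.
raw2 = CountingRaw(payload)
buffered = io.BufferedReader(raw2, buffer_size=16 * 1024)
buffered_total = read_in_small_pieces(buffered)
buffered_calls = raw2.calls

print(unbuffered_calls, buffered_calls)
```

Both runs read the same bytes, but the buffered one reaches the raw stream far less often. If each raw read costs a network round trip, that difference alone could explain a large slowdown.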

Situation

I have two test files on S3 (actually Ozone) storage:

  • DSC07917.jpg: a 1.1MB jpeg image
  • random.npy: approx 128MB numpy random array (made by np.save('random.npy', np.random.random((4096, 4096))))

Passing the file-like object directly to these libraries:

>>> %%time
>>> with pfio.v2.open_url('s3://<bucket>/DSC07917.jpg', 'rb') as f:
...     print(PIL.Image.open(f).size)
(2626, 1776)
CPU times: user 642 ms, sys: 56.5 ms, total: 698 ms
Wall time: 2.22 s    # <--- only 0.5MB/s

>>> %%time
>>> with pfio.v2.open_url('s3://<bucket>/random.npy', 'rb') as f:
...     print(np.load(f).shape)
(4096, 4096)
CPU times: user 9.1 s, sys: 1.08 s, total: 10.2 s
Wall time: 35.5 s    # <--- only 3.6MB/s

Loading the entire content into memory first and then wrapping it in a BytesIO:

>>> %%time
>>> with pfio.v2.open_url('s3://<bucket>/DSC07917.jpg', 'rb') as f:
...     print(PIL.Image.open(io.BytesIO(f.read())).size)
(2626, 1776)
CPU times: user 46.9 ms, sys: 0 ns, total: 46.9 ms
Wall time: 108 ms

>>> %%time
>>> with pfio.v2.open_url('s3://<bucket>/random.npy', 'rb') as f:
...     print(np.load(io.BytesIO(f.read())).shape)
(4096, 4096)
CPU times: user 326 ms, sys: 446 ms, total: 771 ms
Wall time: 2.43 s

I observed the same situation with pickle, too.
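
For the pickle case, the same workaround can be sketched as a small helper. `load_pickle_via_memory` is a hypothetical name, and an in-memory `io.BytesIO` stands in here for the object returned by `pfio.v2.open_url(..., 'rb')`, so this only shows the shape of the workaround and not its timing.

```python
import io
import pickle


def load_pickle_via_memory(f):
    """Read the whole stream in one call, then unpickle from memory."""
    return pickle.load(io.BytesIO(f.read()))


# Stand-in for a file-like object opened from remote storage (assumption).
source = io.BytesIO(pickle.dumps({"weights": [1, 2, 3]}))
restored = load_pickle_via_memory(source)
print(restored)
```

The trade-off is that the whole object is held in memory at once, which is acceptable for files of the sizes shown above but may not be for much larger ones.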

This happens with neither the local filesystem nor HDFS.

@kuenishi kuenishi added this to the 2.1.0 milestone Dec 22, 2021
@kuenishi kuenishi added the cat:performance Performance in terms of speed or memory consumption. label Dec 27, 2021