Buffered S3 reader #247
Conversation
Thank you for the very thoughtful suggestion with real benchmarks. I welcome the idea; it is what I wanted to do, as indicated in the TODO comment. My suggestion would be to add
Thus the default value of 32MB would be nice, but I want to leave a means for app developers to configure the buffer size.
/test
Successfully created a job for commit 52d5cf5:
/test
Successfully created a job for commit b322a63:
Although the results would heavily depend on the S3 configuration, network conditions, and the content, I conducted a small toy single-stream data read throughput benchmark with varying file sizes and buffer sizes on our internal Ozone cluster.

Benchmark code:

```python
import os
import pickle
import random
import string
import time

import numpy as np

import pfio


def rand_str(n):
    return ''.join(random.choices(string.ascii_letters + string.digits, k=n))


def bench(path, n_loop=5):
    print("| File size (MiB) | buffer size (MiB) | time (s) | stddev (s) | throughput (MiB/s) |")
    print("|:----|:----|:----|:----|:----|")
    for k in range(-2, 13, 2):
        # Approx. 2^k MB of pickle data.
        size = int(1024 * (2 ** k))
        data = {rand_str(32): rand_str(1024) for _ in range(size)}
        with pfio.v2.open_url(path, 'wb') as f:
            pickle.dump(data, f)

        with pfio.v2.from_url('s3://<bucket>/') as fs:
            actual_size = fs.stat(os.path.basename(path)).size

        for s in range(11):
            buffer_size = int(1024 * 1024 * (2 ** s))
            times = []
            for _ in range(n_loop):
                with pfio.v2.open_url(path, 'rb', buffering=buffer_size) as f:
                    before = time.time()
                    loaded_data = pickle.load(f)
                    assert list(loaded_data.keys())[0] == list(data.keys())[0]
                    after = time.time()
                times.append(after - before)
            times = np.array(times)
            throughput = (actual_size / times.mean()) / (1024 ** 2)
            print('| {:.2f} | {:.0f} | {:.3f} | {:.3f} | {:.3f} |'
                  .format(actual_size / (1024 ** 2), buffer_size / (1024 ** 2),
                          times.mean(), times.std(), throughput))


bench("s3://<bucket>/bench.pkl")
```

The full output of the script.
When reading relatively small files (~1MiB), a relatively small buffer performs best. As a conclusion, I'd say a buffer size of around 16MiB to 64MiB is the best in my experimental setup.
/test
Successfully created a job for commit 6405725:
The code is awesome and looks very good... but please fix the CI failure of the annoying isort check.
/test
Successfully created a job for commit af6f2f5:
TODO: if the benchmark result with 16MB isn't as good as expected, we may change the default size later. cc: @belltailjp
This tries to fix #241.

What is done in this PR

The main purpose is to provide buffering in the S3 reader.
As noted in the TODO comment here, `_ObjectReader` did not provide enough of the raw-stream interface to be used with `io.BufferedReader`. In this PR I implemented the `readall` and `readinto` methods, which `BufferedReader` calls, in `_ObjectReader`, and made it a subclass of `io.RawIOBase`. This is basically the same as how a normal filesystem is handled in Python: `open("xxxx", "rb")` returns a `BufferedReader` whose underlying raw stream is a `FileIO` object (an unbuffered raw stream for local files).
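To illustrate the pattern, here is a self-contained sketch using only the standard library; `DummyRawReader` is a hypothetical class, not pfio's actual `_ObjectReader`. A subclass of `io.RawIOBase` that implements `readinto` can be wrapped directly by `io.BufferedReader`, which then serves small reads from its buffer.

```python
import io


class DummyRawReader(io.RawIOBase):
    """Hypothetical raw stream; imagine each readinto() call hitting S3."""

    def __init__(self, payload: bytes):
        self._payload = payload
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Copy the next chunk into the caller-supplied buffer and return
        # the number of bytes written; 0 signals EOF.
        chunk = self._payload[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


raw = DummyRawReader(b"hello " * 1000)
buffered = io.BufferedReader(raw, buffer_size=4096)
print(buffered.read(11))  # b'hello hello', served from the 4 KiB buffer
```

`io.RawIOBase` also supplies a default `readall` built on top of the raw reads, which is why implementing `readinto` is the key piece.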
There are also several other changes included.

Added a `**_` trap to each filesystem. `S3` accepts `"buffering"` and some other options in `open_url` and `from_url`; similarly, other filesystems may have their own specific arguments. Passing an option meant for one filesystem type to another would raise an "unexpected keyword argument" error, which makes `open_url` or `from_url` calls filesystem-dependent and hurts pfio's transparency. By passing `**_` to `__init__` of each filesystem, unsupported arguments can simply be ignored.
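A minimal sketch of this keyword-argument trap, using hypothetical filesystem classes rather than pfio's real ones:

```python
class Local:
    def __init__(self, cwd=None, **_):
        # **_ silently swallows options meant for other filesystems,
        # e.g. buffering, which only the S3 backend understands.
        self.cwd = cwd


class S3:
    def __init__(self, bucket=None, buffering=-1, **_):
        self.bucket = bucket
        self.buffering = buffering


# The same keyword arguments can be forwarded to either backend without
# raising "got an unexpected keyword argument".
fs_local = Local(cwd="/tmp", buffering=0)
fs_s3 = S3(bucket="example-bucket", buffering=32 * 1024 * 1024)
```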
Usage

Buffering can be controlled by the `buffering` option in `__init__` (suggested by #247 (comment)):

- `buffering=0`: no buffering
- `buffering=-1` (default): use the default buffer size of 32 MiB
- `buffering>0`: use the specified value (an integer) as the buffer size

The `buffering` option can be passed to `from_url` and `open_url`. It has no effect when opening a file in text mode or write mode.
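For illustration, a minimal usage sketch; the bucket name and the object key `data.pkl` are placeholders, and the calls follow the `pfio.v2` API used in the benchmark above:

```python
import pfio

# Open a single S3 object with an explicit 16 MiB read buffer
# (any positive integer is interpreted as the buffer size in bytes).
with pfio.v2.open_url('s3://<bucket>/data.pkl', 'rb',
                      buffering=16 * 1024 * 1024) as f:
    payload = f.read()

# buffering can also be given to from_url; 0 disables buffering entirely,
# so every read goes straight to S3.
with pfio.v2.from_url('s3://<bucket>/', buffering=0) as fs:
    with fs.open('data.pkl', 'rb') as f:
        header = f.read(128)
```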
Performance
I confirmed a significant speedup when loading the same test data (a PIL image) as in #241.

- pfio 2.0.1 (brought from #241)
- pfio 2.0.1 with BytesIO (brought from #241 as well)
- This PR

Thanks to the buffering, it now performs at a speed very similar to the `BytesIO` workaround.

Discussion
I set a buffer size of 32 MiB in `pfio.v2.s3.DEFAULT_BUFFER_SIZE`. When handling local (POSIX) files, Python uses 8192 bytes as the default buffer size ([`io.DEFAULT_BUFFER_SIZE`](https://docs.python.org/3/library/io.html#io.DEFAULT_BUFFER_SIZE)). However, for S3 the HTTP overhead is far larger than that of a native filesystem, so the buffer size should be much larger for efficiency. I thought 32 MiB would be a good balance, but maybe it's better to do some sort of micro-benchmark. If you can suggest specific values or a way to identify the best number, please advise 🙇.
Another change

`_ObjectReader.readall` was implemented in a way that first seeks back to the head of the object, which is inconsistent with how normal filesystems behave. This behavior prevented the buffering from working properly, so this PR fixes it too.
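To make the inconsistency concrete, this is the behavior a normal local file gives (plain standard-library code, not pfio): `readall` continues from the current offset rather than rewinding to the head.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"0123456789")

# buffering=0 gives an unbuffered FileIO, i.e. a raw stream like _ObjectReader.
with open(path, "rb", buffering=0) as f:
    f.seek(4)
    assert f.readall() == b"456789"  # only the remaining bytes, no rewind
```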
References
The official Python documentation provides far too little information here. These resources helped me a lot in learning how and what to implement in `readinto`.