[Datasets] Add benchmark for many file parquet reads #33222
Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
# Test reading many small files.
total_rows = 1024
for num_files in [10000, 20000, 50000]:
Can we pre-generate 50k files instead and put them on S3? That'll test the S3 interaction/metadata fetch etc.
Signed-off-by: Scott Lee <sjl@anyscale.com>
benchmark.run(
    test_name,
    read_parquet,
    root=many_files_dir,
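The `Benchmark` harness above comes from Ray's release-test utilities; a simplified stand-in (the class below is a hypothetical sketch, not Ray's actual implementation) shows the calling convention: run a callable with keyword arguments and record its wall-clock time under a test name.

```python
import time


class Benchmark:
    """Hypothetical stand-in for the release-test Benchmark helper:
    runs a callable and records wall-clock time under a test name."""

    def __init__(self):
        self.results = {}

    def run(self, test_name, fn, **fn_kwargs):
        start = time.perf_counter()
        fn(**fn_kwargs)
        self.results[test_name] = time.perf_counter() - start


def read_parquet(root):
    # Placeholder for the real parquet read; sleeps briefly to simulate work.
    time.sleep(0.01)


benchmark = Benchmark()
benchmark.run("read-parquet-many-files", read_parquet, root="/tmp/many_files")
print(benchmark.results)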
How was the performance (runtime) looking? Can you also launch a run for a commit without the fix #33117, to see if there's any performance difference? Thanks.
I haven't been able to complete a run with 50K files, due to some network timeout issues while reading from S3. I ran with 5K files:
- runtime with fix: 335.63s
- runtime without fix: 349.11s
The delta is not super large for 5K files, but I expect it would be roughly proportional.
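For reference, the relative delta of the 5K-file run works out to roughly 4%:

```python
# Runtimes from the 5K-file run above.
with_fix_s = 335.63
without_fix_s = 349.11

# Relative improvement of the fix over the baseline.
improvement = (without_fix_s - with_fix_s) / without_fix_s
print(f"{improvement:.1%} faster with the fix")  # about 3.9%
```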
The improvement looks pretty marginal here; worth digging into separately (non-blocking).
@scottjlee - can you create a github issue for investigating the perf of DefaultFileMetaProvider? It can wait for next week when Clark is back.
Signed-off-by: Scott Lee <sjl@anyscale.com>
@@ -84,6 +84,30 @@ def run_read_parquet_benchmark(benchmark: Benchmark):
    for dir in data_dirs:
        shutil.rmtree(dir)

    # Test reading many small files.
    # TODO: Once performance is further improved, increase to 50K files.
When running the benchmark with 50K files, it does not complete due to the following timeout error (many such errors, but this is one example):
(_execute_read_task_split pid=55994) 2023-03-14 12:01:35,873 INFO worker.py:774 -- Task failed with retryable exception: TaskID(07d2bb113abb5e02ffffffffffffffffffffffff03000000).
(_execute_read_task_split pid=55994) Traceback (most recent call last):
(_execute_read_task_split pid=55994) File "python/ray/_raylet.pyx", line 642, in ray._raylet.execute_dynamic_generator_and_store_task_outputs
(_execute_read_task_split pid=55994) File "python/ray/_raylet.pyx", line 2506, in ray._raylet.CoreWorker.store_task_outputs
(_execute_read_task_split pid=55994) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/lazy_block_list.py", line 689, in _execute_read_task_split
(_execute_read_task_split pid=55994) for block in blocks:
(_execute_read_task_split pid=55994) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/datasource.py", line 215, in __call__
(_execute_read_task_split pid=55994) for block in result:
(_execute_read_task_split pid=55994) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py", line 392, in _read_pieces
(_execute_read_task_split pid=55994) for batch in batches:
(_execute_read_task_split pid=55994) File "pyarrow/_dataset.pyx", line 2783, in _iterator
(_execute_read_task_split pid=55994) File "pyarrow/_dataset.pyx", line 2342, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_execute_read_task_split pid=55994) File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_execute_read_task_split pid=55994) File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
(_execute_read_task_split pid=55994) OSError: When reading information for key 'read-many-parquet-files/input_data_5415.parquet.snappy' in bucket 'air-example-data-2': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument
We should further investigate improving this performance in a separate PR.
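Since the traceback above is explicitly a retryable network error, one generic mitigation is retry with exponential backoff around the flaky S3 call. The sketch below is a minimal illustration of that pattern, not Ray's actual retry logic; the simulated `flaky_head_object` helper is hypothetical.

```python
import random
import time


def with_retries(fn, max_attempts=5, base_delay=0.1, retryable=(OSError,)):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # Back off exponentially, with a little jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Simulate an S3 metadata fetch that fails twice before succeeding.
calls = {"n": 0}


def flaky_head_object():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("AWS Error NETWORK_CONNECTION during HeadObject operation")
    return "ok"


result = with_retries(flaky_head_object, base_delay=0.001)
print(result)
```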
Can you try this on e.g. a 10-node cluster? If that works, then we can set the number of files to a smaller value here.
On a 10-node cluster, I ran with 50K files:
- with fix ([Datasets] Improve performance of DefaultFileMetaProvider. #33117): 350.86s (7.2% decrease)
- without fix: 378.13s
Also, although this run succeeds, it still frequently runs into the same timeout errors as above.
Signed-off-by: Scott Lee <sjl@anyscale.com>
LG, cc @ericl if this can be merged.
See ray-project#33116 [Workspace](https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_lzau8t9s749ytcqmvjdgxx9k/ses_fnbvr4gp3paaidjpsmyhmbtx?command-history-section=head_start_up_log) configured to run this benchmark Signed-off-by: Jack He <jackhe2345@gmail.com>
Why are these changes needed?
See #33116
Workspace configured to run this benchmark
Related issue number
Closes #33116
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.