
[Datasets] Add benchmark for many file parquet reads #33222

Merged: 9 commits merged into ray-project:master on Mar 15, 2023

Conversation

@scottjlee (Contributor) commented Mar 11, 2023

Why are these changes needed?

See #33116
Workspace configured to run this benchmark

Related issue number

Closes #33116

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 2 commits March 10, 2023 17:27
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee marked this pull request as ready for review on March 11, 2023 03:25

# Test reading many small files.
total_rows = 1024
for num_files in [10000, 20000, 50000]:
Contributor

Can we pre-generate 50k files instead and put them on S3? That'll test the S3 interaction/metadata fetch etc.
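
A minimal sketch of what that pre-generation could look like, assuming pyarrow is used directly. The bucket and key pattern mirror the S3 paths that show up in the timeout traceback later in this thread; the table contents, region, and file count are illustrative assumptions, not the script actually used:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Assumption: region of the benchmark bucket.
s3 = fs.S3FileSystem(region="us-west-2")

# Placeholder contents; the diff above sets total_rows = 1024.
table = pa.table({"value": list(range(1024))})

for i in range(50_000):
    key = f"air-example-data-2/read-many-parquet-files/input_data_{i}.parquet.snappy"
    # write_table accepts a pyarrow filesystem, so each file is
    # written directly to S3 rather than to local disk.
    pq.write_table(table, key, filesystem=s3, compression="snappy")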

Signed-off-by: Scott Lee <sjl@anyscale.com>
Comment on lines +105 to +108
benchmark.run(
test_name,
read_parquet,
root=many_files_dir,
Contributor

How was the performance (runtime) looking? Can you also launch a run for a commit without the fix #33117, to see if there's any performance difference? Thanks.

Contributor Author

I haven't been able to complete a run with 50K files due to network timeout issues while reading from S3. I ran with 5K files instead:

  • runtime with fix: 335.63 s
  • runtime without fix: 349.11 s

The delta (~13.5 s) is not large for 5K files, but I expect it to scale roughly proportionally, i.e., on the order of 135 s at 50K files.

Contributor

The improvement looks pretty marginal here; worth digging into separately (non-blocking).

Contributor

@scottjlee - can you create a GitHub issue for investigating the performance of DefaultFileMetaProvider? It can wait for next week when Clark is back.

Scott Lee added 3 commits March 14, 2023 12:02
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@@ -84,6 +84,30 @@ def run_read_parquet_benchmark(benchmark: Benchmark):
for dir in data_dirs:
shutil.rmtree(dir)

# Test reading many small files.
# TODO: Once performance is further improved, increase to 50K files.
@scottjlee (Contributor Author) commented Mar 14, 2023

When running the benchmark with 50K files, it does not complete due to the following timeout error (many such errors, but this is one example):

(_execute_read_task_split pid=55994) 2023-03-14 12:01:35,873    INFO worker.py:774 -- Task failed with retryable exception: TaskID(07d2bb113abb5e02ffffffffffffffffffffffff03000000).
(_execute_read_task_split pid=55994) Traceback (most recent call last):
(_execute_read_task_split pid=55994)   File "python/ray/_raylet.pyx", line 642, in ray._raylet.execute_dynamic_generator_and_store_task_outputs
(_execute_read_task_split pid=55994)   File "python/ray/_raylet.pyx", line 2506, in ray._raylet.CoreWorker.store_task_outputs
(_execute_read_task_split pid=55994)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/lazy_block_list.py", line 689, in _execute_read_task_split
(_execute_read_task_split pid=55994)     for block in blocks:
(_execute_read_task_split pid=55994)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/datasource.py", line 215, in __call__
(_execute_read_task_split pid=55994)     for block in result:
(_execute_read_task_split pid=55994)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py", line 392, in _read_pieces
(_execute_read_task_split pid=55994)     for batch in batches:
(_execute_read_task_split pid=55994)   File "pyarrow/_dataset.pyx", line 2783, in _iterator
(_execute_read_task_split pid=55994)   File "pyarrow/_dataset.pyx", line 2342, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_execute_read_task_split pid=55994)   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_execute_read_task_split pid=55994)   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
(_execute_read_task_split pid=55994) OSError: When reading information for key 'read-many-parquet-files/input_data_5415.parquet.snappy' in bucket 'air-example-data-2': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument

We should further investigate improving this performance in a separate PR.
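
One mitigation worth trying in that follow-up, sketched here under the assumption that the installed pyarrow exposes retry_strategy on S3FileSystem (roughly pyarrow 8+), is to hand Ray a filesystem configured to retry transient S3 failures more aggressively. This is an illustration, not something this PR implements:

import ray
from pyarrow import fs

# Assumptions: the bucket's region, and a pyarrow version that
# provides retry_strategy / AwsStandardS3RetryStrategy.
s3 = fs.S3FileSystem(
    region="us-west-2",
    retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10),
)

# Pass the pre-configured filesystem into the read so transient
# NETWORK_CONNECTION errors like the one above get retried more times.
ds = ray.data.read_parquet(
    "s3://air-example-data-2/read-many-parquet-files/",
    filesystem=s3,
)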

Contributor

Can you try this on, e.g., a 10-node cluster? If that works, then we can set the number of files to a smaller value here.

@scottjlee (Contributor Author) commented Mar 15, 2023

On a 10-node cluster, I ran with 50K files:

Also, although this run succeeds, it still frequently runs into the same timeout errors as above.

Scott Lee added 3 commits March 14, 2023 21:02
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee (Contributor Author) commented Mar 15, 2023

The remaining Documentation test failure appears to be unrelated; it started with #33309.

@scottjlee added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Mar 15, 2023
@scottjlee requested review from jianoaix and c21 on Mar 15, 2023 07:04
@c21 (Contributor) commented Mar 15, 2023

LG, cc @ericl if this can be merged.

@ericl merged commit 69c3390 into ray-project:master on Mar 15, 2023
Labels: tests-ok

Successfully merging this pull request may close these issues:

  • [data][streaming] Add benchmark for many file reading

4 participants