
[Datasets] Add benchmark for many file parquet reads #33222

Merged: 9 commits merged into ray-project:master on Mar 15, 2023

Conversation

@scottjlee (Contributor) commented Mar 11, 2023

Why are these changes needed?

See #33116
Workspace configured to run this benchmark

Related issue number

Closes #33116

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 2 commits March 10, 2023 17:27
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee marked this pull request as ready for review on March 11, 2023 03:25

# Test reading many small files.
total_rows = 1024
for num_files in [10000, 20000, 50000]:
Contributor

Can we pre-generate 50k files instead and put them on S3? That'll test the S3 interaction/metadata fetch etc.
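
A minimal sketch of what that pre-generation could look like, assuming pyarrow is used directly. The bucket and key pattern mirror the S3 paths that show up in the timeout traceback later in this thread; the table contents, region, and file count are illustrative assumptions, not the script actually used:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Assumption: region of the benchmark bucket.
s3 = fs.S3FileSystem(region="us-west-2")

# Placeholder contents; the diff above sets total_rows = 1024.
table = pa.table({"value": list(range(1024))})

for i in range(50_000):
    key = f"air-example-data-2/read-many-parquet-files/input_data_{i}.parquet.snappy"
    # write_table accepts a pyarrow filesystem, so each file is
    # written directly to S3 rather than to local disk.
    pq.write_table(table, key, filesystem=s3, compression="snappy")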

Signed-off-by: Scott Lee <sjl@anyscale.com>
Comment on lines +105 to +108
benchmark.run(
test_name,
read_parquet,
root=many_files_dir,
Contributor

How was the performance (runtime) looking? Can you also launch a run for a commit without the fix #33117, to see if there's any performance difference? Thanks.

Contributor Author

I haven't been able to complete a run with 50K files due to network timeout issues while reading from S3. I ran with 5K files instead:

  • runtime with fix: 335.63 s
  • runtime without fix: 349.11 s

The delta (~13.5 s) is not large for 5K files, but I expect it to scale roughly proportionally, i.e., on the order of 135 s at 50K files.

Contributor

The improvement looks pretty marginal here; worth digging into separately (non-blocking).

Contributor

@scottjlee - can you create a GitHub issue for investigating the performance of DefaultFileMetaProvider? It can wait for next week when Clark is back.

Scott Lee added 3 commits March 14, 2023 12:02
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@@ -84,6 +84,30 @@ def run_read_parquet_benchmark(benchmark: Benchmark):
for dir in data_dirs:
shutil.rmtree(dir)

# Test reading many small files.
# TODO: Once performance is further improved, increase to 50K files.
@scottjlee (Contributor Author) commented Mar 14, 2023

When running the benchmark with 50K files, it does not complete due to the following timeout error (many such errors, but this is one example):

(_execute_read_task_split pid=55994) 2023-03-14 12:01:35,873    INFO worker.py:774 -- Task failed with retryable exception: TaskID(07d2bb113abb5e02ffffffffffffffffffffffff03000000).
(_execute_read_task_split pid=55994) Traceback (most recent call last):
(_execute_read_task_split pid=55994)   File "python/ray/_raylet.pyx", line 642, in ray._raylet.execute_dynamic_generator_and_store_task_outputs
(_execute_read_task_split pid=55994)   File "python/ray/_raylet.pyx", line 2506, in ray._raylet.CoreWorker.store_task_outputs
(_execute_read_task_split pid=55994)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/lazy_block_list.py", line 689, in _execute_read_task_split
(_execute_read_task_split pid=55994)     for block in blocks:
(_execute_read_task_split pid=55994)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/datasource.py", line 215, in __call__
(_execute_read_task_split pid=55994)     for block in result:
(_execute_read_task_split pid=55994)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/datasource/parquet_datasource.py", line 392, in _read_pieces
(_execute_read_task_split pid=55994)     for batch in batches:
(_execute_read_task_split pid=55994)   File "pyarrow/_dataset.pyx", line 2783, in _iterator
(_execute_read_task_split pid=55994)   File "pyarrow/_dataset.pyx", line 2342, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
(_execute_read_task_split pid=55994)   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
(_execute_read_task_split pid=55994)   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
(_execute_read_task_split pid=55994) OSError: When reading information for key 'read-many-parquet-files/input_data_5415.parquet.snappy' in bucket 'air-example-data-2': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument

We should further investigate improving this performance in a separate PR.
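
One mitigation worth trying in that follow-up, sketched here under the assumption that the installed pyarrow exposes retry_strategy on S3FileSystem (roughly pyarrow 8+), is to hand Ray a filesystem configured to retry transient S3 failures more aggressively. This is an illustration, not something this PR implements:

import ray
from pyarrow import fs

# Assumptions: the bucket's region, and a pyarrow version that
# provides retry_strategy / AwsStandardS3RetryStrategy.
s3 = fs.S3FileSystem(
    region="us-west-2",
    retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10),
)

# Pass the pre-configured filesystem into the read so transient
# NETWORK_CONNECTION errors like the one above get retried more times.
ds = ray.data.read_parquet(
    "s3://air-example-data-2/read-many-parquet-files/",
    filesystem=s3,
)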

Contributor

Can you try this on, e.g., a 10-node cluster? If that works, then we can set the number of files to a smaller value here.

@scottjlee (Contributor Author) commented Mar 15, 2023

On a 10-node cluster, I ran with 50K files:

Also, although this run succeeds, it still frequently runs into the same timeout errors as above.

Scott Lee added 3 commits March 14, 2023 21:02
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee (Contributor Author) commented Mar 15, 2023

The remaining Documentation test failure appears to be unrelated; it started with #33309.

@scottjlee added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Mar 15, 2023
@scottjlee requested review from jianoaix and c21 on Mar 15, 2023 07:04
@c21 (Contributor) commented Mar 15, 2023

LG, cc @ericl if this can be merged.

@ericl merged commit 69c3390 into ray-project:master on Mar 15, 2023
Labels: tests-ok

Successfully merging this pull request may close these issues:

  • [data][streaming] Add benchmark for many file reading

4 participants