[Datasets] Improve performance of DefaultFileMetaProvider. #33117
Conversation
Force-pushed from 4a196be to 03e5ed1.
@ericl Feedback implemented, PTAL.
A few comments on testing.
    # Always launch at least 2 parallel fetch tasks.
    max(len(uris) // desired_uris_per_task, 2),
    # Oversubscribe cluster CPU by 2x since these tasks are I/O-bound.
    round(available_cpus / num_cpus),
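For context, here is a minimal sketch of how these two bounds might combine into the final task count; `_choose_num_fetch_tasks` and the exact clamping are assumptions for illustration, not the PR's actual code:

```python
def _choose_num_fetch_tasks(
    num_uris: int,
    desired_uris_per_task: int,
    available_cpus: int,
    num_cpus: float,
) -> int:
    # Take the smaller of the two bounds from the diff above: enough tasks to
    # keep chunks near desired_uris_per_task (but at least 2), capped by a 2x
    # CPU oversubscription limit (available_cpus / num_cpus with num_cpus=0.5).
    return min(
        max(num_uris // desired_uris_per_task, 2),
        round(available_cpus / num_cpus),
    )
```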
I think we can remove line 801; as long as the tasks are long enough, having more tasks beyond that doesn't matter.
Ah good point!
Btw, this ended up causing pretty big slowdowns in the 10k-file and 40k-file cases (~2x slower in both), so the task scheduling/dispatching overhead is still non-negligible once tasks start queueing. Would you still vote for removing this?
Interesting. I'd still vote to remove it, since we don't understand why it should be faster (40k/16 = 2500 tasks, which should be less than a second of overhead).
Could it be that using 0.5 CPUs is actually slower than 1 CPU due to worker startup time? I don't think the worker pool caches more than 1 worker per CPU.
It actually ended up being the 0.5 CPUs vs. 0.25 CPUs difference, with the latter being 2x faster. Keeping num_cpus fixed and removing this CPU oversubscription bound resulted in roughly the same performance.
I wouldn't expect worker startup time to be the issue here, since we're talking about queued tasks, either short or long (e.g. 40 seconds), with CPU oversubscription in both cases; so the comparison is between remaining stalled in the queue and starting up more workers to help drain it. In either the short-task or long-task case, more workers would allow us to drain the queue faster.
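For illustration, a minimal sketch (not the PR's code) of the fractional num_cpus knob being discussed; `fetch_metadata_chunk` is a hypothetical task name:

```python
import ray

# Requesting a fractional num_cpus lets Ray schedule several of these
# I/O-bound fetch tasks per core, e.g. 4 concurrent tasks per CPU at 0.25
# versus 2 per CPU at 0.5.
@ray.remote(num_cpus=0.25)
def fetch_metadata_chunk(uris):
    # Placeholder body; the real task would fetch file metadata for `uris`.
    return [(u, len(u)) for u in uris]

# Example dispatch over pre-chunked URI lists:
# results = ray.get([fetch_metadata_chunk.remote(chunk) for chunk in chunks])
```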
LGTM, thanks @clarkzinzow!
LGTM
def _fetch_metadata_parallel(
    uris: List[Uri],
    fetch_func: Callable[[List[Uri]], List[Meta]],
The _fetch_metadata_serialization_wrapper looks like it takes a _SerializedPiece-typed arg.
Yep, and the provided uris arg is a List[_SerializedPiece], so the Uri type is consistent across these two args, which is all that we're trying to express here (fetch_func should take the same type as uris as an argument). I couldn't think of a better type name than Uri, since both file paths and Parquet fragments can be thought of as identifiers for data to be read, so Uri seemed like an OK choice.
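For reference, a sketch of the generic signature being discussed, assuming Uri and Meta are TypeVars (the PR's actual definitions and parameter list may differ); the body here processes chunks serially purely for illustration, whereas the PR dispatches them as Ray tasks:

```python
from typing import Callable, List, TypeVar

Uri = TypeVar("Uri")    # e.g. a file path or a _SerializedPiece
Meta = TypeVar("Meta")  # e.g. a file size or a (path, metadata) pair

def _fetch_metadata_parallel(
    uris: List[Uri],
    fetch_func: Callable[[List[Uri]], List[Meta]],
    desired_uris_per_task: int = 16,
) -> List[Meta]:
    # The only typing constraint being expressed: fetch_func consumes chunks
    # of the same element type as uris and returns the corresponding metadata.
    out: List[Meta] = []
    for i in range(0, len(uris), desired_uris_per_task):
        out.extend(fetch_func(uris[i : i + desired_uris_per_task]))
    return out
```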
Failures appear to be unrelated, merging!
This PR improves the performance of the DefaultFileMetaProvider. Previously, DefaultFileMetaProvider would serially expand and fetch the file size for a large list of directories and files, respectively. This PR optimizes this by parallelizing directory expansion and file size fetching over Ray tasks. Also, in the common case that all file paths share the same parent directory (or base directory, if using partitioning), we do a single ListObjectsV2 call on the directory followed by a client-side filter, which reduces a 90 second parallel file size fetch to a 0.8 second request + client-side filter.
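As an illustration of the fast path described above, here is a hedged sketch; `fetch_file_sizes` and `list_dir` are hypothetical names, and the real implementation goes through the filesystem APIs used by Ray Datasets rather than this simplified interface:

```python
import posixpath
from typing import Callable, Dict, List, Optional

def _common_parent_dir(paths: List[str]) -> Optional[str]:
    # Returns the shared parent directory if every path sits directly under it.
    parents = {posixpath.dirname(p) for p in paths}
    return parents.pop() if len(parents) == 1 else None

def fetch_file_sizes(
    paths: List[str],
    list_dir: Callable[[str], List[Dict]],
) -> Dict[str, int]:
    parent = _common_parent_dir(paths)
    if parent is not None:
        # Fast path: one directory listing (a single ListObjectsV2 request on
        # S3) followed by a client-side filter down to the requested paths.
        listing = {f["path"]: f["size"] for f in list_dir(parent)}
        return {p: listing[p] for p in paths if p in listing}
    # Slow path: fall back to parallel per-file size fetches (not shown).
    raise NotImplementedError
```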
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.