[Datasets] [Operator Fusion - 2/N] Data layer performance/bug fixes and tweaks. #32744
Conversation
@@ -55,6 +55,11 @@ def add_block(self, block: Block):
             self._builder = accessor.builder()
         self._builder.add_block(block)
 
+    def will_build_yield_copy(self) -> bool:
+        if self._builder is None:
+            return True
Maybe False by default?
@ericl we'll technically create a new (empty) block in this case, which I think we should consider to be a "copy" in the sense that the returned block doesn't point to any old data buffers (this method is returning whether building will yield a new block, not whether building will copy data). The Batcher currently uses this method to determine whether we need to copy the built block in order to ensure that no old data buffers are still being referenced, so we can respect the zero_copy_batch=False, ensure_copy=True case.
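For illustration, here is a minimal sketch of how a batching consumer could combine will_build_yield_copy() with an ensure_copy requirement, copying only when the built block would still alias old buffers. This is not Ray Data's actual Batcher; ToyBlockBuilder and next_batch are made-up names for the sketch.

```python
class ToyBlockBuilder:
    """Toy builder over list blocks; returns a single added block as-is."""

    def __init__(self):
        self._blocks = []

    def add_block(self, block):
        self._blocks.append(block)

    def will_build_yield_copy(self) -> bool:
        # Zero or multiple blocks: build() creates a brand-new list, so the
        # result references no old buffers. Exactly one block: build() returns
        # it as-is (zero-copy), so the result still aliases the caller's data.
        return len(self._blocks) != 1

    def build(self):
        if len(self._blocks) == 1:
            return self._blocks[0]  # zero-copy: aliases the added block
        out = []
        for block in self._blocks:
            out.extend(block)       # concatenation yields a fresh block
        return out


def next_batch(builder: ToyBlockBuilder, ensure_copy: bool):
    """Return the built batch, copying only when needed to honor ensure_copy."""
    needs_copy = ensure_copy and not builder.will_build_yield_copy()
    batch = builder.build()
    if needs_copy:
        batch = list(batch)  # break aliasing with any previously added block
    return batch


# Usage: with exactly one block, build() would alias the input, so an
# ensure_copy=True caller gets a defensive copy; otherwise no copy is made.
builder = ToyBlockBuilder()
builder.add_block([1, 2, 3])
assert next_batch(builder, ensure_copy=True) == [1, 2, 3]
```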
I'm a bit confused why this one improved perf, actually. Isn't the new code just refactored, doing the same equivalent sampling?
Yep, it does the equivalent sampling! IIRC most of the overhead was going through the size estimation for every row in the block, rather than a single batched operation for the block. When adding a simple block containing 10k rows, this is the difference between doing 10k func calls (with probably a lot of branch mispredictions) and a total of 10 + 10 + 10 = 30 size estimations (pickle roundtrips), vs a single func call with 10 size estimations.
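As a rough illustration of the per-row vs. batched paths being described, here is a simplified estimator sketch. The ToySizeEstimator name, sampling interval, and plain running mean are assumptions for illustration, not the real SizeEstimator logic.

```python
import pickle


class ToySizeEstimator:
    """Simplified size estimator contrasting per-row add() with add_block()."""

    def __init__(self, sample_every: int = 1000):
        self._sample_every = sample_every
        self._num_rows = 0
        self._sampled_bytes = 0
        self._num_samples = 0

    def _sample(self, row) -> None:
        # Each sample costs a pickle round trip; this is the expensive part.
        self._sampled_bytes += len(pickle.dumps(row))
        self._num_samples += 1

    def add(self, row) -> None:
        # Per-row path: one Python call per row (10k calls for a 10k-row
        # block), even though only every N-th row is actually sampled.
        if self._num_rows % self._sample_every == 0:
            self._sample(row)
        self._num_rows += 1

    def add_block(self, rows) -> None:
        # Batched path: a single call per block that samples every N-th row,
        # i.e. ~10 pickle round trips for a 10k-row block.
        for i in range(0, len(rows), self._sample_every):
            self._sample(rows[i])
        self._num_rows += len(rows)

    def size_bytes(self) -> int:
        if self._num_samples == 0:
            return 0
        return int(self._sampled_bytes / self._num_samples * self._num_rows)


# One call, ~10 samples, instead of 10k calls for the same block.
est = ToySizeEstimator()
est.add_block(list(range(10_000)))
print(est.size_bytes())
```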
This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in #32178, along with some minor changes. These fixes/changes include:
- A bug fix related to the ds.lazy() call (this isn't a perf fix, just a regular ol' bug fix): 8591ac6
- DatasetContext is generically set via the cached_remote_fn wrapper, reducing redundant code. The DatasetContext may be modified after the remote function has been cached (e.g. when reading a CSV dataset twice), so we still need to pass through the DatasetContext at task submission time (a sketch of this pattern follows the list).
- Stop using OptionContext to ignore chained assignment warnings, since this is surprisingly expensive; use a normal warnings filter instead: 8bcb07e
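To illustrate the second point, here is a hedged sketch of the "cache the wrapper once, but capture the context at submission time" pattern. ToyContext, cached_task, and read_block are illustrative stand-ins, not Ray's actual cached_remote_fn or DatasetContext APIs.

```python
import functools
from dataclasses import dataclass, replace


@dataclass
class ToyContext:
    """Stand-in for a mutable, process-wide context such as DatasetContext."""
    target_max_block_size: int = 512 * 1024 * 1024


_current_ctx = ToyContext()


@functools.lru_cache(maxsize=None)
def cached_task(fn):
    """Stand-in for a cached remote-function wrapper.

    The wrapper is created and cached once per function, but it injects the
    *current* context on every submission rather than baking in the context
    that existed when the wrapper was first cached.
    """

    def submit(*args, **kwargs):
        # Snapshot the context now: it may have been modified after the
        # wrapper was cached (e.g. between two read operations).
        ctx = replace(_current_ctx)  # shallow copy of the dataclass
        return fn(ctx, *args, **kwargs)

    return submit


def read_block(ctx: ToyContext, path: str) -> str:
    return f"read {path} with max block size {ctx.target_max_block_size}"


# The first call caches the wrapper; mutating the context afterwards still
# takes effect because the context is captured per call, not per cache entry.
print(cached_task(read_block)("a.csv"))
_current_ctx.target_max_block_size = 128 * 1024 * 1024
print(cached_task(read_block)("b.csv"))  # same cached wrapper, new context
```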
The performance optimizations produce outsized improvements on the zero-copy adapters benchmark:
- By adding a SizeEstimator.add_block() that applies the same size estimation logic (add to the weighted running mean every N rows) to blocks instead of individual rows, I saw a 10x perf improvement for a simple blocks benchmark on 1k blocks and 10k rows per block.
- By deferring size estimation in BlockBuilders until it's actually asked for by a wrapping BlockOutputBuffer (using a calculated size cursor and a running sum that's updated on each get_estimated_memory_usage() call), I'm able to get a 2x perf improvement for a Pandas batching benchmark on 1k blocks, 10k rows per block, and 2k rows per batch (and this is a 3x improvement compared to the legacy operator fusion). A sketch of this running-sum idea follows the list.
- Changing from OptionContext to a normal warnings.simplefilter 2xed the performance of a benchmark that fuses 2 consecutive MapBatches operations.
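Below is a small sketch of the size-cursor-plus-running-sum idea from the second bullet. LazySizeListBuilder and its use of sys.getsizeof are simplifications for illustration, not the actual BlockBuilder implementation.

```python
import sys


class LazySizeListBuilder:
    """Toy builder that defers per-item size estimation until it is requested.

    Instead of estimating size on every add(), it keeps a cursor into the
    items list and a running byte total, and only measures the newly added
    items when get_estimated_memory_usage() is actually called (e.g. by an
    output buffer checking whether a block is full).
    """

    def __init__(self):
        self._items = []
        self._size_cursor = 0    # index of the first item not yet measured
        self._running_bytes = 0  # running sum of measured item sizes

    def add(self, item) -> None:
        # No size estimation here: adding stays cheap.
        self._items.append(item)

    def get_estimated_memory_usage(self) -> int:
        # Catch up on everything added since the last call, then advance the
        # cursor so previously measured items are never re-measured.
        for item in self._items[self._size_cursor:]:
            self._running_bytes += sys.getsizeof(item)
        self._size_cursor = len(self._items)
        return self._running_bytes


# The measurement work is amortized over however often the caller asks,
# rather than being paid on every single add().
builder = LazySizeListBuilder()
for i in range(10_000):
    builder.add(i)
print(builder.get_estimated_memory_usage())
```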
Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.