[Datasets] [Operator Fusion - E2E Mono-PR] [DO-NOT-MERGE] Add zero-copy operator fusion. #32178
Conversation
Force-pushed from 91d0b62 to 1babbf6 (Compare).
# Build the logical plan, lower it to a physical plan, and run the
# physical optimizer, which is where operator fusion is applied.
logical_plan = LogicalPlan(map_op2)
physical_plan = planner.plan(logical_plan)
physical_plan = PhysicalOptimizer().optimize(physical_plan)
op = physical_plan.dag
Can we improve the testing/observability by emitting metrics on batch conversion operations and their overheads? Then we can just assert the expected conversions in the metrics, e.g. {"numpy_to_pandas_conversions": 5}, etc.
This could go with the other extra metrics emitted by operators.
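A minimal sketch of what such conversion metrics could look like; the counter names and the recording hook are illustrative assumptions, not Ray's actual metrics API:

```python
# Illustrative sketch only: the hook and metric names are assumptions,
# not Ray's actual metrics API.
from collections import defaultdict
from typing import DefaultDict

conversion_metrics: DefaultDict[str, int] = defaultdict(int)

def record_conversion(src_format: str, dst_format: str) -> None:
    # A batch adapter would call this whenever it converts between formats.
    conversion_metrics[f"{src_format}_to_{dst_format}_conversions"] += 1

# A test could then assert on the exact conversions performed, e.g.:
# assert conversion_metrics["numpy_to_pandas_conversions"] == 5
```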
Ah that's a great idea! 🙌 I'll look into adding that.
Should we do this?
Any microbenchmark results?
@clarkzinzow - yeah, we can start with the map_batches benchmark in the nightly test.
@c21 Just kicked off that nightly test, but then I realized that it won't use the new optimizer, since it isn't enabled by default on master. I'll try running it locally with and without the change.
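For reference, a rough sketch of the kind of local map_batches microbenchmark being discussed; the dataset size and the no-op pandas UDF are arbitrary illustrative choices, not the actual nightly-test configuration:

```python
# Rough local microbenchmark sketch; dataset size and the no-op UDF are
# arbitrary choices, not the nightly-test configuration.
import time

import ray

ds = ray.data.range(10_000_000)

start = time.perf_counter()
out = ds.map_batches(lambda df: df, batch_format="pandas")
# Consume the dataset so that lazy execution (if enabled) actually runs.
for _ in out.iter_batches():
    pass
print(f"map_batches took {time.perf_counter() - start:.2f}s")
```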
@clarkzinzow - can you add …
Force-pushed from 49c2941 to 05a4d4d (Compare).
Meta-comment: is it possible to extract an interface PR to review the high-level strategy first, followed by the others?
@@ -224,7 +228,7 @@ def schema(self) -> "pyarrow.lib.Schema":
     def to_pandas(self) -> "pandas.DataFrame":
         from ray.air.util.data_batch_conversion import _cast_tensor_columns_to_ndarrays

-        df = self._table.to_pandas()
+        df = self._table.to_pandas(use_threads=False)
Any particular reason for this? If so, can we document it? Naively, it seems like we'd want to use threads.
I didn't notice any positive or negative effect in my benchmarking after disabling multithreaded Pandas conversion, so I thought I'd set this to False, similar to how we disable multithreaded Arrow I/O, and only enable it if benchmarking showed speedups.
I'll probably set this back to the default when I split out the PRs, though.
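For context, use_threads is a real parameter of pyarrow.Table.to_pandas (it defaults to True); a quick way to compare the two settings on an arbitrary synthetic table:

```python
# Quick comparison of single- vs multi-threaded Arrow -> Pandas conversion;
# the table contents are an arbitrary synthetic example.
import time

import pyarrow as pa

table = pa.table({
    "a": list(range(1_000_000)),
    "b": [float(i) for i in range(1_000_000)],
})

for use_threads in (True, False):
    start = time.perf_counter()
    df = table.to_pandas(use_threads=use_threads)
    print(f"use_threads={use_threads}: {time.perf_counter() - start:.4f}s")
```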
@ericl As mentioned in standup and in the PR description, the plan is to decompose this PR into a stack; this PR is just the end-to-end mono-PR used to make sure that everything works and to benchmark the end state.
Force-pushed from 077ffb2 to a6d862a (Compare).
Force-pushed from a6d862a to 83fe52a (Compare).
Force-pushed from 83fe52a to b51ac54 (Compare).
…nd tweaks. (ray-project#32744) This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
This PR adds support for zero-copy operator fusion, where we no longer materialize large blocks in between fused operator transformations. This is done by changing the fused transformation signature from Iterator[Block] -> Iterator[Block] to Iterator[DataT] -> Iterator[DataT], where DataT = Union[Block, DataBatch, Row].

Most of this PR diff is adding full adapter coverage (both in functionality and in tests) for block-, batch-, and row-based transforms, where each of the transform data types needs to be converted into the others without unnecessary copies/materializations. In addition, we have to handle the adapter between transform outputs and the eventual block outputs of the physical operator, which need to be buffered and split according to our dynamic block splitting policy.
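As a hedged sketch of the idea (not the PR's actual implementation): fused transforms consume and produce iterators of their preferred data type, and thin generator-based adapters convert between types lazily, so no intermediate block is ever materialized between fused stages. The Block and Row aliases and all helper names below are stand-ins for illustration:

```python
# Hedged sketch of the zero-copy fusion idea, not the PR's actual code:
# adapters are lazy generators, so a row-based transform fused between
# block boundaries never materializes an intermediate block.
from typing import Any, Callable, Dict, Iterator

import pyarrow as pa

Block = pa.Table          # stand-in for Ray Datasets' block type
Row = Dict[str, Any]      # stand-in for the row type

def blocks_to_rows(blocks: Iterator[Block]) -> Iterator[Row]:
    # Adapter: stream rows out of each incoming block.
    for block in blocks:
        yield from block.to_pylist()

def rows_to_blocks(rows: Iterator[Row], target_rows: int = 4096) -> Iterator[Block]:
    # Adapter: buffer rows and split them into output blocks (a stand-in
    # for the dynamic block splitting policy described above).
    buf: list = []
    for row in rows:
        buf.append(row)
        if len(buf) >= target_rows:
            yield pa.Table.from_pylist(buf)
            buf = []
    if buf:
        yield pa.Table.from_pylist(buf)

def fuse_row_transform(
    row_fn: Callable[[Row], Row],
) -> Callable[[Iterator[Block]], Iterator[Block]]:
    # The fused operator's transform: Iterator[Block] in, Iterator[Block]
    # out, with the row-based UDF applied in between via lazy adapters.
    def transform(blocks: Iterator[Block]) -> Iterator[Block]:
        return rows_to_blocks(map(row_fn, blocks_to_rows(blocks)))
    return transform

# Example: double a column without materializing intermediate blocks.
transform = fuse_row_transform(lambda row: {**row, "x": row["x"] * 2})
out_blocks = list(transform(iter([pa.table({"x": [1, 2, 3]})])))
print(out_blocks[0].to_pydict())  # {'x': [2, 4, 6]}
```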
TODOs

Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.