
[Datasets] [Operator Fusion - E2E Mono-PR] [DO-NOT-MERGE] Add zero-copy operator fusion. #32178

Conversation

@clarkzinzow (Contributor) commented Feb 1, 2023

This PR adds support for zero-copy operator fusion, where we no longer materialize large blocks in between fused operator transformations. This is done by:

  1. Changing the fundamental generated transform during planning from Iterator[Block] -> Iterator[Block] to Iterator[DataT] -> Iterator[DataT], where DataT = Union[Block, DataBatch, Row].
  2. Extracting all batching, formatting, and output buffering logic out into transform adapters.
  3. Fusing all operators in a fusable linear chain at once, rather than fusing operators pairwise.
  4. Creating the appropriate adapters between input blocks and the first operator, between fused operators, and between the last operator and the output blocks.

Most of this PR diff adds full adapter coverage (both in functionality and in tests) for block-, batch-, and row-based transforms, where each of the transform data types needs to be converted into the others without unnecessary copies/materializations. In addition, we have to handle the adapter between transform outputs and the eventual block outputs of the physical operator, which need to be buffered and split according to our dynamic block splitting policy.
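For intuition, here is a minimal sketch of what fusing a linear chain of transforms with adapters between them could look like; the type aliases and the fuse helper are illustrative stand-ins, not the actual Ray Data internals:

from typing import Any, Callable, Iterator, List, Union

# Simplified stand-ins for the data types named above; the real definitions
# live inside Ray Data.
Block = Any       # e.g. a pyarrow.Table
DataBatch = Any   # e.g. a dict of NumPy arrays
Row = Any         # e.g. a plain dict
DataT = Union[Block, DataBatch, Row]

Transform = Callable[[Iterator[DataT]], Iterator[DataT]]
Adapter = Callable[[Iterator[DataT]], Iterator[DataT]]

def fuse(transforms: List[Transform], adapters: List[Adapter]) -> Transform:
    """Fuse a linear chain of transforms, threading an adapter between each
    pair so data streams through as iterators instead of being materialized
    into intermediate blocks."""
    assert len(adapters) == len(transforms) - 1

    def fused(inputs: Iterator[DataT]) -> Iterator[DataT]:
        stream = transforms[0](inputs)
        for adapter, transform in zip(adapters, transforms[1:]):
            # Each adapter lazily converts the upstream output format (block,
            # batch, or row) into whatever the downstream transform expects.
            stream = transform(adapter(stream))
        yield from stream

    return fused

Per item 4 above, the real implementation also needs an input adapter in front of the first transform, and an output adapter after the last one that buffers and splits blocks according to the dynamic block splitting policy.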

TODOs

  • Add e2e test coverage of the zero-copy semantics.
  • Add unit tests for adapters.
  • Add benchmark.
  • Add perf fixes.
  • Consolidate legacy planner path and new planner path (as much as possible).
  • Add copying and batch conversion metrics for operators, along with extended test coverage.
  • Split into stacked PRs.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow marked this pull request as ready for review February 2, 2023 03:16
@clarkzinzow force-pushed the datasets/feat/zero-copy-operator-fusion branch from 91d0b62 to 1babbf6 on February 2, 2023 03:20
# Build the logical plan rooted at the second map op, lower it to a physical
# plan, and run the physical optimizer (which applies operator fusion).
logical_plan = LogicalPlan(map_op2)
physical_plan = planner.plan(logical_plan)
physical_plan = PhysicalOptimizer().optimize(physical_plan)
op = physical_plan.dag
@ericl (Contributor) commented Feb 2, 2023

Can we improve the testing / observability by emitting metrics on batch conversion operations and overheads? Then we can just assert the expected conversion in the metrics, e.g. {"numpy_to_pandas_conversions": 5} etc.

This could go with the other extra metrics emitted by operators.
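A minimal sketch of what such conversion metrics could look like; the counter names and the record_conversion hook are hypothetical, and Ray's actual operator metrics plumbing may differ:

from collections import defaultdict
from typing import DefaultDict

# Hypothetical per-operator conversion counters.
conversion_counts: DefaultDict[str, int] = defaultdict(int)

def record_conversion(src_format: str, dst_format: str) -> None:
    """Record one batch-format conversion, e.g. numpy -> pandas."""
    conversion_counts[f"{src_format}_to_{dst_format}_conversions"] += 1

# An adapter would call this whenever it converts a batch:
record_conversion("numpy", "pandas")

# A test could then assert the expected number of conversions:
assert conversion_counts["numpy_to_pandas_conversions"] == 1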

@clarkzinzow (Contributor, Author) replied

Ah that's a great idea! 🙌 I'll look into adding that.

A contributor replied

Should we do this?

@ericl (Contributor) left a comment

Any microbenchmark results?

@clarkzinzow (Contributor, Author) commented Feb 2, 2023

> Any microbenchmark results?

@ericl I'll do some before-and-after for a few pathological cases!

@c21 Do you have the benchmark/reproduction on hand that highlighted this issue? Would love to validate this against that!

@c21 (Contributor) commented Feb 2, 2023

@clarkzinzow - yeah, we can start with the map_batches benchmark in the nightly tests.

@clarkzinzow (Contributor, Author) commented Feb 2, 2023

@c21 Just kicked off that nightly test, but I realized it won't use the new optimizer, since it isn't enabled by default on master. I'll try running it locally with and without the change.

@c21 (Contributor) commented Feb 2, 2023

@clarkzinzow - can you add ray.data.context.DatasetContext.get_current().optimizer_enabled = True inside map_batches_benchmark.py in this PR, so you can kick off the test?
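That is, something like the following at the top of the benchmark script (the flag name is taken verbatim from the comment above; the exact context API may have changed since):

import ray.data.context

# Force the benchmark to exercise the new optimizer path, which is not
# enabled by default on master.
ray.data.context.DatasetContext.get_current().optimizer_enabled = True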

@ericl added the @author-action-required label on Feb 8, 2023
@clarkzinzow force-pushed the datasets/feat/zero-copy-operator-fusion branch from 49c2941 to 05a4d4d on February 8, 2023 23:19
@ericl (Contributor) left a comment

Meta-comment: is it possible to extract an interface PR to review the high-level strategy first, followed by the others?

@@ -224,7 +228,7 @@ def schema(self) -> "pyarrow.lib.Schema":
     def to_pandas(self) -> "pandas.DataFrame":
         from ray.air.util.data_batch_conversion import _cast_tensor_columns_to_ndarrays

-        df = self._table.to_pandas()
+        df = self._table.to_pandas(use_threads=False)
A contributor commented

Any particular reason for this / document the reason for this? Naively it seems like we'd want to use threads.

@clarkzinzow (Contributor, Author) replied

I didn't notice any positive or negative effect in my benchmarking after disabling multithreaded Pandas conversion, so I set this to False, similar to how we disable multi-threaded Arrow I/O, and will only enable it if benchmarking shows speedups.

I think I'll probably set this back to the default for now when I split out the PRs, though.
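For reference, a minimal standalone illustration of the flag in question, using PyArrow directly outside of Ray:

import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# PyArrow's default: convert columns to pandas using multiple threads.
df_threaded = table.to_pandas(use_threads=True)

# Single-threaded conversion, as in the diff above; this can make sense when
# the caller already parallelizes at a coarser granularity (e.g. Ray tasks).
df_single = table.to_pandas(use_threads=False)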

@clarkzinzow (Contributor, Author) commented

@ericl As mentioned in standup and in the PR description, the plan is to decompose this PR into a stack; this PR is just the end-to-end mono-PR used to make sure everything works and to benchmark the end state.

@clarkzinzow force-pushed the datasets/feat/zero-copy-operator-fusion branch 2 times, most recently from 077ffb2 to a6d862a on February 22, 2023 20:14
@clarkzinzow changed the title from "[Datasets] [Operator Fusion - 2/2] Add zero-copy operator fusion." to "[Datasets] [Operator Fusion - E2E Mono-PR] [DO-NOT-MERGE] Add zero-copy operator fusion." on Feb 22, 2023
@clarkzinzow added the do-not-merge label on Feb 22, 2023
@clarkzinzow force-pushed the datasets/feat/zero-copy-operator-fusion branch from a6d862a to 83fe52a on February 22, 2023 21:08
@clarkzinzow force-pushed the datasets/feat/zero-copy-operator-fusion branch from 83fe52a to b51ac54 on February 22, 2023 21:10
clarkzinzow added a commit that referenced this pull request Feb 23, 2023
…nd tweaks. (#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in #32178, along with some minor changes.
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…nd tweaks. (ray-project#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…nd tweaks. (ray-project#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes.
stale bot commented Mar 24, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale label on Mar 24, 2023
stale bot removed the stale label on Mar 29, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…nd tweaks. (ray-project#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes.

Signed-off-by: elliottower <elliot@elliottower.com>
stale bot commented May 1, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale label on May 1, 2023
stale bot commented May 18, 2023

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale bot closed this on May 18, 2023
Labels

  • @author-action-required: The PR author is responsible for the next step. Remove tag to send back to the reviewer.
  • do-not-merge: Do not merge this PR!
  • stale: The issue is stale. It will be closed within 7 days unless there are further conversation.