[REP] Execution Optimizer for Ray Datasets #19

Merged

zhe-thoughts merged 5 commits into main from optimizer

Jan 13, 2023

Contributor

c21 commented Dec 15, 2022

This REP introduces (1) lazy execution, (2) an optimizer, and (3) vectorized execution over data batches, to improve the user experience and performance of Ray Datasets.


          Execution Optimizer for Ray Datasets

366e584

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 assigned ericl, stephanie-wang, clarkzinzow, jianoaix and zhe-thoughts

c21 added 2 commits

December 15, 2022 01:10


          Remove extra white spaces

c03f39d

Signed-off-by: Cheng Su <scnju13@gmail.com>


          minor tweak

333f0ff

Signed-off-by: Cheng Su <scnju13@gmail.com>

c21 added the shepherding label

ericl reviewed

reps/2022-12-15-optimizer-data.md Outdated


		Architecture after REP:

		<img width="945" alt="new-architecture" src="https://user-images.githubusercontent.com/4629931/207807703-bb65db63-41a0-41d9-8e7b-154e1a0ed565.png">

Contributor

ericl Dec 15, 2022

Since we are considering re-optimization outside the scope of this REP, can we also remove that from the diagram?

Contributor Author

c21 Jan 9, 2023

yeah, removed.

reps/2022-12-15-optimizer-data.md Outdated


		#### 3.2.1. Interfaces

		NOTE: `OneToOneOperator` used here is the same as `OneToOneOperator` in "Native pipelining support in Ray Datasets" REP.

Contributor

ericl Dec 15, 2022

This section needs to be updated, since the other REP now only proposes PhysicalOperator.

Contributor Author

c21 Jan 9, 2023

@ericl - yeah, updated. I need to think more about how to hook up `BatchedOperator.process_batches` with `PhysicalOperator.add_input/inputs_done/has_next/get_next`, but I think that is an implementation detail we can figure out later.
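For readers following the thread, one possible shape of that hookup is sketched below. It is only an illustration: `BatchedAdapter`, the `Batch` alias, and the `double` function are hypothetical names, the method names are borrowed from the comment above, and the sketch assumes the batch function keeps no state between calls. It does not depend on Ray and is not the design adopted in the REP.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

Batch = List[int]  # hypothetical stand-in for a real data batch type


class BatchedAdapter:
    """Hypothetical adapter that drives a process_batches-style callable
    through an add_input / inputs_done / has_next / get_next interface."""

    def __init__(
        self, process_batches: Callable[[Iterator[Batch]], Iterable[Batch]]
    ) -> None:
        self._process_batches = process_batches
        self._outputs: Deque[Batch] = deque()
        self._done = False

    def add_input(self, batch: Batch) -> None:
        if self._done:
            raise RuntimeError("add_input called after inputs_done")
        # Assumption: the batch function is stateless, so each incoming
        # batch can be processed on arrival and its outputs queued.
        self._outputs.extend(self._process_batches(iter([batch])))

    def inputs_done(self) -> None:
        # Nothing is buffered on the input side, so this only marks the end.
        self._done = True

    def has_next(self) -> bool:
        return bool(self._outputs)

    def get_next(self) -> Batch:
        return self._outputs.popleft()


def double(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Example batch function: doubles every value in every batch."""
    for batch in batches:
        yield [x * 2 for x in batch]


op = BatchedAdapter(double)
op.add_input([1, 2])
op.add_input([3])
op.inputs_done()

results = []
while op.has_next():
    results.append(op.get_next())
```

A stateful batch function (for example one that re-chunks rows across batch boundaries) would instead need to buffer inputs and flush in `inputs_done`, which is part of what makes the real hookup non-trivial.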

stephanie-wang requested changes

reps/2022-12-15-optimizer-data.md Outdated


		## Summary

		Build the breakthrough foundation to tackle a series of fundamental issues around Ray Data. The foundation is (1) lazy execution, (2) optimizer, and (3) vectorized execution with data batch.

Contributor

stephanie-wang Dec 15, 2022

The summary is a bit low-level right now and solution-heavy. It might be good to focus more on the problems (expensive and unnecessary materialization, current design lacks an optimizer which makes materialization impossible to elide).

Contributor Author

c21 Jan 9, 2023

Moved this under General Motivation to make it more coherent, as a top-level summary does not seem strictly needed (other REPs do not have one).

c21 mentioned this pull request

[Datasets] Enable lazy execution by default ray-project/ray#31286

Merged

13 tasks

ericl pushed a commit to ray-project/ray that referenced this pull request


          [Datasets] Enable lazy execution by default (#31286)

9cb9c0e

This PR is to enable lazy execution by default. See ray-project/enhancements#19 for motivation. The change includes:
* Change `Dataset` constructor: `Dataset.__init__(lazy: bool = True)`. Also remove `defer_execution` field, as it's no longer needed.
* `read_api.py:read_datasource()` returns a lazy `Dataset`, while still eagerly computing the first input block.
* Add `ds.fully_executed()` calls to required unit tests, to make sure they are passing.

TODO:
- [x] Fix all unit tests
- [x] #31459
- [x] #31460 
- [ ] Remove the behavior to eagerly compute first block for read
- [ ] #31417
- [ ] Update documentation
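The lazy-by-default behavior described in this commit message can be pictured with a toy model. The sketch below is not Ray's implementation: `ToyLazyDataset` and its `rows` method are hypothetical, and only the `lazy` flag and the `fully_executed()` name are borrowed from the message above. It shows that, with `lazy=True`, a transformation is only recorded until execution is explicitly requested.

```python
from typing import Any, Callable, List


class ToyLazyDataset:
    """Toy stand-in for a dataset that defers transformations (hypothetical)."""

    def __init__(self, rows: List[Any], lazy: bool = True) -> None:
        self._rows = list(rows)
        self._pending: List[Callable[[Any], Any]] = []
        self._lazy = lazy

    def map(self, fn: Callable[[Any], Any]) -> "ToyLazyDataset":
        # Record the transformation; run it now only in eager mode.
        out = ToyLazyDataset(self._rows, lazy=self._lazy)
        out._pending = self._pending + [fn]
        return out if self._lazy else out.fully_executed()

    def fully_executed(self) -> "ToyLazyDataset":
        # Apply every recorded transformation, then clear the plan.
        for fn in self._pending:
            self._rows = [fn(row) for row in self._rows]
        self._pending = []
        return self

    def rows(self) -> List[Any]:
        return list(self.fully_executed()._rows)


calls: List[int] = []


def add_one(x: int) -> int:
    calls.append(x)  # track when the function actually runs
    return x + 1


lazy_ds = ToyLazyDataset([1, 2, 3]).map(add_one)
calls_before_execution = list(calls)  # nothing has run yet

lazy_rows = lazy_ds.fully_executed().rows()
calls_after_execution = list(calls)
```

This is also why the commit adds `ds.fully_executed()` calls to unit tests: a test that relies on side effects of a transformation sees none of them until execution is forced.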


          Address all comments

2045c55

Signed-off-by: Cheng Su <scnju13@gmail.com>

stephanie-wang approved these changes

zhe-thoughts reviewed

ericl added pending-committer-vote and removed shepherding labels


          Address comment of diagrams

bce6ec3

Signed-off-by: Cheng Su <scnju13@gmail.com>

AmeerHajAli pushed a commit to ray-project/ray that referenced this pull request


          [Datasets] Enable lazy execution by default (#31286)

b0357fd

ericl added vote-approved and removed pending-committer-vote labels

zhe-thoughts merged commit 68b472b into main

c21 deleted the optimizer branch

January 13, 2023 22:26

tamohannes pushed a commit to ju2ez/ray that referenced this pull request


          [Datasets] Enable lazy execution by default (ray-project#31286)

7d66eff

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
