
[Datasets] [Operator Fusion - 2/N] Data layer performance/bug fixes and tweaks. #32744

Merged

Conversation

@clarkzinzow (Contributor) commented Feb 22, 2023

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in #32178, along with some minor changes. These fixes/changes include:

  1. Logical plan not being propagated in ds.lazy() call (this isn't a perf fix, just a regular ol' bug fix): 8591ac6
  2. DatasetContext is generically set via the cached_remote_fn wrapper, reducing redundant code. However, the DatasetContext may be modified after the remote function has been cached (e.g. when reading a CSV dataset twice), so we still need to pass the DatasetContext through at task submission time.
  3. Don't use Pandas OptionContext to ignore chained assignment warnings, since this is surprisingly expensive; use normal warnings filter instead: 8bcb07e
  4. Misc. perf improvements: reduce redundant copies during batching, defer size estimation until it's actually needed, perform simple block size estimation in batches.
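To illustrate fix 3, here is a minimal sketch (not the actual Ray code; the column name and cast are hypothetical) contrasting the two ways of silencing the chained assignment warning. Entering a Pandas option context on every cast is surprisingly expensive on a hot path, whereas a plain warnings filter is cheap:

```python
import warnings

import pandas as pd


def cast_with_option_context(df: pd.DataFrame) -> pd.DataFrame:
    # Slower approach: entering/exiting the Pandas option context on every
    # call carries nontrivial overhead.
    with pd.option_context("mode.chained_assignment", None):
        df["col"] = df["col"].astype("int64")
    return df


def cast_with_warnings_filter(df: pd.DataFrame) -> pd.DataFrame:
    # Cheaper approach: a plain warnings filter avoids the option-context
    # machinery while still suppressing the warning.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        df["col"] = df["col"].astype("int64")
    return df
```

Both functions produce the same result; only the cost of the suppression mechanism differs.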

The performance optimizations produce outsized improvements on the zero-copy adapters benchmark:

  • After adding SizeEstimator.add_block() that applies the same size estimation logic (add to the weighted running mean every N rows) to blocks instead of individual rows, I saw a 10x perf improvement for a simple blocks benchmark on 1k blocks and 10k rows per block.
  • By deferring size accumulation in the tabular BlockBuilders until it’s actually requested by a wrapping BlockOutputBuffer (using a size cursor and a running sum that’s updated on each get_estimated_memory_usage() call), I’m able to get a 2x perf improvement for a Pandas batching benchmark on 1k blocks, 10k rows per block, and 2k rows per batch (and this is a 3x improvement compared to the legacy operator fusion).
  • Changing how we disable the chained assignment warning during tensor extension casting, from the Pandas OptionContext to a normal warnings.simplefilter, doubled the performance of a benchmark that fuses 2 consecutive MapBatches operations.
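The first bullet can be sketched as follows. This is a simplified, hypothetical SizeEstimator (not the actual Ray implementation): both paths sample the pickled size of every N-th row into a running mean, but add_block() does it in a single call over the whole block instead of one call per row:

```python
import pickle


class SizeEstimator:
    """Sketch of a sampling size estimator (hypothetical simplification).

    Every N rows, the serialized size of the sampled row is folded into a
    running mean; the total size estimate is mean * row count.
    """

    SAMPLE_EVERY = 10  # N: sample 1 of every N rows.

    def __init__(self):
        self._count = 0
        self._num_samples = 0
        self._mean_row_size = 0.0

    def add(self, row) -> None:
        # Per-row path: one function call per row, sampling every N-th row.
        self._count += 1
        if self._count % self.SAMPLE_EVERY == 0:
            self._sample(row)

    def add_block(self, rows: list) -> None:
        # Batched path: a single call samples every N-th row in the block,
        # doing the same number of pickle round trips with far fewer
        # function calls.
        start = self._count
        self._count += len(rows)
        first = self.SAMPLE_EVERY - 1 - start % self.SAMPLE_EVERY
        for i in range(first, len(rows), self.SAMPLE_EVERY):
            self._sample(rows[i])

    def _sample(self, row) -> None:
        # Fold the sampled row's serialized size into the running mean.
        self._num_samples += 1
        size = len(pickle.dumps(row))
        self._mean_row_size += (size - self._mean_row_size) / self._num_samples

    def size_bytes(self) -> int:
        return int(self._mean_row_size * self._count)
```

Because both paths sample identical row indices, the estimates agree; the batched path simply eliminates the per-row call overhead.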

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -55,6 +55,11 @@ def add_block(self, block: Block):
            self._builder = accessor.builder()
        self._builder.add_block(block)

    def will_build_yield_copy(self) -> bool:
        if self._builder is None:
            return True

Maybe False by default?

@clarkzinzow (Author) commented Feb 23, 2023

@ericl we'll technically create a new (empty) block in this case, which I think we should consider to be a "copy" in the sense that the returned block doesn't point to any old data buffers (this method is returning whether building will yield a new block, not whether building will copy data). The Batcher currently uses this method to determine whether we need to copy the built block in order to ensure that no old data buffers are still being referenced, so we can respect the zero_copy_batch=False, ensure_copy=True case.
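The distinction above can be sketched with a toy builder and batcher (hypothetical names and NumPy blocks, not the actual Ray classes): will_build_yield_copy() reports whether build() will yield a block that shares no buffers with the inputs, and the batcher only pays for an extra copy when it wouldn't:

```python
import numpy as np


class DelegatingBlockBuilder:
    """Toy builder: build() is zero-copy only when exactly one block was added."""

    def __init__(self):
        self._blocks = []

    def add_block(self, block: np.ndarray) -> None:
        self._blocks.append(block)

    def will_build_yield_copy(self) -> bool:
        # No blocks added yet: build() returns a fresh (empty) block that
        # shares no buffers with any input, which counts as a "copy".
        if not self._blocks:
            return True
        # Concatenating multiple blocks copies; a single block is returned
        # as-is (zero-copy).
        return len(self._blocks) > 1

    def build(self) -> np.ndarray:
        if not self._blocks:
            return np.empty(0)
        if len(self._blocks) == 1:
            return self._blocks[0]
        return np.concatenate(self._blocks)


def next_batch(builder: DelegatingBlockBuilder, ensure_copy: bool) -> np.ndarray:
    # The batcher only performs an explicit copy when ensure_copy is
    # requested AND building would not already yield one.
    needs_copy = ensure_copy and not builder.will_build_yield_copy()
    batch = builder.build()
    if needs_copy:
        batch = batch.copy()
    return batch
```

This is how the ensure_copy=True case can be honored without ever copying twice.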

@ericl added the @author-action-required label (the PR author is responsible for the next step) Feb 22, 2023
ericl commented Feb 22, 2023

After adding SizeEstimator.add_block() that applies the same size estimation logic (add to the weighted running mean every N rows) to blocks instead of individual rows, I saw a 10x perf improvement for a simple blocks benchmark on 1k blocks and 10k rows per block.

I'm a bit confused why this one improved perf actually, isn't the new code just refactored and does the same equivalent sampling?

clarkzinzow commented Feb 23, 2023

I'm a bit confused why this one improved perf actually, isn't the new code just refactored and does the same equivalent sampling?

Yep, it does the equivalent sampling! IIRC most of the overhead came from going through the size estimation machinery for every row in the block rather than in a single batched operation per block. When adding a simple block containing 10k rows, this is the difference between doing 10k func calls (with probably a lot of branch mispredictions) and a total of 10 + 10 + 10 = 30 size estimations (pickle roundtrips), vs. a single func call with 10 size estimations.

@clarkzinzow clarkzinzow merged commit b081dd2 into ray-project:master Feb 23, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…nd tweaks. (ray-project#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Mar 22, 2023
…nd tweaks. (ray-project#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes.
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…nd tweaks. (ray-project#32744)

This PR contains some miscellaneous performance/bug fixes discovered while benchmarking the zero-copy adapters in ray-project#32178, along with some minor changes.

Signed-off-by: elliottower <elliot@elliottower.com>