PERF: get all partition widths/lengths in parallel instead of serially. #4494

mvashishtha · 2022-05-25T11:22:00Z

The reproducing script is the same as in #4493, but the solution here is different: instead of getting all lengths/widths serially, we should do so in parallel. We will need to add a new method at the physical layer for that.

This solution will not save us the cost of serializing the call queue drain result after computing length, but it will let us get all the lengths/widths at once instead of serially.
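To illustrate the difference between the two access patterns, here is a minimal sketch that does not use Modin or any execution engine. `FakePartition`, `lengths_serial`, and `lengths_parallel` are hypothetical names introduced only for this example, and a thread pool stands in for the remote workers. With Ray, the analogous change would be passing a list of object references to a single `ray.get` call rather than calling `ray.get` once per reference.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakePartition:
    """Hypothetical stand-in for a remote partition; not a Modin class."""

    def __init__(self, n_rows, delay=0.05):
        self._n_rows = n_rows
        self._delay = delay

    def length(self):
        # Simulate the round trip needed to materialize a length from a worker.
        time.sleep(self._delay)
        return self._n_rows


def lengths_serial(partitions):
    # Each call blocks until the previous one has returned,
    # so the total wait grows with the number of partitions.
    return [p.length() for p in partitions]


def lengths_parallel(partitions):
    # Issue every request up front, then collect the results in input order.
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return list(pool.map(lambda p: p.length(), partitions))


partitions = [FakePartition(n) for n in (10, 20, 30, 40)]
serial_result = lengths_serial(partitions)
parallel_result = lengths_parallel(partitions)
```

Both functions return the same list of lengths; only the waiting pattern differs. As noted above, neither version avoids the cost of serializing each result.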

noloerino · 2022-07-12T00:46:56Z

I see this is already implemented for Dask by #4420. Is adding a similar implementation for PandasOnRay sufficient, or do we need similar changes for PyarrowOnRay and OmnisciOnNative as well?

mvashishtha · 2022-07-13T00:11:10Z

@noloerino here are all the places I can find where we are getting uncached partition shapes serially. I found them by searching for width() in the Modin codebase.

PandasDataFrame._copartition for both dask and ray
PandasDataFrame._row_lengths for ray: note that dask is using its parallel implementation
PandasDataFrame._column_widths for ray: note that dask is using its parallel implementation
in the partition manager for both dask and ray
within dask virtual partitions
within ray virtual partitions

We do get lengths serially in rebalance_partitions, but that's okay because we check first that the lengths are cached.
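The "check the cache first" pattern can be sketched as follows. This is an illustration under assumptions, not the code in rebalance_partitions: `CachedPartition`, its `length_cache` attribute, and `lengths_if_cached` are hypothetical names made up for this example.

```python
class CachedPartition:
    """Hypothetical partition that may or may not know its length already."""

    def __init__(self, n_rows, cached=True):
        # None means the length has not been computed yet.
        self.length_cache = n_rows if cached else None


def lengths_if_cached(partitions):
    # Reading lengths one by one is cheap only when every length is cached,
    # so return None if any partition would need a remote computation.
    if any(p.length_cache is None for p in partitions):
        return None
    return [p.length_cache for p in partitions]


all_cached = [CachedPartition(5), CachedPartition(7)]
some_uncached = [CachedPartition(5), CachedPartition(7, cached=False)]

cached_lengths = lengths_if_cached(all_cached)
uncached_lengths = lengths_if_cached(some_uncached)
```

When every length is cached, the serial loop only reads local values, so there is nothing to parallelize.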

mvashishtha · 2022-08-10T16:50:49Z

It's hard to find a case where this optimization is useful. See my comment here: #4683 (comment)

Given that this optimization doesn't seem to give any major gains, it doesn't seem to be worth the extra code complexity in #4683. I'll close this issue for now.

mvashishtha mentioned this issue May 25, 2022

PERF: Try width and length caches before materializing all partition lengths/widths in Modin frame #4493

Closed

mvashishtha added the Performance 🚀 Performance related issues and pull requests. label May 25, 2022

mvashishtha mentioned this issue Jun 21, 2022

FEAT-#4419: Extend virtual partitioning API to pandas on Dask #4420

Merged

8 tasks

noloerino self-assigned this Jul 11, 2022

noloerino mentioned this issue Jul 18, 2022

PERF-#4494: Get partition widths/lengths in parallel instead of serially #4683

Draft

8 tasks

noloerino added a commit to noloerino/modin that referenced this issue Aug 9, 2022

PERF-modin-project#4494: Get all partition widths/lengths in parallel

cb4f35c

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

mvashishtha closed this as completed Aug 10, 2022
