[Data] Optimizing the multi column groupby #45667

wingkitlee0 · 2024-06-01T17:43:04Z

This PR improves the "get group boundaries" function in groupby.map_groups(). This is not a bottleneck so the overall runtime will likely have minimal improvement, but the function itself is optimized and the code is cleaner.

Changes:

Instead of a custom class implementation (_MultiColumnSortedKey, which I wrote a few months ago...), a numpy record array is constructed.
Changed from "np.searchsorted + while-loop" to np.where
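As a rough illustration of the two changes above (not the code in this PR), the sketch below packs already-sorted key columns into a numpy record array and finds group boundaries with np.where. The helper name `group_boundaries` and the dict-of-columns input are hypothetical, and the input is assumed to be non-empty and sorted by the key columns.

```python
import numpy as np


def group_boundaries(columns):
    """Return indices delimiting runs of equal keys.

    Hypothetical helper for illustration only, not Ray's API.
    `columns` maps column name -> 1-D array; all arrays have the same
    non-zero length and are already sorted by the key columns.
    """
    names = list(columns)
    arrays = [np.asarray(columns[name]) for name in names]
    n = len(arrays[0])
    # Pack the key columns into one record array so each row compares as a unit.
    keys = np.rec.fromarrays(arrays, names=names)
    # A boundary sits wherever a row differs from the row before it.
    changed = np.asarray(keys[1:] != keys[:-1])
    inner = np.where(changed)[0] + 1
    # Add the start of the first group and the end of the last group.
    return np.concatenate(([0], inner, [n]))


cols = {
    "a": np.array([1, 1, 2, 2, 2]),
    "b": np.array(["x", "y", "y", "y", "z"]),
}
boundaries = group_boundaries(cols)
```

Consecutive pairs in `boundaries` give the half-open row range of each group, so group i covers rows `boundaries[i]` up to but not including `boundaries[i + 1]`.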

In addition, I have cleaned some indexing and added docstrings and tests.

Why are these changes needed?

The pure numpy implementation is about 8x faster. The result below is for 100K elements; results for 10M elements are similar.

------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                    Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark[get_key_boundaries]               21.1100 (1.0)       25.5830 (1.0)       22.2432 (1.0)      1.0147 (1.0)       22.1691 (1.0)      0.8943 (1.0)           8;4  44.9575 (1.0)          30           1
test_benchmark[get_key_boundaries_original]     175.2830 (8.30)     178.6610 (6.98)     176.4963 (7.93)     1.4369 (1.42)     175.8079 (7.93)     2.5007 (2.80)          2;0   5.6658 (0.13)          6           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The custom _MultiColumnSortedKey class was introduced as a work-around. The numpy recarray should be preferred as it allows vectorization.

Time complexity / Speed up

Most of the speed-up comes from the use of recarray, which gives about a 7x speed-up.

A little more speed-up comes from using np.where, even though the time complexity goes from O(k log n) in the current implementation to O(n) in the new one, where k is the number of groups and n is the length of the array. The gain despite the higher complexity is likely due to vectorization (less branching).
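To make the complexity trade-off concrete, here is a minimal sketch of the two strategies on a single sorted 1-D key array. It is a simplification for illustration, not the original or new Ray code: the function names are hypothetical and the input is assumed to be non-empty and sorted.

```python
import numpy as np


def boundaries_searchsorted(keys):
    """O(k log n): one binary search per group, driven by a Python loop."""
    n = len(keys)
    bounds = [0]
    start = 0
    while start < n:
        # Jump to the first index past the run of values equal to keys[start].
        start = int(np.searchsorted(keys, keys[start], side="right"))
        bounds.append(start)
    return np.array(bounds)


def boundaries_where(keys):
    """O(n): one vectorized pass comparing each element with its predecessor."""
    n = len(keys)
    inner = np.where(keys[1:] != keys[:-1])[0] + 1
    return np.concatenate(([0], inner, [n]))


keys = np.array([1, 1, 3, 3, 3, 7])
loop_result = boundaries_searchsorted(keys)
vectorized_result = boundaries_where(keys)
```

Both functions return the same boundaries. The loop version does less work asymptotically when there are few groups, but each iteration pays Python-level overhead, while the np.where version touches every element inside a single numpy call.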

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Changed from a custom class implementation to a purely numpy implementation

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from fc8c604 to 2323159 Compare June 1, 2024 17:52

wingkitlee0 force-pushed the optimizing_multiple_group_key branch 6 times, most recently from 1245fc0 to 8e7f055 Compare June 20, 2024 00:04

wingkitlee0 marked this pull request as ready for review June 20, 2024 00:06

wingkitlee0 requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners June 20, 2024 00:06

anyscalesam added the P2 (Important issue, but not time-critical) and data (Ray Data-related issues) labels Sep 5, 2024

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 8e7f055 to 2df45c7 Compare September 16, 2024 15:53

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from c05572b to 47bbfe8 Compare September 29, 2024 01:24

[Data] Optimizing the multi column groupby

b6076d6

Changed from a custom class implementation to a purely numpy implementation Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from e1f328a to b6076d6 Compare September 29, 2024 01:54
