Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Optimizing the multi column groupby #45667

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wingkitlee0
Copy link
Contributor

@wingkitlee0 wingkitlee0 commented Jun 1, 2024

This PR improves the "get group boundaries" function in groupby.map_groups(). This is not a bottleneck so the overall runtime will likely have minimal improvement, but the function itself is optimized and the code is cleaner.

Changes:

  • Instead of custom class implementation (_MultiColumnSortedKey, which I wrote a few months ago...), a numpy record array is constructed.
  • Changed from "np.searchsorted + while-loop" to np.where

In addition, I have cleaned some indexing and added docstrings and tests.

Why are these changes needed?

The pure numpy implementation is about 8x faster. The result below is for 100K elements. (similar results for 10M elements)

------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                    Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark[get_key_boundaries]               21.1100 (1.0)       25.5830 (1.0)       22.2432 (1.0)      1.0147 (1.0)       22.1691 (1.0)      0.8943 (1.0)           8;4  44.9575 (1.0)          30           1
test_benchmark[get_key_boundaries_original]     175.2830 (8.30)     178.6610 (6.98)     176.4963 (7.93)     1.4369 (1.42)     175.8079 (7.93)     2.5007 (2.80)          2;0   5.6658 (0.13)          6           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The custom _MultiColumnSortedKey class was introduced as a work-around. The numpy recarray should be preferred as it allows vectorization.

Time complexity / Speed up

Most of the speed up is coming from the use of recarray. This gives about 7x speed up.

A little more speed up was achieved when using np.where, although the time complexity changed from the current O( k log n) to O(n) with the new implementation (np.where), where k is the number of groups and n is the length of the array. This is likely due to the vectorization (less branching).

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from fc8c604 to 2323159 Compare June 1, 2024 17:52
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch 6 times, most recently from 1245fc0 to 8e7f055 Compare June 20, 2024 00:04
@wingkitlee0 wingkitlee0 marked this pull request as ready for review June 20, 2024 00:06
@anyscalesam anyscalesam added P2 Important issue, but not time-critical data Ray Data-related issues labels Sep 5, 2024
Changed from a custom class implementation to a purely numpy implementation

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues P2 Important issue, but not time-critical
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants