[Data] Optimizing the multi column groupby #45667
Open
This PR improves the "get group boundaries" function in `groupby.map_groups()`. This is not a bottleneck, so the overall runtime will likely see minimal improvement, but the function itself is faster and the code is cleaner.

Changes:
- Instead of the custom key class (`_MultiColumnSortedKey`, which I wrote a few months ago...), a numpy record array is constructed.
- `np.where` is used to find the group boundaries.

In addition, I have cleaned up some indexing and added docstrings and tests.
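For readers unfamiliar with the technique, here is a minimal sketch of finding group boundaries with a record array and `np.where`. It is not the code from this PR: the function name `group_boundaries`, its signature, and the field names are hypothetical, and it assumes the rows are already sorted by the key columns.

```python
import numpy as np


def group_boundaries(columns):
    """Return the start index of every group plus the final end index.

    Hypothetical helper for illustration only. `columns` is a list of
    equal-length 1-D arrays holding the key columns, and the rows are
    assumed to be already sorted by those keys.
    """
    # Field names are arbitrary placeholders.
    names = [f"f{i}" for i in range(len(columns))]
    # Pack the key columns into one record array so that a whole
    # multi-column key can be compared in a single vectorized operation.
    keys = np.rec.fromarrays(columns, names=names)
    n = len(keys)
    if n == 0:
        # Design choice for this sketch: no rows means no boundaries.
        return np.array([], dtype=np.int64)
    # A group starts wherever a key differs from the previous row's key.
    # This is one O(n) pass over the array with no Python-level loop.
    change = np.where(keys[1:] != keys[:-1])[0] + 1
    return np.concatenate(([0], change, [n]))


a = np.array([1, 1, 1, 2, 2, 3])
b = np.array(["x", "x", "y", "y", "y", "y"])
boundaries = group_boundaries([a, b])
# Consecutive pairs of boundaries delimit one group each.
groups = list(zip(boundaries[:-1], boundaries[1:]))
```

Each `(start, end)` pair in `groups` is a half-open row range that shares the same multi-column key.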
Why are these changes needed?
The pure `numpy` implementation is about 8x faster. This result is for 100K elements; results for 10M elements are similar.

The custom `_MultiColumnSortedKey` class was introduced as a work-around. The numpy recarray should be preferred as it allows vectorization.

Time complexity / Speed up
Most of the speed-up comes from the use of the recarray, which gives about a 7x speed-up.
A little more speed-up was achieved by using `np.where`, even though the time complexity changes from the current O(k log n) to O(n) with the new implementation, where k is the number of groups and n is the length of the array. This is likely due to vectorization (less branching).

Related issue number
Checks
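- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.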