-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: speed up MultiIndex.is_monotonic by 50x #27495
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a note in Performance in 1.0 whatsnew
pandas/core/indexes/multi.py
Outdated
@@ -1359,6 +1359,9 @@ def is_monotonic_increasing(self): | |||
increasing) values. | |||
""" | |||
|
|||
if all([x.is_monotonic for x in self.levels]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on what this short-circuits. how often is this hit (as compared to the following path)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course - it appears to almost always get hit, as it seems levels get sorted as part of .codes
creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, when is it not hit?
can you simply sort if its not hit then call the same code (and remove what's below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having .levels
be monotonic gives us the property that monotonicity in code-space is monotonic in level-space (as the values in a level are unique). If we sort a level to obtain monotonicity, we'd have to re-encode that level.
One case is when MultiIndex(codes=..., levels=...)
is directly constructed and the levels are not sorted. It could be a win to re-encode here, but I'll need to do some testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, not a big deal; the reason is mainly that we can remove code :-> happy to merge this and can investigate that as a followup (or here ok too). lmk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge this for now - I created a followup issue here: #27498
pandas/core/indexes/multi.py
Outdated
@@ -1359,6 +1359,9 @@ def is_monotonic_increasing(self): | |||
increasing) values. | |||
""" | |||
|
|||
if all([x.is_monotonic for x in self.levels]): | |||
return libalgos.is_lexsorted([x.astype("int64") for x in self.codes]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you maybe be able to add copy=False here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like that nets another ~10-20% gain - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol
9b3e84d
to
b083fee
Compare
thanks @qwhelan |
The current logic for
MultiIndex.is_monotonic
relies onnp.lexsort()
onMultiIndex._values
. While the result is cached, this is slow as it triggers the creation of._values
, needs to perform anO(n log(n))
sort, as well as populate the hashmap of a transientIndex
.This PR significantly speeds up this check by directly operating on
.codes
when.levels
are individually sorted. This means we can leveragelibalgos.is_lexsorted()
which isO(n)
(but has the downside of needingint64
whenMultiIndex
compacts levels).black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff