PERF: speed up MultiIndex.is_monotonic by 50x #27495

Merged (1 commit) on Jul 23, 2019

@qwhelan qwhelan commented Jul 20, 2019

The current logic for MultiIndex.is_monotonic relies on np.lexsort() on MultiIndex._values. While the result is cached, this is slow as it triggers the creation of ._values, needs to perform an O(n log(n)) sort, as well as populate the hashmap of a transient Index.

This PR significantly speeds up this check by directly operating on .codes when .levels are individually sorted. This means we can leverage libalgos.is_lexsorted() which is O(n) (but has the downside of needing int64 when MultiIndex compacts levels).

       before           after         ratio
     [5bd57f90]       [3320dded]
-      31.9±0.9ms         627±50μs     0.02  index_cached_properties.IndexCache.time_is_monotonic_decreasing('MultiIndex')
-      29.6±0.7ms         528±80μs     0.02  index_cached_properties.IndexCache.time_is_monotonic('MultiIndex')
-        29.9±1ms         524±90μs     0.02  index_cached_properties.IndexCache.time_is_monotonic_increasing('MultiIndex')
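
To illustrate the idea behind the linear-time check, here is a minimal pure-Python sketch. It is not the Cython `libalgos.is_lexsorted` implementation; the name `is_lexsorted_codes` and the sample data are hypothetical, plain lists stand in for the code arrays, and missing-value codes (-1) are not treated specially. A single pass over adjacent rows is enough to decide whether the codes are in non-decreasing lexicographic order:

```python
from typing import Sequence


def is_lexsorted_codes(codes: Sequence[Sequence[int]]) -> bool:
    """Return True if the rows formed by zipping `codes` are in
    non-decreasing lexicographic order. Cost is O(n * nlevels)."""
    if not codes:
        return True
    n = len(codes[0])
    for i in range(1, n):
        for level in codes:
            prev, cur = level[i - 1], level[i]
            if cur > prev:
                break  # row i is strictly greater; later levels do not matter
            if cur < prev:
                return False  # row i sorts before row i - 1
            # equal on this level: the next level breaks the tie
    return True


# Hypothetical two-level codes, one inner list per level.
sorted_codes = [[0, 0, 1, 1], [0, 1, 0, 1]]
unsorted_codes = [[0, 0, 1, 1], [1, 0, 0, 1]]

print(is_lexsorted_codes(sorted_codes))
print(is_lexsorted_codes(unsorted_codes))
```

No sort is performed, which is where the saving over the `np.lexsort()` path would come from, assuming the levels themselves are already sorted.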

@jreback jreback left a comment

can you add a note in Performance in 1.0 whatsnew

@@ -1359,6 +1359,9 @@ def is_monotonic_increasing(self):
        increasing) values.
        """

        if all([x.is_monotonic for x in self.levels]):
Contributor

can you add a comment on what this short-circuits. how often is this hit (as compared to the following path)?

Contributor Author

Of course - it appears to almost always get hit, as it seems levels get sorted as part of .codes creation.

Contributor

hmm, when is it not hit?

can you simply sort if it's not hit, then call the same code (and remove what's below)?

Contributor Author

Having .levels be monotonic gives us the property that monotonicity in code-space implies monotonicity in level-space (as the values in a level are unique). If we sort a level to obtain monotonicity, we'd have to re-encode that level.

One case is when MultiIndex(codes=..., levels=...) is directly constructed and the levels are not sorted. It could be a win to re-encode here, but I'll need to do some testing.
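
A small pure-Python illustration of this point follows. The helper names (`decode`, `is_non_decreasing`, `reencode`) and the data are made up for the example, and plain lists stand in for the real `.levels` and `.codes`. With a sorted level, comparing codes gives the same answer as comparing the decoded values; with an unsorted level it does not, and sorting the level requires rewriting the codes:

```python
def decode(level, codes):
    """Map integer codes back to the values stored in the level."""
    return [level[c] for c in codes]


def is_non_decreasing(seq):
    return all(a <= b for a, b in zip(seq, seq[1:]))


def reencode(level, codes):
    """Sort the level and rewrite the codes so they decode to the same values."""
    new_level = sorted(level)
    position = {value: i for i, value in enumerate(new_level)}
    new_codes = [position[level[c]] for c in codes]
    return new_level, new_codes


codes = [0, 1, 2]
sorted_level = ["a", "b", "c"]
unsorted_level = ["c", "a", "b"]

values_sorted = decode(sorted_level, codes)      # code order matches value order
values_unsorted = decode(unsorted_level, codes)  # codes increase, values do not

new_level, new_codes = reencode(unsorted_level, codes)

print(is_non_decreasing(codes), is_non_decreasing(values_sorted))
print(is_non_decreasing(codes), is_non_decreasing(values_unsorted))
print(new_level, new_codes)
```

After re-encoding, the codes once again reflect the order of the values, so a check on the codes alone would be valid. Whether that extra pass pays off in practice is the open question raised above.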

Contributor

ok, not a big deal; the reason is mainly that we can remove code :-> happy to merge this and can investigate that as a followup (or here ok too). lmk.

Contributor Author

Let's merge this for now - I created a followup issue here: #27498

@jreback jreback added the MultiIndex and Performance (memory or execution speed performance) labels Jul 20, 2019
@@ -1359,6 +1359,9 @@ def is_monotonic_increasing(self):
        increasing) values.
        """

        if all([x.is_monotonic for x in self.levels]):
            return libalgos.is_lexsorted([x.astype("int64") for x in self.codes])
Contributor

you may be able to add copy=False here

Contributor Author

Looks like that nets another ~10-20% gain - thanks!

Contributor

lol

@jreback jreback added this to the 1.0 milestone Jul 21, 2019
@jreback jreback merged commit a39bcb5 into pandas-dev:master Jul 23, 2019
jreback commented Jul 23, 2019

thanks @qwhelan

Successfully merging this pull request may close these issues.

Join operation takes more time.