-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add sort parameter to set operations for some Indexes and adjust… #24521
ENH: Add sort parameter to set operations for some Indexes and adjust… #24521
Conversation
@@ -603,7 +603,7 @@ def _wrap_setop_result(self, other, result): | |||
raise ValueError('Passed item and index have different timezone') | |||
return self._shallow_copy(result, name=name, freq=None, tz=self.tz) | |||
|
|||
def intersection(self, other): | |||
def intersection(self, other, sort=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sort
parameter is not fully implemented for intersection
on DatetimeIndex
yet. I just put this in for now to fix some failing tests
Codecov Report
@@ Coverage Diff @@
## master #24521 +/- ##
==========================================
+ Coverage 31.88% 31.9% +0.01%
==========================================
Files 166 166
Lines 52434 52444 +10
==========================================
+ Hits 16718 16730 +12
+ Misses 35716 35714 -2
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24521 +/- ##
==========================================
- Coverage 92.38% 92.38% -0.01%
==========================================
Files 166 166
Lines 52382 52392 +10
==========================================
+ Hits 48395 48404 +9
- Misses 3987 3988 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good.
pandas/core/indexes/base.py
Outdated
try: | ||
taken = taken.sort_values() | ||
except TypeError as e: | ||
warnings.warn("%s, sort order is undefined for " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this actually get hit? I really hate all of these sort warnings. maybe can you factor them out in a helper function instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use .format here
so we can followup up on the 2 cases you mentioned at the top of the PR. pls roll them into an issue (you can expand with check boxes if more clear). |
f3b517a
to
f393803
Compare
elif how == 'outer': | ||
join_index = self.union(other, sort=sort) | ||
join_index = self.union(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Master union
does generally sort by default so I have left the default sort=True
here for now for compatibility reasons so the current behaviour of join(..., how='outer')
does not change for now and all of the tests expecting the results of this type of join to be sorted don't break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO
Updated the original issue (#24471) with some of the corner cases I mention above. I will try to address these in subsequent PRs once this PR is finished. |
@@ -4473,7 +4473,7 @@ def _reindex_axis(obj, axis, labels, other=None): | |||
|
|||
labels = ensure_index(labels.unique()) | |||
if other is not None: | |||
labels = ensure_index(other.unique()) & labels | |||
labels = ensure_index(other.unique()).intersection(labels, sort=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, on Master intersection
does not generally sort by default so I have added sort=False
here for compatibility
pandas/tests/indexes/test_base.py
Outdated
@@ -2359,12 +2359,14 @@ def test_intersection_base(self, sort): | |||
if PY3 and sort: | |||
# unorderable types | |||
warn_type = RuntimeWarning | |||
expected = Index([0, 'a', 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment Index.intersection
does not use sorting.safe_sort
which tries to sort arrays of mixed type (int and str) by ordering int before str in Python 3. I could try to change this either in this PR or in a separate one if this is getting too big already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to use sort_safe
in Index.intersection
in this PR in the next commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments. ping on green.
pass | ||
if sort: | ||
try: | ||
uniques.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might see if using safe_sort makes sense here (you would have to import here, otherwise this would e circular)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using safe_sort
here but it was causing some problems. The issue seems to be that uniques
here is a list of tuples which are then used to construct a MultiIndex
in MultiIndex.union
. However, when we use safe_sort
it turns uniques
into an np.array
and doesn't sort it correctly so we get the wrong results. I can have a closer look at trying to resolve this if you want but it might involve changing safe_sort
a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's followup up later then (new PR).
pandas/core/indexes/base.py
Outdated
if sort: | ||
try: | ||
taken = sorting.safe_sort(taken.values) | ||
if self.name != other.name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
_get_reconciled_name_object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using get_reconciled_name_object
but from the docstring it says that it's used when the "result of the set operation will be self, return self, unless the name changes, in which case make a shallow copy of self". But the result of the set operation is not self
in this case, I think.
pandas/core/indexes/base.py
Outdated
try: | ||
taken = taken.sort_values() | ||
except TypeError as e: | ||
warnings.warn("%s, sort order is undefined for " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
elif how == 'outer': | ||
join_index = self.union(other, sort=sort) | ||
join_index = self.union(other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO
pass | ||
if sort: | ||
try: | ||
uniques.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
pandas/core/indexes/base.py
Outdated
result = sorting.safe_sort(result) | ||
except TypeError as e: | ||
warnings.warn("%s, sort order is undefined for " | ||
"incomparable objects" % e, RuntimeWarning, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use .format() here
pandas/core/indexes/base.py
Outdated
try: | ||
taken = taken.sort_values() | ||
except TypeError as e: | ||
warnings.warn("%s, sort order is undefined for " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use .format here
d9e68be
to
9549661
Compare
Hello @reidy-p! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 19, 2019 at 18:22 Hours UTC |
9549661
to
7b15248
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of comments, but let's do in a followup if you can (otherwise pls make an issue).
pass | ||
if sort: | ||
try: | ||
uniques.sort() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's followup up later then (new PR).
@@ -2385,8 +2393,18 @@ def intersection(self, other): | |||
indexer = indexer[indexer != -1] | |||
|
|||
taken = other.take(indexer) | |||
|
|||
if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I still think we can use _get_reconcile_name_object
here (maybe need to wrap taken
in Index
), but can do in a followup.
thanks @reidy-p |
@jreback thanks. I hope to do a few follow-up PRs on this. |
Was it intended to change the default behavior of |
all of the set ops always sorted; the parameter now allows u to turn it off (default should be the same as before) |
This is definitely changed behavior from 0.23.x for intersection (it turned up in a failure in xarray's test suite), but I'll raise another issue about that. |
See #24959 for discussion about the behavior change. |
… tests
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR makes some progress towards adding a
sort
parameter with a default ofTrue
to the set operations (union
,intersection
,difference
andsymmetric_difference
) onIndex
classes. Adding this parameter into every type ofIndex
would result in a very big PR so I have decided to break it up and try to add it in stages. I have tried to focus onIndex
,MultiIndex
,PeriodIndex
andIntervalndex
in this PR but have made some very small changes to set operations on other indices where necessary.Some issues to consider:
Index
types. For example, because of the way some of the set operations are implemented on someIndex
types the results may always be sorted even ifsort=False
is passed (e.g., aunion
operation onDatetimeIndex
s may always return a sorted result in some cases even ifsort=False
). Perhaps this is not really a problem as long as we document it.sort
parameter at the moment. For example, theintersection
of an unsortedIndex
with itself will return the original unsortedIndex
even ifsort=True
is passed because thesort
parameter is simply ignored in this case. Similarly, theintersection
of an unsortedIndex
with an emptyIndex
will also return the original unsortedIndex
even ifsort=True
. There is similar behaviour for the other set operations.