Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Change default for Index.union sort #25007

Closed

Conversation

TomAugspurger
Copy link
Contributor

Closes #24959

Haven't done MultiIndex yet, just opening for discussion on if we should do this for 0.24.1.


.. versionadded:: 0.24.0

.. versionchanged:: 0.24.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "0.24.1"

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. We should then do the same for intersection and difference?

@TomAugspurger
Copy link
Contributor Author

We should then do the same for intersection and difference?

Not intersection. Probably difference and symmetric_difference.

This looks good to me.

Just to confirm for posterity, good for 0.24.1? If so I'll get back to it in ~4 hours to finish things off.

@jorisvandenbossche
Copy link
Member

Not intersection.

intersection also has the try/except on sorting. So although the default there is sort=False, we still might want to have "correct" behaviour for sort=True/None ?

Just to confirm for posterity, good for 0.24.1? If so I'll get back to it in ~4 hours to finish things off.

For me that would be the best option, but we might want confirmation of others.

@TomAugspurger
Copy link
Contributor Author

intersection also has the try/except on sorting.

Ah yes, I was thinking ahead to the deprecation.

But, do we think that .intersection(sort=None) is useful? Or will we eventually deprecate it anyway?


.. versionadded:: 0.24.0

.. versionchanged:: 0.24.0

Changed the default `sort` to None, matching the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this being changed? this is certainly not a regression at all. This was the default behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: no behaviour is changed. It was indeed the default, it stays the default. It's only the value that encodes the default that is changed (True -> None), so that True can mean something else (=always sort).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, maybe it should be more clear in the doc-string

@TomAugspurger
Copy link
Contributor Author

intersection also has the try/except on sorting. So although the default there is sort=False, we still might want to have "correct" behaviour for sort=True/None ?

I don't see the try/except around sorting for Index.intersection.

In [11]: a.intersection(a[::-1], sort=True)
Traceback (most recent call last):
  File "<ipython-input-11-2e1c550543d3>", line 1, in <module>
    a.intersection(a[::-1], sort=True)
  File "/Users/taugspurger/sandbox/pandas/pandas/core/indexes/base.py", line 2431, in intersection
    taken = sorting.safe_sort(taken.values)
  File "/Users/taugspurger/sandbox/pandas/pandas/core/sorting.py", line 459, in safe_sort
    ordered = sort_mixed(values)
  File "/Users/taugspurger/sandbox/pandas/pandas/core/sorting.py", line 452, in sort_mixed
    nums = np.sort(values[~str_pos])
  File "/Users/taugspurger/Envs/pandas-dev/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 934, in sort
    a.sort(axis=axis, kind=kind, order=order)
  File "pandas/_libs/tslibs/timestamps.pyx", line 258, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__
    raise TypeError('Cannot compare type %r with type %r' %
TypeError: Cannot compare type 'Timestamp' with type 'int'


In [12]: a.intersection(a[::-1], sort=False)
Out[12]: Index([1, 2000-01-01 00:00:00], dtype='object')

I do see the special casing when the indexes are equal

In [16]: a = pd.Index(['b', 'a'])

In [17]: a.intersection(a, sort=True)
Out[17]: Index(['b', 'a'], dtype='object')

@TomAugspurger
Copy link
Contributor Author

Following up on
#25007 (comment), IMO Index.intersection(sort=None) is not needed. I believe the two behaviors enabled by sort=None are

  1. No sort when idx.equals(idx).
  2. No exception when elements of self & other are not comparable.

The first role is served equally well by idx.intersection(other, sort=False)

In [5]: a = pd.Index(['b', 'a'])

In [6]: a.intersection(a, sort=False)
Out[6]: Index(['b', 'a'], dtype='object')

The second, I don't think we should be adding that behavior where it wasn't previously. And since Index.intersection didn't previously sort, we don't need to add it.

@codecov
Copy link

codecov bot commented Jan 30, 2019

Codecov Report

Merging #25007 into master will decrease coverage by 49.5%.
The diff coverage is 41.66%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25007       +/-   ##
===========================================
- Coverage   92.38%   42.88%   -49.51%     
===========================================
  Files         166      166               
  Lines       52401    52409        +8     
===========================================
- Hits        48410    22474    -25936     
- Misses       3991    29935    +25944
Flag Coverage Δ
#multiple ?
#single 42.88% <41.66%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 56.35% <41.66%> (-39.98%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ece58cb...2a2de25. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 30, 2019

Codecov Report

Merging #25007 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25007      +/-   ##
==========================================
+ Coverage   92.37%   92.38%   +<.01%     
==========================================
  Files         166      166              
  Lines       52397    52416      +19     
==========================================
+ Hits        48404    48425      +21     
+ Misses       3993     3991       -2
Flag Coverage Δ
#multiple 90.81% <100%> (ø) ⬆️
#single 42.86% <28.57%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 95.6% <100%> (+0.01%) ⬆️
pandas/core/indexes/base.py 96.34% <100%> (+0.02%) ⬆️
pandas/core/frame.py 96.92% <0%> (ø) ⬆️
pandas/core/tools/numeric.py 100% <0%> (ø) ⬆️
pandas/core/internals/concat.py 96.45% <0%> (ø) ⬆️
pandas/core/internals/blocks.py 94.26% <0%> (+0.08%) ⬆️
pandas/util/testing.py 88.13% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 149138e...b15dc7e. Read the comment docs.

@shoyer
Copy link
Member

shoyer commented Jan 30, 2019

Following up on
#25007 (comment), IMO Index.intersection(sort=None) is not needed.

Agreed!

@jorisvandenbossche jorisvandenbossche added this to the 0.24.1 milestone Jan 30, 2019
@jorisvandenbossche
Copy link
Member

I don't see the try/except around sorting for Index.intersection.

Sorry, not sure what I was looking at .. :) It does have the "directly return if equals" behaviour that needs to handle sort, but that you already did.

Fully agreed a sort=None with new behaviour is not needed for intersection !

doc/source/whatsnew/v0.24.1.rst Outdated Show resolved Hide resolved

When ``sort=True`` is provided to :meth:`Index.intersection`, the values are always sorted. In 0.24.0,
the values would not be sorted when ``self`` and ``other`` were identical. Pass ``sort=False`` to not
sort the values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make it clearer that this was new behaviour in 0.24.0, and that no behaviour changed compared to what you could do on 0.23 ?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When ``sort=True`` is provided to :meth:`Index.intersection`, the values are always sorted. In 0.24.0,
the values would not be sorted when ``self`` and ``other`` were identical. Pass ``sort=False`` to not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am -1 on this change. We do NOT do this elsewhere, e.g. .reindex, so this is extra useless sorting. (basically cases 1 and 2 above). I am not sure of the utility of 3 at all. We cannot guarantee sorting, showing a warning is fine ; this has been this way since pandas inception. I don't see any utility in changing this.

@jorisvandenbossche
Copy link
Member

We do NOT do this elsewhere, e.g. .reindex, so this is extra useless sorting.

@jreback can you clarify this comparison to reindex? Reindexing does not involve any sorting?

@jreback
Copy link
Contributor

jreback commented Jan 30, 2019

@jorisvandenbossche .reindex does not reindex if the other is identical (or empty). This is the same exact situation

@jorisvandenbossche
Copy link
Member

.reindex does not reindex if the other is identical (or empty). This is the same exact situation

Yes, but reindex has no sort keyword, and does not sort in general. So here it's a different situation IMO.
Once the default would be sort=False, then Index.union will also not do anything by default if the other is identical.

@jreback
Copy link
Contributor

jreback commented Jan 30, 2019

Yes, but reindex has no sort keyword, and does not sort in general. So here it's a different situation IMO.
Once the default would be sort=False, then Index.union will also not do anything by default if the other is identical.

not at all. this is the same. changing semantics like this is simply not warranted.

@jreback jreback removed this from the 0.24.1 milestone Jan 30, 2019
@jreback
Copy link
Contributor

jreback commented Jan 30, 2019

Changes like this need to sit in master. I am -1 on doing this for 0.24.x at all. There is no reason to change at the last minute like this.

@TomAugspurger TomAugspurger changed the title [WIP]: API: Change default for Index.union sort API: Change default for Index.union sort Jan 31, 2019
@TomAugspurger
Copy link
Contributor Author

Did the same change for symmdiff, so that sort=True means sort.

@jorisvandenbossche
Copy link
Member

Did the same change for symmdiff, so that sort=True means sort.

The same also needs to be done for difference ?

@TomAugspurger
Copy link
Contributor Author

Sorry, yes...

@jreback
Copy link
Contributor

jreback commented Jan 31, 2019

I have not wavered on this and am -1

I see no reason to not simply do the change in 0.25

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 31, 2019

I see no reason to not simply do the change in 0.25

Because then we would need to deprecate the sort=True keyword we now introduced in 0.24.0. By quickly changing it, our goal is to avoid this deprecation.

@jreback can you please be more specific on what you are -1? Which of the following things do you object:

  • the long-term goal to have sort=False the default (meaning we would eg deprecate the default in 0.25.0)
  • the long-term goal that sort=True should mean "always sorting", even when the passed data is equal or empty
  • changing the default value from True to None (to preserve True for actually always sorting)
  • adding new behaviour to correspond to sort=True

@jorisvandenbossche
Copy link
Member

(would you be able to come to gitter? that might be easier to try to come to an agreement)

@TomAugspurger
Copy link
Contributor Author

I have not wavered on this and am -1

No! From

I am ok with sort=None if the ultimate goal is a deprecation so we can reach sort=False as a default.

I thought you were OK with it 😢

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 31, 2019

I really, really, really, think we should be doing this soon.


FYI, MultiIndex.difference was different from Index.difference in two ways

  1. It short-circuited when other was empty (no sorting).
  2. It didn't do a try / except around sorting.

(haven't fixed this yet).

@jreback
Copy link
Contributor

jreback commented Jan 31, 2019

adding new behaviour to correspond to sort=True

this is the problem

@jreback
Copy link
Contributor

jreback commented Jan 31, 2019

For the record I am -1 on this change as is, but @jorisvandenbossche and @TomAugspurger are going ahead.

@TomAugspurger
Copy link
Contributor Author

Since MultiIndex.difference didn't have the silent "don't sort if not possible behavior", I haven't chosen to implement it yet.

# 0.23.4
In [25]: a = pd.MultiIndex.from_product([[1, pd.Timestamp('2000'), 0], [1, 2]])

In [26]: a.difference([(0, 1)])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-d5517c8a9130> in <module>
----> 1 a.difference([(0, 1)])

~/miniconda3/envs/pandas-0.24.0/lib/python3.7/site-packages/pandas/core/indexes/multi.py in difference(self, other, sort)
   2984         difference = this.values.take(label_diff)
   2985         if sort:
-> 2986             difference = sorted(difference)
   2987
   2988         if len(difference) == 0:

pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__()

TypeError: Cannot compare type 'Timestamp' with type 'int'

do people have thoughts on that?

@TomAugspurger
Copy link
Contributor Author

For the record I am -1 on this change as is, but @jorisvandenbossche and @TomAugspurger are going ahead.

@jreback not if you're -1. Pandas operates by consensus.

@jreback
Copy link
Contributor

jreback commented Jan 31, 2019

@TomAugspurger if I am -1 then there is clearly not consensus, as my concerns as laid out have not been addressed. This is way to many and too fast changes for a minor release.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 31, 2019

@jreback concrete proposal:

  • for 0.24.1, we do:

    • change the default value of sort=True to sort=None (of course without any change in behaviour, and also without adding new behaviour)
    • disallow sort=True (just following from only having sort=None|False as options), which should not be a problem since nobody should already rely on this option (only introduced in 0.24.0 and it is the default anyway)
  • for 0.25.0, we can then:

    • further discuss what we want for sort=True (the same behaviour as sort=None (what you want), or the "always sorting" behaviour (what Tom, Stephan and I prefer))
    • further discuss about deprecation towards sort=False as default

What do you think about this? It would allow us to go forward with releasing 0.24.1today without needing to resolve the full discussion today.

And if not OK, can you please be specific what in there does not address your concerns?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 31, 2019 via email

@jreback
Copy link
Contributor

jreback commented Jan 31, 2019

@jorisvandenbossche your proposal is fine.

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche do you have time to implement
#25007 (comment)?

@jorisvandenbossche
Copy link
Member

@TomAugspurger yes, will do that

@TomAugspurger
Copy link
Contributor Author

Opened #25151 for the rest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants