API: Change default for Index.union sort #25007

Closed
78 changes: 76 additions & 2 deletions doc/source/whatsnew/v0.24.1.rst
@@ -15,10 +15,84 @@ Whats New in 0.24.1 (February XX, 2019)
These are the changes in pandas 0.24.1. See :ref:`release` for a full changelog
including other versions of pandas.

.. _whatsnew_0241.api:

API Changes
~~~~~~~~~~~

Changing the ``sort`` parameter for :meth:`Index.union`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default ``sort`` value for :meth:`Index.union` has changed from ``True`` to ``None`` (:issue:`24959`).
The default *behavior* remains the same: the result is sorted, unless

1. ``self`` and ``other`` are identical
2. ``self`` or ``other`` is empty
3. ``self`` or ``other`` contains values that cannot be compared (a ``RuntimeWarning`` is issued).

This allows ``sort=True`` to now mean "always sort". A ``TypeError`` is raised if the values cannot be compared.
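
The three-valued ``sort`` flag described above can be modelled on plain Python lists. This is only a rough sketch for illustration: ``union_sketch`` is a hypothetical helper, not pandas code, and it assumes hashable values.

```python
import warnings


def union_sketch(left, right, sort=None):
    """Toy model of the three-valued ``sort`` flag, using plain lists.

    Hypothetical helper for illustration; this is not pandas code.
    """
    # Cases 1 and 2: equal operands or an empty operand short-circuit.
    # Only an explicit sort=True sorts here; None and False do not.
    if left == right or not right:
        result = list(left)
        return sorted(result) if sort else result
    if not left:
        result = list(right)
        return sorted(result) if sort else result

    # General case: keep first occurrences, in order of appearance.
    result = list(dict.fromkeys(list(left) + list(right)))
    if sort is None:
        try:
            result = sorted(result)
        except TypeError as err:
            # Case 3: incomparable values, so warn and leave unsorted.
            warnings.warn(
                "{}, sort order is undefined for "
                "incomparable objects".format(err),
                RuntimeWarning,
            )
    elif sort:
        # sort=True means "always sort": a TypeError propagates.
        result = sorted(result)
    return result


print(union_sketch(["b", "a"], ["b", "a"]))             # ['b', 'a']
print(union_sketch(["b", "a"], ["b", "a"], sort=True))  # ['a', 'b']
print(union_sketch(["b", "a"], ["c"]))                  # ['a', 'b', 'c']
```

With mixed values such as ``[1, "a"]``, the sketch warns under ``sort=None`` and raises ``TypeError`` under ``sort=True``, mirroring the distinction drawn in the text.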

**Behavior in 0.24.0**

.. ipython:: python

   In [1]: idx = pd.Index(['b', 'a'])

   In [2]: idx.union(idx)  # sort=True was the default.
   Out[2]: Index(['b', 'a'], dtype='object')

   In [3]: idx.union(idx, sort=True)  # result is still not sorted.
   Out[3]: Index(['b', 'a'], dtype='object')

**New Behavior**

.. ipython:: python

   idx = pd.Index(['b', 'a'])
   idx.union(idx)  # sort=None is the default. Don't sort identical operands.

   idx.union(idx, sort=True)

The same change applies to :meth:`Index.difference` and :meth:`Index.symmetric_difference`, which
would previously not sort the result when ``sort=True`` but the values could not be compared.
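
For the difference operations, the hunks in this diff swallow the ``TypeError`` silently under ``sort=None`` instead of warning. A minimal list-based sketch of that rule, assuming hashable values (``difference_sketch`` is a hypothetical name, not a pandas function):

```python
def difference_sketch(left, right, sort=None):
    """Toy model of the ``sort`` flag for a set difference on lists.

    Hypothetical helper for illustration; this is not pandas code.
    """
    exclude = set(right)
    # Unique values of `left` that are absent from `right`, in order.
    result = [x for x in dict.fromkeys(left) if x not in exclude]
    if sort is None:
        try:
            result = sorted(result)
        except TypeError:
            # Incomparable values: silently keep the unsorted result.
            pass
    elif sort:
        # sort=True: let the TypeError reach the caller.
        result = sorted(result)
    return result


print(difference_sketch([3, 1, 2], [2]))              # [1, 3]
print(difference_sketch([3, 1, 2], [2], sort=False))  # [3, 1]
```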

Changed the behavior of :meth:`Index.intersection` with ``sort=True``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When ``sort=True`` is provided to :meth:`Index.intersection`, the values are always sorted. In 0.24.0,
the values would not be sorted when ``self`` and ``other`` were identical. Pass ``sort=False`` to not
Contributor

I am -1 on this change. We do NOT do this elsewhere, e.g. .reindex, so this is extra useless sorting (basically cases 1 and 2 above). I am not sure of the utility of 3 at all. We cannot guarantee sorting, and showing a warning is fine; this has been this way since pandas' inception. I don't see any utility in changing this.

sort the values. This matches the behavior of pandas 0.23.4 and earlier.
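
As a rough illustration of the intersection rule, the sketch below models it on lists of unique, hashable values. ``intersection_sketch`` is a hypothetical helper, not the pandas implementation.

```python
def intersection_sketch(left, right, sort=False):
    """Toy model of ``sort`` for an intersection on lists.

    Hypothetical helper for illustration; this is not pandas code.
    """
    if left == right:
        # Identical operands no longer bypass an explicit sort=True.
        result = list(left)
    else:
        keep = set(right)
        # Preserve the order in which values appear in `left`.
        result = [x for x in dict.fromkeys(left) if x in keep]
    return sorted(result) if sort else result


print(intersection_sketch(["b", "a"], ["b", "a"]))             # ['b', 'a']
print(intersection_sketch(["b", "a"], ["b", "a"], sort=True))  # ['a', 'b']
```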

**Behavior in 0.23.4**

.. ipython:: python

   In [2]: idx = pd.Index(['b', 'a'])

   In [3]: idx.intersection(idx)  # sort was not a keyword.
   Out[3]: Index(['b', 'a'], dtype='object')

**Behavior in 0.24.0**

.. ipython:: python

   In [5]: idx.intersection(idx)  # sort=True by default. Don't sort identical.
   Out[5]: Index(['b', 'a'], dtype='object')

   In [6]: idx.intersection(idx, sort=True)
   Out[6]: Index(['b', 'a'], dtype='object')

**New Behavior**

.. ipython:: python

   idx.intersection(idx)  # sort=False by default
   idx.intersection(idx, sort=True)

.. _whatsnew_0241.regressions:

Fixed Regressions
^^^^^^^^^^^^^^^^^
~~~~~~~~~~~~~~~~~

- Bug in :meth:`DataFrame.itertuples` with ``records`` orient raising an ``AttributeError`` when the ``DataFrame`` contained more than 255 columns (:issue:`24939`)
- Bug in :meth:`DataFrame.itertuples` orient converting integer column names to strings prepended with an underscore (:issue:`24940`)
@@ -28,7 +102,7 @@ Fixed Regressions
.. _whatsnew_0241.enhancements:

Enhancements
^^^^^^^^^^^^
~~~~~~~~~~~~


.. _whatsnew_0241.bug_fixes:
5 changes: 4 additions & 1 deletion pandas/_libs/lib.pyx
@@ -233,11 +233,14 @@ def fast_unique_multiple(list arrays, sort: bool=True):
             if val not in table:
                 table[val] = stub
                 uniques.append(val)
-    if sort:
+    if sort is None:
         try:
             uniques.sort()
         except Exception:
+            # TODO: RuntimeWarning?
             pass
+    elif sort:
+        uniques.sort()

     return uniques

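The patched branch in ``fast_unique_multiple`` can be approximated in pure Python as follows. This is only a sketch: ``fast_unique_multiple_sketch`` is a hypothetical name, it uses a ``set`` in place of the Cython lookup table, and it catches ``TypeError`` where the Cython code catches ``Exception``.

```python
def fast_unique_multiple_sketch(arrays, sort=None):
    """Pure-Python rendering of the patched routine (illustrative only)."""
    seen = set()
    uniques = []
    for values in arrays:
        for val in values:
            if val not in seen:
                seen.add(val)
                uniques.append(val)
    if sort is None:
        try:
            uniques.sort()
        except TypeError:
            # Best effort: incomparable values leave the list unsorted.
            # Note that list.sort() may leave the list partially
            # reordered when a comparison fails part-way through.
            pass
    elif sort:
        # sort=True: any TypeError propagates to the caller.
        uniques.sort()
    return uniques


pairs = [[(2, "b"), (1, "a")], [(1, "a"), (0, "c")]]
print(fast_unique_multiple_sketch(pairs))              # [(0, 'c'), (1, 'a'), (2, 'b')]
print(fast_unique_multiple_sketch(pairs, sort=False))  # [(2, 'b'), (1, 'a'), (0, 'c')]
```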
96 changes: 79 additions & 17 deletions pandas/core/indexes/base.py
@@ -2245,18 +2245,34 @@ def _get_reconciled_name_object(self, other):
             return self._shallow_copy(name=name)
         return self

-    def union(self, other, sort=True):
+    def union(self, other, sort=None):
         """
         Form the union of two Index objects.

         Parameters
         ----------
         other : Index or array-like
-        sort : bool, default True
-            Sort the resulting index if possible
+        sort : bool or None, default None
+            Whether to sort the resulting Index.
+
+            * None : Sort the result, except when
+
+              1. `self` and `other` are equal.
+              2. `self` or `other` has length 0.
+              3. Some values in `self` or `other` cannot be compared.
+                 A RuntimeWarning is issued in this case.
+
+            * True : sort the result. A TypeError is raised when the
+              values cannot be compared.
+            * False : do not sort the result.

             .. versionadded:: 0.24.0

+            .. versionchanged:: 0.24.1
+
+               Changed the default `sort` to None, matching the
Contributor

why is this being changed? this is certainly not a regression at all. This was the default behavior.

Member

To be clear: no behaviour is changed. It was indeed the default, it stays the default. It's only the value that encodes the default that is changed (True -> None), so that True can mean something else (=always sort).

Contributor

ok, maybe it should be more clear in the doc-string

+               behavior of pandas 0.23.4 and earlier.
+
         Returns
         -------
         union : Index
@@ -2273,10 +2289,16 @@ def union(self, other, sort=True):
         other = ensure_index(other)

         if len(other) == 0 or self.equals(other):
-            return self._get_reconciled_name_object(other)
+            result = self._get_reconciled_name_object(other)
+            if sort:
+                result = result.sort_values()
+            return result

         if len(self) == 0:
-            return other._get_reconciled_name_object(self)
+            result = other._get_reconciled_name_object(self)
+            if sort:
+                result = result.sort_values()
+            return result

         # TODO: is_dtype_union_equal is a hack around
         # 1. buggy set ops with duplicates (GH #13432)
@@ -2319,13 +2341,16 @@ def union(self, other, sort=True):
             else:
                 result = lvals

-            if sort:
+            if sort is None:
                 try:
                     result = sorting.safe_sort(result)
                 except TypeError as e:
                     warnings.warn("{}, sort order is undefined for "
                                   "incomparable objects".format(e),
                                   RuntimeWarning, stacklevel=3)
+            elif sort:
+                # raise if not sortable.
+                result = sorting.safe_sort(result)

         # for subclasses
         return self._wrap_setop_result(other, result)
@@ -2342,8 +2367,12 @@ def intersection(self, other, sort=False):
         Parameters
         ----------
         other : Index or array-like
-        sort : bool, default False
-            Sort the resulting index if possible
+        sort : bool or None, default False
+            Whether to sort the resulting index.
+
+            * False : do not sort the result.
+            * True : sort the result. A TypeError is raised when the
+              values cannot be compared.

             .. versionadded:: 0.24.0

@@ -2367,7 +2396,10 @@ def intersection(self, other, sort=False):
         other = ensure_index(other)

         if self.equals(other):
-            return self._get_reconciled_name_object(other)
+            result = self._get_reconciled_name_object(other)
+            if sort:
+                result = result.sort_values()
+            return result

         if not is_dtype_equal(self.dtype, other.dtype):
             this = self.astype('O')
@@ -2415,7 +2447,7 @@ def intersection(self, other, sort=False):

         return taken

-    def difference(self, other, sort=True):
+    def difference(self, other, sort=None):
         """
         Return a new Index with elements from the index that are not in
         `other`.
@@ -2425,11 +2457,24 @@ def difference(self, other, sort=True):
         Parameters
         ----------
         other : Index or array-like
-        sort : bool, default True
-            Sort the resulting index if possible
+        sort : bool or None, default None
+            Whether to sort the resulting index. By default, pandas
+            attempts to sort the values, but catches any TypeError
+            raised by incomparable elements.
+
+            * None : Attempt to sort the result, but catch any TypeErrors
+              from comparing incomparable elements.
+            * False : Do not sort the result.
+            * True : Sort the result, raising a TypeError if any elements
+              cannot be compared.

             .. versionadded:: 0.24.0

+            .. versionchanged:: 0.24.1
+
+               Added the `None` option, which matches the behavior of
+               pandas 0.23.4 and earlier.
+
         Returns
         -------
         difference : Index
@@ -2460,27 +2505,42 @@ def difference(self, other, sort=True):
         label_diff = np.setdiff1d(np.arange(this.size), indexer,
                                   assume_unique=True)
         the_diff = this.values.take(label_diff)
-        if sort:
+        if sort is None:
             try:
                 the_diff = sorting.safe_sort(the_diff)
             except TypeError:
                 pass
+        elif sort:
+            the_diff = sorting.safe_sort(the_diff)

         return this._shallow_copy(the_diff, name=result_name, freq=None)

-    def symmetric_difference(self, other, result_name=None, sort=True):
+    def symmetric_difference(self, other, result_name=None, sort=None):
         """
         Compute the symmetric difference of two Index objects.

         Parameters
         ----------
         other : Index or array-like
         result_name : str
-        sort : bool, default True
-            Sort the resulting index if possible
+        sort : bool or None, default None
+            Whether to sort the resulting index. By default, pandas
+            attempts to sort the values, but catches any TypeError
+            raised by incomparable elements.
+
+            * None : Attempt to sort the result, but catch any TypeErrors
+              from comparing incomparable elements.
+            * False : Do not sort the result.
+            * True : Sort the result, raising a TypeError if any elements
+              cannot be compared.

             .. versionadded:: 0.24.0

+            .. versionchanged:: 0.24.1
+
+               Added the `None` option, which matches the behavior of
+               pandas 0.23.4 and earlier.
+
         Returns
         -------
         symmetric_difference : Index
@@ -2524,11 +2584,13 @@ def symmetric_difference(self, other, result_name=None, sort=True):
         right_diff = other.values.take(right_indexer)

         the_diff = _concat._concat_compat([left_diff, right_diff])
-        if sort:
+        if sort is None:
             try:
                 the_diff = sorting.safe_sort(the_diff)
             except TypeError:
                 pass
+        elif sort:
+            the_diff = sorting.safe_sort(the_diff)

         attribs = self._get_attributes_dict()
         attribs['name'] = result_name

34 changes: 29 additions & 5 deletions pandas/core/indexes/multi.py
@@ -2879,18 +2879,34 @@ def equal_levels(self, other):
                 return False
         return True

-    def union(self, other, sort=True):
+    def union(self, other, sort=None):
         """
         Form the union of two MultiIndex objects

         Parameters
         ----------
         other : MultiIndex or array / Index of tuples
-        sort : bool, default True
-            Sort the resulting MultiIndex if possible
+        sort : bool or None, default None
+            Whether to sort the resulting Index.
+
+            * None : Sort the result, except when
+
+              1. `self` and `other` are equal.
+              2. `self` has length 0.
+              3. Some values in `self` or `other` cannot be compared.
+                 A RuntimeWarning is issued in this case.
+
+            * True : sort the result. A TypeError is raised when the
+              values cannot be compared.
+            * False : do not sort the result.

             .. versionadded:: 0.24.0

+            .. versionchanged:: 0.24.1
+
+               Changed the default `sort` to None, matching the
+               behavior of pandas 0.23.4 and earlier.
+
         Returns
         -------
         Index
@@ -2901,8 +2917,12 @@ def union(self, other, sort=True):
         other, result_names = self._convert_can_do_setop(other)

         if len(other) == 0 or self.equals(other):
+            if sort:
+                return self.sort_values()
             return self

+        # TODO: Index.union returns other when `len(self)` is 0.
+
         uniq_tuples = lib.fast_unique_multiple([self._ndarray_values,
                                                 other._ndarray_values],
                                                sort=sort)
@@ -2917,7 +2937,7 @@ def intersection(self, other, sort=False):
         Parameters
         ----------
         other : MultiIndex or array / Index of tuples
-        sort : bool, default True
+        sort : bool, default False
             Sort the resulting MultiIndex if possible

             .. versionadded:: 0.24.0
@@ -2934,6 +2954,8 @@ def intersection(self, other, sort=False):
         other, result_names = self._convert_can_do_setop(other)

         if self.equals(other):
+            if sort:
+                return self.sort_values()
             return self

         self_tuples = self._ndarray_values
@@ -2951,7 +2973,7 @@ def intersection(self, other, sort=False):
             return MultiIndex.from_arrays(lzip(*uniq_tuples), sortorder=0,
                                           names=result_names)

-    def difference(self, other, sort=True):
+    def difference(self, other, sort=None):
         """
         Compute set difference of two MultiIndex objects

@@ -2971,6 +2993,8 @@ def difference(self, other, sort=True):
         other, result_names = self._convert_can_do_setop(other)

         if len(other) == 0:
+            if sort:
+                return self.sort_values()
             return self

         if self.equals(other):