API/WIP: .sorted #10726

jreback · 2015-08-02T17:00:55Z

closes #9816
closes #8239

Changes to sorting API
^^^^^^^^^^^^^^^^^^^^^^

The sorting API has had some longtime inconsistencies. (:issue:`9816`,:issue:`8239`).

Here is a summary of the **prior** to 0.17.0 API

- ``Series.sort`` is **INPLACE** while ``DataFrame.sort`` returns a new object.
- ``Series.order`` returned a new object
- It was possible to use ``Series/DataFrame.sort_index`` to sort by **values** by passing the ``by`` keyword.
- ``Series/DataFrame.sortlevel`` worked only on a ``MultiIndex`` for sorting by index.

To address these issues, we have revamped the API:

- We have introduced a new method, ``.sort_values()``, which is the merger of ``DataFrame.sort()``, ``Series.sort()``,
  and ``Series.order``, to handle sorting of **values**.
- The existing method ``Series.sort()`` has been deprecated and will be removed in a
  future version of pandas.
- The ``by`` argument of ``DataFrame.sort_index()`` has been deprecated and will be removed in a future version of pandas.
- The methods ``DataFrame.sort()``, ``Series.order()``, will not be recommended to use and will carry a deprecation warning
  in the doc-string.
- The existing method ``.sort_index()`` will gain the ``level`` keyword to enable level sorting.

We now have two distinct and non-overlapping methods of sorting. A ``*`` marks items that
will show a ``PendingDeprecationWarning`` (normally suppressed by python), and a ``+`` marks items that
will show a ``FutureWarning``.

To sort by the **values**:

=================================     ====================================
Previous                              Replacement
=================================     ====================================
*``Series.order()``                   ``Series.sort_values()``
+``Series.sort()``                    ``Series.sort_values(inplace=True)``
``DataFrame.sort(columns=...)``       ``DataFrame.sort_values(by=...)``
=================================     ====================================

To sort by the **index**:

=================================     ====================================
Previous                              Equivalent
=================================     ====================================
``Series.sort_index()``               ``Series.sort_index()``
``Series.sortlevel(level=...)``       ``Series.sort_index(level=...``)
``DataFrame.sort_index()``            ``DataFrame.sort_index()``
``DataFrame.sortlevel(level=...)``    ``DataFrame.sort_index(level=...)``
*``DataFrame.sort()``                 ``DataFrame.sort_index()``
==================================    ====================================

jreback · 2015-08-02T17:06:27Z

xref #10721

jreback · 2015-08-02T20:59:14Z

cc @brandon-rhodes
cc @patricktokeeffe

shoyer · 2015-08-02T22:02:59Z

pandas/core/generic.py

+       Sort by labels (along either axis), by the values in column(s) or both.
+
+       If both, labels take precedence over columns. If neither is specified,
+       behavior is object-dependent: Series = on values, Dataframe = on index.


Maybe now would be a good time to clean up this API discrepancy?

just copied the doc-string from another PR. What do you think it should say/do?

shoyer · 2015-08-02T22:37:59Z

I wonder if we want to take this as a opportunity to clean up this discrepancy for default sort behavior:

a series is sorted by its values
a dataframe is sorted by its index

It might make more sense for .sorted() to always sort by values, and require another function or argument to sort the index.

patricktokeeffe · 2015-08-02T22:53:35Z

It might make more sense for .sorted() to always sort by values, and require another function or argument to sort the index.

How would that work for dataframes? What does 'values' mean in that context? I think it makes more sense to target their commonality: index.

jreback · 2015-08-02T23:00:56Z

I think what @shoyer mean is the default could be:

This is False in master

s.sorted() == s.sort_index()

proposed:
would need to something like: s.sorted('values') (could also take the name of the Series)

s.sorted('values') == s.sorted('A') == s.order() # s.order() is the current sort the values

This is True in master

df.sorted() == df.sort_index()
df.sorted('A') == df.sort('A') # this is currently how it works, e.g. column sorting

jreback · 2015-08-02T23:02:16Z

@hayd @sinhrks @TomAugspurger @jorisvandenbossche

patricktokeeffe · 2015-08-02T23:09:21Z

OK, I see. Yes, I had misunderstood @shoyer, if that's what was meant. Thanks

shoyer · 2015-08-02T23:27:02Z

How would that work for dataframes? What does 'values' mean in that context? I think it makes more sense to target their commonality: index.

I was thinking that the default for df.sorted() might be to sort the values lexicographically in column order, e.g., the current result obtained from df.sort(list(df.columns)). But we could also take the approach of sorting the index by default instead in both cases.

Alternatively, instead of adding sorted, we might encourage the more explicit .sort_values() and .sort_index()

jreback · 2015-08-02T23:54:30Z

yeh I think .sort_index should stick around, and I like .sort_values.

==============================     ===============================
Previous                           Replacement
==============================     ===============================
``Series.order()``                 ``Series.sort_values()``
``Series.sort()``                  ``Series.sort_values(inplace=True)``
``DataFrame.sort(columns=...)``    ``DataFrame.sort_values(by=...)``
==============================     ===============================

Furthermore, the following operations are implemented using .sorted(); the original methods remain for convenience.

==============================     ===============================
Previous                           Equivalent
==============================     ===============================
``Series.sort_index()``            ``Series.sort_index()``
``Series.sortlevel(level=...)``    ``Series.sort_index(level=...``)
``DataFrame.sort_index()``         ``DataFrame.sort_index(level=True)``
``DataFrame.sortlevel(level=...)`` ``DataFrame.sort_index(level=...)``
==============================     ===============================

patricktokeeffe · 2015-08-03T01:22:22Z

I was thinking that the default for df.sorted() might be to sort the values lexicographically in column order, e.g., the current result obtained from df.sort(list(df.columns)).

I think this trades one discrepancy (values vs. index) for another (default sort on axis=0 vs axis=1). Currently df.sort() is by index, which always made more sense to me. (Disclaimer: I work with timeseries.) What happens if/when panel gets sorted? For default behaviors, acting on index seems more appropriate.

shoyer · 2015-08-03T07:00:58Z

I think this trades one discrepancy (values vs. index) for another (default sort on axis=0 vs axis=1).

Actually, df.sort already defaults to axis=0. I agree that this is confusing, though -- it's not obviously clear whether axis=0 or even axis='rows' means that the result will be sorted by rows or sorted along rows.

Currently df.sort() is by index, which always made more sense to me. (Disclaimer: I work with timeseries.)

I agree that sorting by the index usually makes sense -- if you have defined a meaningful index. This is often not the case with pandas, and for such users the default behavior of df.sort() will be confusing.

I think there is a lot to be said for requiring and/or encouraging users to be more explicit by having separate sort_index and sort_values methods.

The main downside is that it forces users to considering whether something is in the index or the values, which is another confusing distinction we are trying to slowly get away with. So, there could still be a case for having a generic .sort/.sorted method on DataFrames that requires the by argument and parses the listed names as either columns or index levels.

jorisvandenbossche · 2015-08-04T00:23:14Z

@jreback Thanks for working on this!

I made an overview of the issue (gathered from the different issues, for who wants to catch up): https://github.com/jorisvandenbossche/pandas/blob/sorting-api/doc/proposals/sorting-API.md

I can make a PR from that if that is easier to comment, but copying the discussion points I see here:

Discussion points:

Default sorting by labels or values?
- Proposed PR: keep inconsistency of Series by values and DataFrame by labels.
- Alternative: As sorting by the values (certainly for a Series) is very convenient, unify the sorted method to sort by values/columns for both Series/DataFrame. To have a convenient method to sort by the index, keep the specific sort_index method.
  
  a) Should sorted still be able to sort on the index? If not (as sort_index does this), we can leave level and sort_remaining keywords out of the signature.
  
  b) Should DataFrame.sorted require at least one column to be specified, or should it default to sort the values lexicographically in column order (equivalent to df.sorted(by=list(df.columns)))?
  
  c) If we go with this clear separation of sorting by index/values in two separate functions, do we use sorted, or something more specific as sort_values() alongside sort_index
The keyword to select the columns to sort?
- Current PR: chooses by over columns
Should Series.sort, Series.order and DataFrame.sort be deprecated?
- Current PR does deprecate these.
- As these are widespread functions, a real deprecation/removal can have a large impact, and maybe a clear 'documented' deprecations is enough?
And for the integration of sortlevel into sort_index:
- Not controversial: add level and sort_remaining to sort_index()
Should sort_index still be able to sort by the columns?
- It is very strange to use sort_index to sort by the columns (but even the implementation of DataFrame.sort uses DataFrame.sort_index).
- We could deprecate this ability (the deprecating/discouraging the by keyword)

jorisvandenbossche · 2015-08-04T00:32:33Z

And my 2 cents at the moment:

I would like to see the default of sorting by index/values cleaned-up and consistent between dataframe and series. And I like the proposal of keeping a separate function (as we already have this: sort_index) for this, and having sorted default to the values.
BTW, I also find df.sorted(level=True) very unintuitive as the recommended way to sort by the index, so this also argues for a clear separate sort_index IMO

1a) I think the new sorted should not be able to sort the index. As this is not needed (you use sort_index for that), and simplifies the API (less keyword arguments). A possible argument to keep this is to do combined sorting on index an columns, but this is currently also not possible.

Related to this, I also don't think sort_index should be able to sort on the columns (I know this is already there, but we could deprecate this, or at least change the documentation (it is now even an example in the docstring of sort_index))

jreback · 2015-08-04T21:21:20Z

FYI, the level=True stuff was just to get all in one method! Its bad I agree :)

ok so here are my thoughts

so a) agree

.sorted(..) is for values
.sort_index(..) is for the index

so we enforce the by argument to .sorted (on DataFrame)
remove the index sorting capability, IOW, levels/sort_remaining & add to sort_index.
leave .sortlevel as an alias for now (maybe soft-deprecate)

b) I don't think sorting by all columns is intuitive at all, user should simply have to specify, -1 here.

c) sorted is good. SQL-like and matches the name in python, also clear sep from sort_index.

by just for consistency, but I guess it really doesn't matter as the signature is specific to DataFrame, could go either way here
I think we MUST deprecate Series.sort. Ok with soft-deprecating Series.order (e.g. via doc-string). though I think it IS cleaner to deprecate it now.
DataFrame.sort sort is like Series.order, we are not really changing it, (except removing the sort index capability, but that would be in .sorted, I don't think we can touch it here). So soft-deprecation also ok.
yes, add .levels/sort_remaining to .sort_index
I am torn here. and I see why this is allowed in .sort_index. You want to be able to sort by a column, but you don't care if its set as the index. But that is inconsistent with the model of having a method do a single well-defined thing. E.g. .sorted sorts values, .sort_index sorts the index.

patricktokeeffe · 2015-08-04T21:54:52Z

FWIW, I agree with these recent comments. Had to take several steps back and consider how other generalized functions are handled before deciding the index isn't the best target after all.

Sorting dataframes by all their columns doesn't feel right, though - the user should have to decide.

shoyer · 2015-08-04T22:00:27Z

I'm perfectly fine with requiring an explicit by argument when sorting dataframes by values with sorted. That will also ensure an easier transition from DataFrame.sort.

sinhrks · 2015-08-04T22:44:23Z

I agree current proposals, great work:)

As functionality, sort_index should cover sorting by columns. I don't feel the name is inappropriate because there are some methods named index.. covers all axis, like reindex and idxmin.

jreback · 2015-08-05T13:24:13Z

So It think that I understand now why @wesm has a .sortlevel rather than just have this allowed in .sort_index.

Imagine

In [13]: df = DataFrame(np.arange(12,0,-1).reshape(3,-1),index=pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0)]))

In [14]: df
Out[14]: 
      0   1   2  3
0 0  12  11  10  9
  1   8   7   6  5
1 0   4   3   2  1

In [15]: df.sortlevel(0)
Out[15]: 
      0   1   2  3
0 0  12  11  10  9
  1   8   7   6  5
1 0   4   3   2  1

in the new .sort_index, then this is ambiguous
df.sort_index(by=0) ,IS that the 0'th column or the 0th level of the index?

so we would now have to do
df.sort_index(by=None,level=0) to be unambiguous (this adds the .level kw)

If we were NOT to allow pass thru columns, then the signature would be almost exactly like .sortlevel
e.g.
df.sort_index(level=...)

currently this will fail (in master) as you cannot sort a non-MultiIndex by level (though that is easily fixed).

so, I think to avoid ambiguity then we must either pick:

5a) leave .sort_index alone, and consequently .sortlevel
5b) add level kw to .sort_index IN ADDITION to the current by kw, these are discreet sorters in that by would be for the columns and level for the levels of the index, so avoids any ambiguity. The user is then in charge of saying, hey sort by these columns or these levels (in theory we could actually do a multi-sort).
5c) deprecate by and add level, avoids ambiguity but disallows the columns sorting pass thru
5d) for labels, you can actually figure out whether its a level or a column (if their are duplicates then you multi-sort or raise), so we could use the by keyword only, BUT, the issue is that you can sort on a level number! So we could not allow level numbers in ambiguous cases. more complicated but maybe more natural.

hayd · 2015-08-05T16:00:31Z

+1 for 5b. df.sort_index(level=...) is very readable/clear.

jorisvandenbossche · 2015-08-05T16:10:42Z

I don't really see a reason to not integrate sortlevel in sort_index. I would use the level keyword in any case for this (and not by, as level is more consistent with other functions).

So I would go for 5c, although 5b is also OK.
I am in favor of deprecating the by keyword (but soft-deprecation is enough I think -> not encouraging it in the docs), as this would make the API just cleaner: sorted for the values, sort_index for the labels.
I don't see a reason that sort_index should also be able to sort on the columns (apart from historical ones of course: it already can)

…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231

jreback · 2015-08-18T00:09:46Z

ok, I reverted on this a bit. Now have full deprecation (e.g. FutureWarning) for the things we are changing (e.g. Series.order/Series.sort/DataFrame.sort/Index.order/Categorical.order). Its just better to get this over and done with and move on. Let's not let this linger.

jreback · 2015-08-19T00:55:44Z

also like to get this sorted soon (get it!!!!)

@jorisvandenbossche
cc @patricktokeeffe
@hayd

hayd · 2015-08-19T01:08:41Z

This looks great! +1 to deprecating to df.sort() et al. 🔥

~~What's the deal with the na_last (I'm sure I've used this for Series in the past).~~ Ah, I see, it's replaced with na_position, cool.

jreback · 2015-08-21T14:02:18Z

ok, I am just going to merge this, and we'll see who whines :)

API/WIP: .sorted

TomAugspurger · 2015-08-21T14:15:27Z

nice work!

jorisvandenbossche · 2015-08-22T12:12:26Z

Yes, I am really glad we have a nicer interface now!

The reason I pressed for DeprecationWarning instead of FutureWarning for now, is that when you upgrade your pandas version and eg using seaborn, you will get a lot of warnings that come from seaborn and you cannot solve.
But it's good it is in master now, we can always change that easily if we get too much complaints.

jreback · 2015-08-22T20:23:48Z

@jorisvandenbossche ok, let's leave that as an item for the rc then. maybe will make that change if too much complaining.

fix some stacklevels on warnings

Some failing tests in the previous commits because older ``pandas`` versions don't have ``Series.sort_values``. That method was only added in pandas 0.17, in pandas-dev/pandas#10726

Clarifies the meaning of 'sort' in the context of `Categorical` to mean 'organization' rather than 'order', as it is possible to call this method (as well as `sort_values`) when the `Categorical` is unordered. Also patches a bug in `Categorical.sort_values` in which `na_position` was not being respected when `ascending` was set to `True`. This commit aligns the behaviour with that of `Series`. Finally, deprecates `sort` in favor of `sort_values`, which is in alignment with what was done with `Series` back in #10726. Closes #12785 Author: gfyoung <gfyoung17@gmail.com> Closes #12882 from gfyoung/categorical-sort-doc and squashes the following commits: f324a9c [gfyoung] BUG, DOC, DEP: Patch and Align Categorical's Sorting API

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

Affect classes: 1) Index 2) Series 2) DataFrame xref gh-10726

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Aug 2, 2015

jreback added this to the 0.17.0 milestone Aug 2, 2015

jreback added the Needs Discussion Requires discussion from core team before further action label Aug 2, 2015

shoyer reviewed Aug 2, 2015
View reviewed changes

jreback force-pushed the sorted branch from a3eb0ef to 8aef89c Compare August 2, 2015 22:03

jreback force-pushed the sorted branch from 8aef89c to 2fa541e Compare August 2, 2015 22:54

patricktokeeffe mentioned this pull request Aug 2, 2015

API: unified sorting #8239

Closed

jreback force-pushed the sorted branch from 2fa541e to 81566eb Compare August 2, 2015 23:04

jreback force-pushed the sorted branch from 81566eb to 80d97e1 Compare August 3, 2015 12:41

jreback force-pushed the sorted branch from 3d0ab87 to 099d6bf Compare August 17, 2015 11:52

ENH/DEPR: add .sorted() method for API consistency, pandas-dev#9816, p…

13d2d71

…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231

jreback force-pushed the sorted branch from 099d6bf to 13d2d71 Compare August 18, 2015 00:07

jreback added a commit that referenced this pull request Aug 21, 2015

Merge pull request #10726 from jreback/sorted

a6ee127

API/WIP: .sorted

jreback merged commit a6ee127 into pandas-dev:master Aug 21, 2015

jsexauer mentioned this pull request Aug 21, 2015

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jorisvandenbossche mentioned this pull request Aug 22, 2015

DOC: update docs to use new sort_values #10886

Closed

jorisvandenbossche mentioned this pull request Aug 22, 2015

REL: 0.17.0 #10848

Closed

2 tasks

jreback added a commit that referenced this pull request Aug 24, 2015

remove deprecated test code xref #10726

24c147a

fix some stacklevels on warnings

gfyoung mentioned this pull request Apr 12, 2016

BUG, DEP, DOC: Patch and Align Categorical's Sorting API #12882

Closed

jreback mentioned this pull request Jul 24, 2016

DEPR: deprecations log for removed issues #13777

Closed

chris-b1 mentioned this pull request Sep 22, 2016

DEPR: sortlevel #14279

Closed

jonmmease mentioned this pull request Oct 5, 2016

ENH: Support sorting DataFrames by a combination of columns and index levels #14353

Closed

gfyoung added a commit to forking-repos/pandas that referenced this pull request Mar 18, 2017

MAINT: Drop order and sort from pandas objects

16d259e

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

gfyoung added a commit to forking-repos/pandas that referenced this pull request Mar 19, 2017

MAINT: Drop order and sort from pandas objects

f4c085e

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

gfyoung added a commit to forking-repos/pandas that referenced this pull request Mar 19, 2017

MAINT: Drop order and sort from pandas objects

f268b93

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

gfyoung mentioned this pull request Mar 19, 2017

MAINT: Drop order and sort from pandas objects #15735

Merged

jreback pushed a commit that referenced this pull request Mar 19, 2017

MAINT: Drop order and sort from pandas objects (#15735)

9ab57dc

Affect classes: 1) Index 2) Series 2) DataFrame xref gh-10726

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

MAINT: Drop order and sort from pandas objects (pandas-dev#15735)

c2e5beb

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017

MAINT: Drop order and sort from pandas objects (pandas-dev#15735)

2534260

Affect classes: 1) Index 2) Series 2) DataFrame xref pandas-devgh-10726

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API/WIP: .sorted #10726

API/WIP: .sorted #10726

jreback commented Aug 2, 2015

jreback commented Aug 2, 2015

jreback commented Aug 2, 2015

shoyer Aug 2, 2015

jreback Aug 2, 2015

shoyer commented Aug 2, 2015

patricktokeeffe commented Aug 2, 2015

jreback commented Aug 2, 2015

jreback commented Aug 2, 2015

patricktokeeffe commented Aug 2, 2015

shoyer commented Aug 2, 2015

jreback commented Aug 2, 2015

patricktokeeffe commented Aug 3, 2015

shoyer commented Aug 3, 2015

jorisvandenbossche commented Aug 4, 2015

jorisvandenbossche commented Aug 4, 2015

jreback commented Aug 4, 2015

patricktokeeffe commented Aug 4, 2015

shoyer commented Aug 4, 2015

sinhrks commented Aug 4, 2015

jreback commented Aug 5, 2015

hayd commented Aug 5, 2015

jorisvandenbossche commented Aug 5, 2015

jreback commented Aug 18, 2015

jreback commented Aug 19, 2015

hayd commented Aug 19, 2015

jreback commented Aug 21, 2015

TomAugspurger commented Aug 21, 2015

jorisvandenbossche commented Aug 22, 2015

jreback commented Aug 22, 2015

API/WIP: .sorted #10726

API/WIP: .sorted #10726

Conversation

jreback commented Aug 2, 2015

jreback commented Aug 2, 2015

jreback commented Aug 2, 2015

shoyer Aug 2, 2015

Choose a reason for hiding this comment

jreback Aug 2, 2015

Choose a reason for hiding this comment

shoyer commented Aug 2, 2015

patricktokeeffe commented Aug 2, 2015

jreback commented Aug 2, 2015

jreback commented Aug 2, 2015

patricktokeeffe commented Aug 2, 2015

shoyer commented Aug 2, 2015

jreback commented Aug 2, 2015

patricktokeeffe commented Aug 3, 2015

shoyer commented Aug 3, 2015

jorisvandenbossche commented Aug 4, 2015

jorisvandenbossche commented Aug 4, 2015

jreback commented Aug 4, 2015

patricktokeeffe commented Aug 4, 2015

shoyer commented Aug 4, 2015

sinhrks commented Aug 4, 2015

jreback commented Aug 5, 2015

hayd commented Aug 5, 2015

jorisvandenbossche commented Aug 5, 2015

jreback commented Aug 18, 2015

jreback commented Aug 19, 2015

hayd commented Aug 19, 2015

jreback commented Aug 21, 2015

TomAugspurger commented Aug 21, 2015

jorisvandenbossche commented Aug 22, 2015

jreback commented Aug 22, 2015