Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/WIP: .sorted #10726

Merged
merged 1 commit into from
Aug 21, 2015
Merged

API/WIP: .sorted #10726

merged 1 commit into from
Aug 21, 2015

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Aug 2, 2015

closes #9816
closes #8239

Changes to sorting API
^^^^^^^^^^^^^^^^^^^^^^

The sorting API has had some longtime inconsistencies. (:issue:`9816`,:issue:`8239`).

Here is a summary of the **prior** to 0.17.0 API

- ``Series.sort`` is **INPLACE** while ``DataFrame.sort`` returns a new object.
- ``Series.order`` returned a new object
- It was possible to use ``Series/DataFrame.sort_index`` to sort by **values** by passing the ``by`` keyword.
- ``Series/DataFrame.sortlevel`` worked only on a ``MultiIndex`` for sorting by index.

To address these issues, we have revamped the API:

- We have introduced a new method, ``.sort_values()``, which is the merger of ``DataFrame.sort()``, ``Series.sort()``,
  and ``Series.order``, to handle sorting of **values**.
- The existing method ``Series.sort()`` has been deprecated and will be removed in a
  future version of pandas.
- The ``by`` argument of ``DataFrame.sort_index()`` has been deprecated and will be removed in a future version of pandas.
- The methods ``DataFrame.sort()``, ``Series.order()``, will not be recommended to use and will carry a deprecation warning
  in the doc-string.
- The existing method ``.sort_index()`` will gain the ``level`` keyword to enable level sorting.

We now have two distinct and non-overlapping methods of sorting. A ``*`` marks items that
will show a ``PendingDeprecationWarning`` (normally suppressed by python), and a ``+`` marks items that
will show a ``FutureWarning``.

To sort by the **values**:

=================================     ====================================
Previous                              Replacement
=================================     ====================================
*``Series.order()``                   ``Series.sort_values()``
+``Series.sort()``                    ``Series.sort_values(inplace=True)``
``DataFrame.sort(columns=...)``       ``DataFrame.sort_values(by=...)``
=================================     ====================================

To sort by the **index**:

=================================     ====================================
Previous                              Equivalent
=================================     ====================================
``Series.sort_index()``               ``Series.sort_index()``
``Series.sortlevel(level=...)``       ``Series.sort_index(level=...``)
``DataFrame.sort_index()``            ``DataFrame.sort_index()``
``DataFrame.sortlevel(level=...)``    ``DataFrame.sort_index(level=...)``
*``DataFrame.sort()``                 ``DataFrame.sort_index()``
==================================    ====================================

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Aug 2, 2015
@jreback jreback added this to the 0.17.0 milestone Aug 2, 2015
@jreback jreback added the Needs Discussion Requires discussion from core team before further action label Aug 2, 2015
@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2015

xref #10721

@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2015

cc @brandon-rhodes
cc @patricktokeeffe

Sort by labels (along either axis), by the values in column(s) or both.

If both, labels take precedence over columns. If neither is specified,
behavior is object-dependent: Series = on values, Dataframe = on index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe now would be a good time to clean up this API discrepancy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just copied the doc-string from another PR. What do you think it should say/do?

@shoyer
Copy link
Member

shoyer commented Aug 2, 2015

I wonder if we want to take this as a opportunity to clean up this discrepancy for default sort behavior:

  • a series is sorted by its values
  • a dataframe is sorted by its index

It might make more sense for .sorted() to always sort by values, and require another function or argument to sort the index.

@patricktokeeffe
Copy link
Contributor

It might make more sense for .sorted() to always sort by values, and require another function or argument to sort the index.

How would that work for dataframes? What does 'values' mean in that context? I think it makes more sense to target their commonality: index.

@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2015

I think what @shoyer mean is the default could be:

This is False in master

s.sorted() == s.sort_index()

proposed:
would need to something like: s.sorted('values') (could also take the name of the Series)

s.sorted('values') == s.sorted('A') == s.order() # s.order() is the current sort the values

This is True in master

df.sorted() == df.sort_index()
df.sorted('A') == df.sort('A') # this is currently how it works, e.g. column sorting

@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2015

@patricktokeeffe
Copy link
Contributor

OK, I see. Yes, I had misunderstood @shoyer, if that's what was meant. Thanks

@shoyer
Copy link
Member

shoyer commented Aug 2, 2015

How would that work for dataframes? What does 'values' mean in that context? I think it makes more sense to target their commonality: index.

I was thinking that the default for df.sorted() might be to sort the values lexicographically in column order, e.g., the current result obtained from df.sort(list(df.columns)). But we could also take the approach of sorting the index by default instead in both cases.

Alternatively, instead of adding sorted, we might encourage the more explicit .sort_values() and .sort_index()

@jreback
Copy link
Contributor Author

jreback commented Aug 2, 2015

yeh I think .sort_index should stick around, and I like .sort_values.

==============================     ===============================
Previous                           Replacement
==============================     ===============================
``Series.order()``                 ``Series.sort_values()``
``Series.sort()``                  ``Series.sort_values(inplace=True)``
``DataFrame.sort(columns=...)``    ``DataFrame.sort_values(by=...)``
==============================     ===============================

Furthermore, the following operations are implemented using .sorted(); the original methods remain for convenience.

==============================     ===============================
Previous                           Equivalent
==============================     ===============================
``Series.sort_index()``            ``Series.sort_index()``
``Series.sortlevel(level=...)``    ``Series.sort_index(level=...``)
``DataFrame.sort_index()``         ``DataFrame.sort_index(level=True)``
``DataFrame.sortlevel(level=...)`` ``DataFrame.sort_index(level=...)``
==============================     ===============================

@patricktokeeffe
Copy link
Contributor

I was thinking that the default for df.sorted() might be to sort the values lexicographically in column order, e.g., the current result obtained from df.sort(list(df.columns)).

I think this trades one discrepancy (values vs. index) for another (default sort on axis=0 vs axis=1). Currently df.sort() is by index, which always made more sense to me. (Disclaimer: I work with timeseries.) What happens if/when panel gets sorted? For default behaviors, acting on index seems more appropriate.

@shoyer
Copy link
Member

shoyer commented Aug 3, 2015

I think this trades one discrepancy (values vs. index) for another (default sort on axis=0 vs axis=1).

Actually, df.sort already defaults to axis=0. I agree that this is confusing, though -- it's not obviously clear whether axis=0 or even axis='rows' means that the result will be sorted by rows or sorted along rows.

Currently df.sort() is by index, which always made more sense to me. (Disclaimer: I work with timeseries.)

I agree that sorting by the index usually makes sense -- if you have defined a meaningful index. This is often not the case with pandas, and for such users the default behavior of df.sort() will be confusing.

I think there is a lot to be said for requiring and/or encouraging users to be more explicit by having separate sort_index and sort_values methods.

The main downside is that it forces users to considering whether something is in the index or the values, which is another confusing distinction we are trying to slowly get away with. So, there could still be a case for having a generic .sort/.sorted method on DataFrames that requires the by argument and parses the listed names as either columns or index levels.

@jorisvandenbossche
Copy link
Member

@jreback Thanks for working on this!

I made an overview of the issue (gathered from the different issues, for who wants to catch up): https://github.com/jorisvandenbossche/pandas/blob/sorting-api/doc/proposals/sorting-API.md

I can make a PR from that if that is easier to comment, but copying the discussion points I see here:

Discussion points:

  1. Default sorting by labels or values?
    • Proposed PR: keep inconsistency of Series by values and DataFrame by labels.

    • Alternative: As sorting by the values (certainly for a Series) is very convenient, unify the sorted method to sort by values/columns for both Series/DataFrame. To have a convenient method to sort by the index, keep the specific sort_index method.

      a) Should sorted still be able to sort on the index? If not (as sort_index does this), we can leave level and sort_remaining keywords out of the signature.

      b) Should DataFrame.sorted require at least one column to be specified, or should it default to sort the values lexicographically in column order (equivalent to df.sorted(by=list(df.columns)))?

      c) If we go with this clear separation of sorting by index/values in two separate functions, do we use sorted, or something more specific as sort_values() alongside sort_index

  2. The keyword to select the columns to sort?
    • Current PR: chooses by over columns
  3. Should Series.sort, Series.order and DataFrame.sort be deprecated?
    • Current PR does deprecate these.
    • As these are widespread functions, a real deprecation/removal can have a large impact, and maybe a clear 'documented' deprecations is enough?
  4. And for the integration of sortlevel into sort_index:
    • Not controversial: add level and sort_remaining to sort_index()
  5. Should sort_index still be able to sort by the columns?
    • It is very strange to use sort_index to sort by the columns (but even the implementation of DataFrame.sort uses DataFrame.sort_index).
    • We could deprecate this ability (the deprecating/discouraging the by keyword)

@jorisvandenbossche
Copy link
Member

And my 2 cents at the moment:

  1. I would like to see the default of sorting by index/values cleaned-up and consistent between dataframe and series. And I like the proposal of keeping a separate function (as we already have this: sort_index) for this, and having sorted default to the values.
    BTW, I also find df.sorted(level=True) very unintuitive as the recommended way to sort by the index, so this also argues for a clear separate sort_index IMO

1a) I think the new sorted should not be able to sort the index. As this is not needed (you use sort_index for that), and simplifies the API (less keyword arguments). A possible argument to keep this is to do combined sorting on index an columns, but this is currently also not possible.

Related to this, I also don't think sort_index should be able to sort on the columns (I know this is already there, but we could deprecate this, or at least change the documentation (it is now even an example in the docstring of sort_index))

@jreback
Copy link
Contributor Author

jreback commented Aug 4, 2015

FYI, the level=True stuff was just to get all in one method! Its bad I agree :)

ok so here are my thoughts

  1. so a) agree

.sorted(..) is for values
.sort_index(..) is for the index

  • so we enforce the by argument to .sorted (on DataFrame)
  • remove the index sorting capability, IOW, levels/sort_remaining & add to sort_index.
  • leave .sortlevel as an alias for now (maybe soft-deprecate)

b) I don't think sorting by all columns is intuitive at all, user should simply have to specify, -1 here.

c) sorted is good. SQL-like and matches the name in python, also clear sep from sort_index.

  1. by just for consistency, but I guess it really doesn't matter as the signature is specific to DataFrame, could go either way here

  2. I think we MUST deprecate Series.sort. Ok with soft-deprecating Series.order (e.g. via doc-string). though I think it IS cleaner to deprecate it now.
    DataFrame.sort sort is like Series.order, we are not really changing it, (except removing the sort index capability, but that would be in .sorted, I don't think we can touch it here). So soft-deprecation also ok.

  3. yes, add .levels/sort_remaining to .sort_index

  4. I am torn here. and I see why this is allowed in .sort_index. You want to be able to sort by a column, but you don't care if its set as the index. But that is inconsistent with the model of having a method do a single well-defined thing. E.g. .sorted sorts values, .sort_index sorts the index.

@patricktokeeffe
Copy link
Contributor

FWIW, I agree with these recent comments. Had to take several steps back and consider how other generalized functions are handled before deciding the index isn't the best target after all.

Sorting dataframes by all their columns doesn't feel right, though - the user should have to decide.

@shoyer
Copy link
Member

shoyer commented Aug 4, 2015

I'm perfectly fine with requiring an explicit by argument when sorting dataframes by values with sorted. That will also ensure an easier transition from DataFrame.sort.

@sinhrks
Copy link
Member

sinhrks commented Aug 4, 2015

I agree current proposals, great work:)

  1. As functionality, sort_index should cover sorting by columns. I don't feel the name is inappropriate because there are some methods named index.. covers all axis, like reindex and idxmin.

@jreback
Copy link
Contributor Author

jreback commented Aug 5, 2015

So It think that I understand now why @wesm has a .sortlevel rather than just have this allowed in .sort_index.

Imagine

In [13]: df = DataFrame(np.arange(12,0,-1).reshape(3,-1),index=pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0)]))

In [14]: df
Out[14]: 
      0   1   2  3
0 0  12  11  10  9
  1   8   7   6  5
1 0   4   3   2  1

In [15]: df.sortlevel(0)
Out[15]: 
      0   1   2  3
0 0  12  11  10  9
  1   8   7   6  5
1 0   4   3   2  1

in the new .sort_index, then this is ambiguous
df.sort_index(by=0) ,IS that the 0'th column or the 0th level of the index?

so we would now have to do
df.sort_index(by=None,level=0) to be unambiguous (this adds the .level kw)

If we were NOT to allow pass thru columns, then the signature would be almost exactly like .sortlevel
e.g.
df.sort_index(level=...)

currently this will fail (in master) as you cannot sort a non-MultiIndex by level (though that is easily fixed).

so, I think to avoid ambiguity then we must either pick:

  • 5a) leave .sort_index alone, and consequently .sortlevel
  • 5b) add level kw to .sort_index IN ADDITION to the current by kw, these are discreet sorters in that by would be for the columns and level for the levels of the index, so avoids any ambiguity. The user is then in charge of saying, hey sort by these columns or these levels (in theory we could actually do a multi-sort).
  • 5c) deprecate by and add level, avoids ambiguity but disallows the columns sorting pass thru
  • 5d) for labels, you can actually figure out whether its a level or a column (if their are duplicates then you multi-sort or raise), so we could use the by keyword only, BUT, the issue is that you can sort on a level number! So we could not allow level numbers in ambiguous cases. more complicated but maybe more natural.

@hayd
Copy link
Contributor

hayd commented Aug 5, 2015

+1 for 5b. df.sort_index(level=...) is very readable/clear.

@jorisvandenbossche
Copy link
Member

I don't really see a reason to not integrate sortlevel in sort_index. I would use the level keyword in any case for this (and not by, as level is more consistent with other functions).

So I would go for 5c, although 5b is also OK.
I am in favor of deprecating the by keyword (but soft-deprecation is enough I think -> not encouraging it in the docs), as this would make the API just cleaner: sorted for the values, sort_index for the labels.
I don't see a reason that sort_index should also be able to sort on the columns (apart from historical ones of course: it already can)

@jreback
Copy link
Contributor Author

jreback commented Aug 18, 2015

ok, I reverted on this a bit. Now have full deprecation (e.g. FutureWarning) for the things we are changing (e.g. Series.order/Series.sort/DataFrame.sort/Index.order/Categorical.order). Its just better to get this over and done with and move on. Let's not let this linger.

@jreback
Copy link
Contributor Author

jreback commented Aug 19, 2015

also like to get this sorted soon (get it!!!!)

@jorisvandenbossche
cc @patricktokeeffe
@hayd

@hayd
Copy link
Contributor

hayd commented Aug 19, 2015

This looks great! +1 to deprecating to df.sort() et al. 🔥

What's the deal with the na_last (I'm sure I've used this for Series in the past). Ah, I see, it's replaced with na_position, cool.

@jreback
Copy link
Contributor Author

jreback commented Aug 21, 2015

ok, I am just going to merge this, and we'll see who whines :)

jreback added a commit that referenced this pull request Aug 21, 2015
@jreback jreback merged commit a6ee127 into pandas-dev:master Aug 21, 2015
@TomAugspurger
Copy link
Contributor

nice work!

@jorisvandenbossche
Copy link
Member

Yes, I am really glad we have a nicer interface now!

The reason I pressed for DeprecationWarning instead of FutureWarning for now, is that when you upgrade your pandas version and eg using seaborn, you will get a lot of warnings that come from seaborn and you cannot solve.
But it's good it is in master now, we can always change that easily if we get too much complaints.

@jreback
Copy link
Contributor Author

jreback commented Aug 22, 2015

@jorisvandenbossche ok, let's leave that as an item for the rc then. maybe will make that change if too much complaining.

@jorisvandenbossche jorisvandenbossche mentioned this pull request Aug 22, 2015
2 tasks
jreback added a commit that referenced this pull request Aug 24, 2015
fix some stacklevels on warnings
rgommers added a commit to rgommers/statsmodels that referenced this pull request Oct 26, 2015
Some failing tests in the previous commits because older ``pandas`` versions
don't have ``Series.sort_values``.  That method was only added in pandas 0.17,
in pandas-dev/pandas#10726
rgommers added a commit to rgommers/statsmodels that referenced this pull request Oct 28, 2015
Some failing tests in the previous commits because older ``pandas`` versions
don't have ``Series.sort_values``.  That method was only added in pandas 0.17,
in pandas-dev/pandas#10726
jreback pushed a commit that referenced this pull request Apr 14, 2016
Clarifies the meaning of 'sort' in the context of `Categorical` to
mean 'organization' rather than 'order',  as it is possible to call
this method (as well as `sort_values`) when the `Categorical` is
unordered.    Also patches a bug in `Categorical.sort_values` in which
`na_position` was not being respected when  `ascending` was set to
`True`. This commit aligns the behaviour with that of `Series`.
Finally, deprecates `sort` in favor of `sort_values`, which is in
alignment with what was done with `Series` back in #10726.

Closes #12785

Author: gfyoung <gfyoung17@gmail.com>

Closes #12882 from gfyoung/categorical-sort-doc and squashes the following commits:

f324a9c [gfyoung] BUG, DOC, DEP: Patch and Align Categorical's Sorting API
@chris-b1 chris-b1 mentioned this pull request Sep 22, 2016
gfyoung added a commit to forking-repos/pandas that referenced this pull request Mar 18, 2017
Affect classes:

1) Index
2) Series
2) DataFrame

xref pandas-devgh-10726
gfyoung added a commit to forking-repos/pandas that referenced this pull request Mar 19, 2017
Affect classes:

1) Index
2) Series
2) DataFrame

xref pandas-devgh-10726
gfyoung added a commit to forking-repos/pandas that referenced this pull request Mar 19, 2017
Affect classes:

1) Index
2) Series
2) DataFrame

xref pandas-devgh-10726
jreback pushed a commit that referenced this pull request Mar 19, 2017
Affect classes:

1) Index
2) Series
2) DataFrame

xref gh-10726
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: sorted() methods on everything API: unified sorting
8 participants