Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering #12679

Closed
rchurt opened this issue Mar 21, 2016 · 11 comments · Fixed by #39009
Closed

ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering #12679

rchurt opened this issue Mar 21, 2016 · 11 comments · Fixed by #39009
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement
Milestone

Comments

@rchurt
Copy link

rchurt commented Mar 21, 2016

Hello,

I'm trying to make a new DataFrame that contains the value counts of a column of an existing DataFrame (spreadsheet.xlsx), but I want the rows in the new DataFrame to be in the same order as the old one.

When I do:

import pandas as pd
df = pd.read_excel('./spreadsheet.xlsx')
print(df[0].value_counts(sort=False))

I get the DataFrame:

h4  8
ct1 6
f2  2
s1  2
EST2    2
f5  2
E4  8
h2  8
hd2 7
f3  2
ART1    2
s2  2
f1  2
h3  8
EST1    2
s3  2
E6  8
ART2    2
DGT2    2
ct2 6
s4  2
ct3 6
f4  2
DGT1    2
s5  2

When what I really want is:

h2  8
h3  8
h4  8
hd2 7
E4  8
E6  8
ct1 6
.
.
.

...because that's the order in which the values occur in the original DataFrame.

I can't tell how it's sorting them, but it is somehow. Is this the expected behavior?

Thanks

Installed versions:
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

.value_counts() returns sorted by the values by default, otherwise its in an arbitrary ordering. If you want to preserve orderings while counting, you could do something like this.

In [16]: df[0].groupby(df[0], sort=False).count()
Out[16]: 
0
h2      8
h3      8
h4      8
hd2     7
E4      8
E6      8
ct1     6
ct2     6
ct3     6
ART1    2
ART2    2
DGT1    2
DGT2    2
EST1    2
EST2    2
f1      2
f2      2
f3      2
f4      2
f5      2
s1      2
s2      2
s3      2
s4      2
s5      2
dtype: int64

@jreback jreback closed this as completed Mar 21, 2016
@jreback jreback added Usage Question Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 21, 2016
@jorisvandenbossche
Copy link
Member

@jreback value_counts has a sort argument, which the OP used. Shouldn't this give the same result as your groupby approach ? (in any case, otherwise it is a very confusing argument)

@jreback
Copy link
Contributor

jreback commented Mar 21, 2016

no this is an arbitrary order, subject to how the hashtable works. Its not a guarantee or ANY kind of ordering. Typically this routine sorts by biggest values, and that is almost always what you want.

I think this could do what you want (e.g. original ordering)

In [4]: df[0].value_counts(sort=False).reindex(pd.unique(df[0]))
Out[4]: 
h2      8
h3      8
h4      8
hd2     7
E4      8
E6      8
ct1     6
ct2     6
ct3     6
ART1    2
ART2    2
DGT1    2
DGT2    2
EST1    2
EST2    2
f1      2
f2      2
f3      2
f4      2
f5      2
s1      2
s2      2
s3      2
s4      2
s5      2
Name: 0, dtype: int64

yeah, I would say we could change this to have sort=False mean original ordering. I don't know if this would actually break anything (and done internally this isn't very costly as the uniques are already known).

@jreback jreback reopened this Mar 21, 2016
@jreback jreback added Difficulty Intermediate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Usage Question labels Mar 21, 2016
@jreback jreback added this to the Next Major Release milestone Mar 21, 2016
@jreback jreback changed the title pandas.Series.value_counts "sort=False" argument not working ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering Mar 21, 2016
@jorisvandenbossche
Copy link
Member

I would say we could change this to have sort=False mean original ordering

+1

Have the order being dependent on the internals of the hashtable (and so also dependent of the dtype of your data) seems not very useful as an actual return value for sort=False

@kawochen
Copy link
Contributor

For anyone looking to tackle this, IIRC there isn't a lot of code sharing between Series.value_counts and GroupBy.value_counts, so the latter needs to be updated as well.

@OXPHOS
Copy link
Contributor

OXPHOS commented Sep 28, 2016

The order is changed from pandas/hashtable.pyx.build_count_table_object(). Resizing of the pymap moves the entries by hashing values. The best solution I can think of is adding an index list to save the original order of the input:

    index_list = []
    for value in values:
        if value is not in index_list:
            index_list.append(value)

and map the result to new lists from the order of the index list:

    ...
    kh_destroy_pymap(table)

    result_dict = dict(zip(result_keys, result_counts))
    result_ordered_keys = index_list
    result_ordered_counts = list[]
    for key in result_ordered_keys:
        result_ordered_counts.append(result_dict[key])

    return result_ordered_keys, result_ordered_counts

I am not sure whether it's worth to change or whether there's better way to solve the problem.

@jreback jreback modified the milestones: Next Minor Release, Next Major Release Mar 29, 2017
tomspur added a commit to tomspur/pandas that referenced this issue Apr 4, 2017
tomspur added a commit to tomspur/pandas that referenced this issue Apr 10, 2017
tomspur added a commit to tomspur/pandas that referenced this issue Apr 10, 2017
@jstray
Copy link

jstray commented May 31, 2017

+1 for maintaining the original order

@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
tomspur added a commit to tomspur/pandas that referenced this issue Dec 16, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
tomspur added a commit to tomspur/pandas that referenced this issue Dec 16, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
tomspur added a commit to tomspur/pandas that referenced this issue Dec 21, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
tomspur added a commit to tomspur/pandas that referenced this issue Dec 21, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
tomspur added a commit to tomspur/pandas that referenced this issue Dec 21, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
tomspur added a commit to tomspur/pandas that referenced this issue Dec 23, 2018
Ensure that value_counts returns the same ordering of the indices than the input object
when sorting the values no matter if it is ascending or descending.

This fixes pandas-dev#12679.
@mroeschke mroeschke removed the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 13, 2020
@jreback
Copy link
Contributor

jreback commented Oct 6, 2020

is this closed by #32449 ?

@jreback
Copy link
Contributor

jreback commented Dec 31, 2020

pretty sure this is ok now that we have consistent hashing cc @realead if you want to have a look

@realead
Copy link
Contributor

realead commented Jan 1, 2021

@jreback

IIUC, in order to preserve the original order, hashmap needs to be insertion-ordered (like CPython's dicts for Py3.6+), but khash-maps aren't. Changes to hash-functions will not fix it.

It looks like there are at least two approaches at hand:

Second option seems to be a "more fundamental" fix and probably faster, if uniques aren't precalculated. However it still will lead to a performance decrease, which might be an issue.

A perfect solution would have following options for order of the output:

  • "sorted" (corresponds to the current sorted=True), overhead due to sorting
  • "arbitrary" (corresponds to the current sorted=False), no overhead, best performance
  • "insertion ordered" (correstponds to the proposed result of `sorted=False' in this issue), overhead to keep the insertion order.

I'm not sure how much the overhead for "insertion ordered" could be. Assuming that cache misses are the bottle-neck it could be up to 50%-100% slower.

@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

thanks for the analysis @realead . i don't think performance is a big deal here. I would opt for insertion order (via option 2) if its not too complicated (it sounds like we already have some of this so maybe would be fine). if that is not feasible then a fix, ala .unique would be fine too.

@jreback jreback modified the milestones: Contributions Welcome, 1.3 Jan 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement
Projects
None yet
10 participants