ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering #12679

rchurt · 2016-03-21T02:38:57Z

Hello,

I'm trying to make a new DataFrame that contains the value counts of a column of an existing DataFrame (spreadsheet.xlsx), but I want the rows in the new DataFrame to be in the same order as the old one.

When I do:

import pandas as pd
df = pd.read_excel('./spreadsheet.xlsx')
print(df[0].value_counts(sort=False))

I get the DataFrame:

h4  8
ct1 6
f2  2
s1  2
EST2    2
f5  2
E4  8
h2  8
hd2 7
f3  2
ART1    2
s2  2
f1  2
h3  8
EST1    2
s3  2
E6  8
ART2    2
DGT2    2
ct2 6
s4  2
ct3 6
f4  2
DGT1    2
s5  2

When what I really want is:

h2  8
h3  8
h4  8
hd2 7
E4  8
E6  8
ct1 6
.
.
.

...because that's the order in which the values occur in the original DataFrame.

I can't tell how it's sorting them, but it is somehow. Is this the expected behavior?

Thanks

Installed versions:
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

jreback · 2016-03-21T11:57:20Z

.value_counts() returns results sorted by count by default; otherwise they come back in an arbitrary order. If you want to preserve the original ordering while counting, you could do something like this.

In [16]: df[0].groupby(df[0], sort=False).count()
Out[16]: 
0
h2      8
h3      8
h4      8
hd2     7
E4      8
E6      8
ct1     6
ct2     6
ct3     6
ART1    2
ART2    2
DGT1    2
DGT2    2
EST1    2
EST2    2
f1      2
f2      2
f3      2
f4      2
f5      2
s1      2
s2      2
s3      2
s4      2
s5      2
dtype: int64
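
For readers who want to try the groupby approach without the spreadsheet, here is a minimal sketch. The Series contents are made up for illustration and only stand in for the column from the original report.

```python
import pandas as pd

# Hypothetical stand-in for df[0]; the real data lives in spreadsheet.xlsx.
s = pd.Series(["h2", "h2", "h3", "E4", "h3", "h2", "ct1"])

# Group the Series by its own values. sort=False keeps the groups in the
# order in which each value first appears instead of sorting the keys.
counts = s.groupby(s, sort=False).count()

print(counts)
```

The index of `counts` follows first appearance (h2, h3, E4, ct1) rather than alphabetical or count order.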

jorisvandenbossche · 2016-03-21T12:15:06Z

@jreback value_counts has a sort argument, which the OP used. Shouldn't this give the same result as your groupby approach ? (in any case, otherwise it is a very confusing argument)

jreback · 2016-03-21T12:29:47Z

No, this is an arbitrary order, subject to how the hashtable works. It's not a guarantee of ANY kind of ordering. Typically this routine sorts by the biggest counts first, and that is almost always what you want.

I think this could do what you want (e.g. original ordering)

In [4]: df[0].value_counts(sort=False).reindex(pd.unique(df[0]))
Out[4]: 
h2      8
h3      8
h4      8
hd2     7
E4      8
E6      8
ct1     6
ct2     6
ct3     6
ART1    2
ART2    2
DGT1    2
DGT2    2
EST1    2
EST2    2
f1      2
f2      2
f3      2
f4      2
f5      2
s1      2
s2      2
s3      2
s4      2
s5      2
Name: 0, dtype: int64
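
The reindex workaround can be wrapped in a small helper. This is only a sketch: the helper name `value_counts_in_order` and the sample frame are assumptions for illustration, not part of pandas.

```python
import pandas as pd


def value_counts_in_order(series):
    """Count values, then restore the order of first appearance."""
    # Count without sorting by frequency; on older pandas the resulting
    # order is whatever the hashtable happens to produce.
    counts = series.value_counts(sort=False)
    # pd.unique returns the distinct values in order of appearance,
    # so reindexing by it puts the counts back in that order.
    return counts.reindex(pd.unique(series))


# Hypothetical data standing in for the spreadsheet column.
df = pd.DataFrame({0: ["h2", "h2", "h3", "E4", "h3", "h2", "ct1"]})
ordered = value_counts_in_order(df[0])
print(ordered)
```

Because every value returned by `pd.unique` is also a key of the counts, the reindex introduces no missing entries.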

yeah, I would say we could change this to have sort=False mean original ordering. I don't know if this would actually break anything (and done internally this isn't very costly as the uniques are already known).

jorisvandenbossche · 2016-03-21T12:45:32Z

"I would say we could change this to have sort=False mean original ordering"

+1

Having the order depend on the internals of the hashtable (and so also on the dtype of your data) does not seem very useful as an actual return value for sort=False

kawochen · 2016-03-21T13:08:11Z

For anyone looking to tackle this, IIRC there isn't a lot of code sharing between Series.value_counts and GroupBy.value_counts, so the latter needs to be updated as well.

OXPHOS · 2016-09-28T04:24:55Z

The order is changed in build_count_table_object() in pandas/hashtable.pyx. Resizing the pymap moves the entries according to their hash values. The best solution I can think of is adding an index list to save the original order of the input:

    index_list = []
    for value in values:
        if value not in index_list:
            index_list.append(value)

and map the result to new lists from the order of the index list:

    ...
    kh_destroy_pymap(table)

    result_dict = dict(zip(result_keys, result_counts))
    result_ordered_keys = index_list
    result_ordered_counts = []
    for key in result_ordered_keys:
        result_ordered_counts.append(result_dict[key])

    return result_ordered_keys, result_ordered_counts

I am not sure whether this is worth changing, or whether there is a better way to solve the problem.
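
The remapping step described above can be tried in plain Python. This is a sketch, not the Cython code in hashtable.pyx: the function name `reorder_counts` is hypothetical, and `dict.fromkeys` is swapped in for the index list so that collecting the first-seen keys does not need a linear `in` scan per element.

```python
def reorder_counts(values, result_keys, result_counts):
    """Reorder (keys, counts) so the keys follow first appearance in values."""
    # Look up each count by key, whatever order the table produced.
    result_dict = dict(zip(result_keys, result_counts))
    # dict.fromkeys keeps the first occurrence of each value, in input order
    # (dicts preserve insertion order on Python 3.7+).
    ordered_keys = list(dict.fromkeys(values))
    ordered_counts = [result_dict[key] for key in ordered_keys]
    return ordered_keys, ordered_counts


values = ["h2", "h3", "h2", "E4", "h3", "h2"]
# Pretend the hashtable handed back its entries in some arbitrary order.
keys, counts = reorder_counts(values, ["E4", "h2", "h3"], [1, 3, 2])
print(keys, counts)
```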

jstray · 2017-05-31T20:55:27Z

+1 for maintaining the original order

Ensure that value_counts returns the same ordering of the indices than the input object when sorting the values no matter if it is ascending or descending. This fixes pandas-dev#12679.

jreback · 2020-10-06T10:54:53Z

is this closed by #32449 ?

jreback · 2020-12-31T16:42:54Z

pretty sure this is ok now that we have consistent hashing cc @realead if you want to have a look

realead · 2021-01-01T22:18:14Z

@jreback

IIUC, in order to preserve the original order, hashmap needs to be insertion-ordered (like CPython's dicts for Py3.6+), but khash-maps aren't. Changes to hash-functions will not fix it.

It looks like there are at least two approaches at hand:

1. A postprocessing step involving pd.unique, as proposed earlier in this issue or in "Series.value_counts: Preserve original ordering" (#24302), to restore insertion order.

2. Using a similar approach to unique:

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Lines 424 to 427 in 1fc5efd

    
    def _unique(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
                Py_ssize_t count_prior=0, Py_ssize_t na_sentinel=-1,
                object na_value=None, bint ignore_na=False,
                object mask=None, bint return_inverse=False):

where elements are appended to a special data structure ({{name}}Vector uniques) in the right order, but doing so already in build_count_table_{{dtype}}:

pandas/pandas/_libs/hashtable_func_helper.pxi.in

Line 33 in 1fc5efd

    cdef build_count_table_{{dtype}}({{dtype}}_t[:] values,

This is the same idea as in the earlier comment on this issue (#12679 (comment)), but using a data structure which is faster than Python's list.

Second option seems to be a "more fundamental" fix and probably faster, if uniques aren't precalculated. However it still will lead to a performance decrease, which might be an issue.
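
To make the second option concrete, here is a pure-Python sketch of counting while appending each new key to a separate "uniques" container. The names are hypothetical: a dict stands in for the khash table and plain lists stand in for {{name}}Vector, so this shows the bookkeeping only and says nothing about the performance of the Cython version.

```python
def build_count_table(values):
    """Count values, recording each distinct value when it is first seen."""
    slot_of = {}   # value -> position in uniques (stand-in for the hash table)
    uniques = []   # distinct values in first-seen order (stand-in for the vector)
    counts = []    # counts[i] is the number of occurrences of uniques[i]
    for value in values:
        slot = slot_of.get(value)
        if slot is None:
            # First occurrence: remember where this value lives.
            slot_of[value] = len(uniques)
            uniques.append(value)
            counts.append(1)
        else:
            counts[slot] += 1
    return uniques, counts


uniques, counts = build_count_table([3, 1, 3, 2, 1, 3])
print(uniques, counts)
```

In this layout the table is only used for lookups, so its internal order never leaks into the result.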

A perfect solution would have following options for order of the output:

"sorted" (corresponds to the current sort=True): overhead due to sorting
"arbitrary" (corresponds to the current sort=False): no overhead, best performance
"insertion ordered" (corresponds to the proposed result of sort=False in this issue): overhead to keep the insertion order
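
As a rough illustration of how such an option could look from the caller's side, here is a pure-Python sketch. The function `count_values` and its `order` parameter are hypothetical and are not a pandas API. Only the "sorted" and "insertion" modes are shown, because Python's Counter is already insertion-ordered (3.7+) and so cannot reproduce a truly "arbitrary" order.

```python
from collections import Counter


def count_values(values, order="sorted"):
    """Return (value, count) pairs in the requested order."""
    # Counter is a dict subclass, so its items come out in first-seen order.
    items = list(Counter(values).items())
    if order == "sorted":
        # Stable sort by count, largest first; ties keep first-seen order.
        items.sort(key=lambda pair: pair[1], reverse=True)
    elif order != "insertion":
        raise ValueError("order must be 'sorted' or 'insertion'")
    return items


data = ["f2", "h2", "ct1", "h2", "ct1", "h2", "s1"]
by_count = count_values(data, order="sorted")
by_first_seen = count_values(data, order="insertion")
print(by_count)
print(by_first_seen)
```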

I'm not sure how large the overhead for "insertion ordered" would be. Assuming that cache misses are the bottleneck, it could be up to 50%-100% slower.

jreback · 2021-01-03T16:20:23Z

Thanks for the analysis @realead. I don't think performance is a big deal here. I would opt for insertion order (via option 2) if it's not too complicated (it sounds like we already have some of this, so maybe it would be fine). If that is not feasible, then a fix a la .unique would be fine too.

jreback closed this as completed Mar 21, 2016

jreback added Usage Question Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 21, 2016

jreback reopened this Mar 21, 2016

jreback added Difficulty Intermediate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Usage Question labels Mar 21, 2016

jreback added this to the Next Major Release milestone Mar 21, 2016

jreback changed the title ~~pandas.Series.value_counts "sort=False" argument not working~~ ENH: guarantee pandas.Series.value_counts "sort=False" to be original ordering Mar 21, 2016

OXPHOS mentioned this issue Sep 18, 2016

Pivot table drops column/index names=nan when dropna=false #14246

Closed

4 tasks

jreback mentioned this issue Mar 29, 2017

[Improvement] Deterministic value_counts #15833

Closed

jreback modified the milestones: Next Minor Release, Next Major Release Mar 29, 2017

tomspur added a commit to tomspur/pandas that referenced this issue Apr 4, 2017

Series.value_counts: Preserve original ordering

68dc76b

This fixes pandas-dev#12679.

tomspur added a commit to tomspur/pandas that referenced this issue Apr 10, 2017

Series.value_counts: Preserve original ordering

34451de

This fixes pandas-dev#12679.

tomspur added a commit to tomspur/pandas that referenced this issue Apr 10, 2017

Series.value_counts: Preserve original ordering

92f8dbc

This fixes pandas-dev#12679.

jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017

tomspur mentioned this issue Dec 16, 2018

Series.value_counts: Preserve original ordering #24302

Closed

5 tasks

gfyoung added the Enhancement label Dec 22, 2018

WillAyd mentioned this issue Jul 31, 2019

value_counts does not respect ordered categoricals #27670

Closed

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke removed the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 13, 2020

theemathas mentioned this issue Oct 6, 2020

BUG: Joining data frames with MultiIndex results in non-deterministic level order. #36910

Closed

3 tasks

realead mentioned this issue Jan 1, 2021

COMPAT: different orderings in value_counts on 32-bit platforms #11227

Closed

realead mentioned this issue Jan 6, 2021

ENH: making value_counts stable/keeping original ordering #39009

Merged

3 tasks

jreback modified the milestones: Contributions Welcome, 1.3 Jan 22, 2021

jreback closed this as completed in #39009 Jan 22, 2021
