pivot_table very slow on Categorical data; how about an observed keyword argument? #24923 #24953

Merged
Changes from 23 commits
Commits
35 commits
d1554c2
Update to frame and pivot to accept observed kwarg to pass to groupby…
benjaminr Jan 26, 2019
0662fa3
Addition of whatsnew entry.
benjaminr Jan 26, 2019
6121313
Added change to pass observed fixture to pivot_table test, added chan…
benjaminr Jan 26, 2019
9f93ab9
Added "an" to whatsnew and added example from original issue to the t…
benjaminr Jan 30, 2019
ebe5972
Removed unnecessary sentence.
benjaminr Jan 31, 2019
416e9c8
Test separated into own test. Added issue comment. Updated df var nam…
benjaminr Jan 31, 2019
5c62063
Addition of asv for pivot_table of categorical data with observed key…
benjaminr Jan 31, 2019
8663be2
Resolve PEP8 issue.
benjaminr Jan 31, 2019
a1e3afe
Merge in master from upstream.
benjaminr Feb 2, 2019
22637a3
Minor adjustment to asv entry.
benjaminr Feb 2, 2019
088f277
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Feb 8, 2019
672847b
Merge branch 'master' into PR_TOOL_MERGE_PR_24953
jreback Feb 9, 2019
d97a077
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Feb 9, 2019
9de99fa
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Feb 12, 2019
9a9569f
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Feb 13, 2019
c8e085d
Triggering CI tests.
benjaminr Feb 13, 2019
2516386
Triggering CI tests - attempt 2.
benjaminr Feb 13, 2019
0efeed8
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Mar 2, 2019
13168d2
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Mar 2, 2019
8518833
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Mar 20, 2019
58a8f6e
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Mar 20, 2019
12b8fac
Triggering CI.
benjaminr Mar 20, 2019
09af30b
Addition of test to ensure observed pivots on categorial data are ind…
benjaminr Mar 22, 2019
6df9e6d
Removal of test that is otherwise handled by asv.
benjaminr Mar 23, 2019
a23b5d0
Extra asv benchmark to see difference between pivots on categorical d…
benjaminr Mar 23, 2019
8d50e85
Removal of import time.
benjaminr Mar 23, 2019
3d39dff
Fix pep8 issue.
benjaminr Mar 23, 2019
ee696d9
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
Apr 5, 2019
12c0f82
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
Apr 6, 2019
f586e42
Manual merging of conflicts.
benjaminr Apr 20, 2019
cf7e8f5
Trailing whitespace fix.
benjaminr Apr 20, 2019
a3bcf1a
Setting categorical datatype after calc on expected.
benjaminr Apr 21, 2019
bb7cfef
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Apr 22, 2019
5921646
Empty commit to trigger CI.
benjaminr Apr 22, 2019
3c1720c
Merge branch 'master' of github.com:pandas-dev/pandas into feature/pi…
benjaminr Apr 23, 2019
8 changes: 8 additions & 0 deletions asv_bench/benchmarks/reshape.py
@@ -127,6 +127,10 @@ def setup(self):
                              'value1': np.random.randn(N),
                              'value2': np.random.randn(N),
                              'value3': np.random.randn(N)})
+        self.df2 = DataFrame({'col1': list('abcde'), 'col2': list('fghij'),
+                              'col3': [1, 2, 3, 4, 5]})
+        self.df2.col1 = self.df2.col1.astype('category')
+        self.df2.col2 = self.df2.col2.astype('category')

     def time_pivot_table(self):
         self.df.pivot_table(index='key1', columns=['key2', 'key3'])
@@ -139,6 +143,10 @@ def time_pivot_table_margins(self):
         self.df.pivot_table(index='key1', columns=['key2', 'key3'],
                             margins=True)

+    def time_pivot_table_categorical(self):
+        self.df2.pivot_table(index='col1', values='col3', columns='col2',
+                             aggfunc=np.sum, fill_value=0, observed=True)
+

 class Crosstab(object):

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -28,6 +28,7 @@ Other Enhancements
 - Indexing of ``DataFrame`` and ``Series`` now accepts zerodim ``np.ndarray`` (:issue:`24919`)
 - :meth:`Timestamp.replace` now supports the ``fold`` argument to disambiguate DST transition times (:issue:`25017`)
 - :meth:`DataFrame.at_time` and :meth:`Series.at_time` now support :meth:`datetime.time` objects with timezones (:issue:`24043`)
+- :meth:`DataFrame.pivot_table` now accepts an ``observed`` parameter which is passed to underlying calls to :meth:`DataFrame.groupby` to speed up grouping categorical data. (:issue:`24923`)
 - ``Series.str`` has gained :meth:`Series.str.casefold` method to removes all case distinctions present in a string (:issue:`25405`)
 - :meth:`DataFrame.set_index` now works for instances of ``abc.Iterator``, provided their output is of the same length as the calling frame (:issue:`22484`, :issue:`24984`)
 - :meth:`DatetimeIndex.union` now supports the ``sort`` argument. The behaviour of the sort parameter matches that of :meth:`Index.union` (:issue:`24994`)
10 changes: 8 additions & 2 deletions pandas/core/frame.py
@@ -5714,6 +5714,12 @@ def pivot(self, index=None, columns=None, values=None):
         margins_name : string, default 'All'
             Name of the row / column that will contain the totals
             when margins is True.
+        observed : boolean, default False
+            This only applies if any of the groupers are Categoricals.
+            If True: only show observed values for categorical groupers.
+            If False: show all values for categorical groupers.
+
+            .. versionchanged :: 0.25.0

Contributor:
versionadded


         Returns
         -------
@@ -5804,12 +5810,12 @@ def pivot(self, index=None, columns=None, values=None):
     @Appender(_shared_docs['pivot_table'])
     def pivot_table(self, values=None, index=None, columns=None,
                     aggfunc='mean', fill_value=None, margins=False,
-                    dropna=True, margins_name='All'):
+                    dropna=True, margins_name='All', observed=False):
         from pandas.core.reshape.pivot import pivot_table
         return pivot_table(self, values=values, index=index, columns=columns,
                            aggfunc=aggfunc, fill_value=fill_value,
                            margins=margins, dropna=dropna,
-                           margins_name=margins_name)
+                           margins_name=margins_name, observed=observed)

     def stack(self, level=-1, dropna=True):
         """
7 changes: 4 additions & 3 deletions pandas/core/reshape/pivot.py
@@ -24,7 +24,7 @@
 @Appender(_shared_docs['pivot_table'], indents=1)
 def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
                 fill_value=None, margins=False, dropna=True,
-                margins_name='All'):
+                margins_name='All', observed=False):
     index = _convert_by(index)
     columns = _convert_by(columns)

@@ -35,7 +35,8 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
             table = pivot_table(data, values=values, index=index,
                                 columns=columns,
                                 fill_value=fill_value, aggfunc=func,
-                                margins=margins, margins_name=margins_name)
+                                margins=margins, margins_name=margins_name,
+                                observed=observed)
             pieces.append(table)
             keys.append(getattr(func, '__name__', func))

@@ -78,7 +79,7 @@ def pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
                 pass
         values = list(values)

-    grouped = data.groupby(keys, observed=False)
+    grouped = data.groupby(keys, observed=observed)
     agged = grouped.agg(aggfunc)
     if dropna and isinstance(agged, ABCDataFrame) and len(agged.columns):
         agged = agged.dropna(how='all')
51 changes: 47 additions & 4 deletions pandas/tests/reshape/test_pivot.py
@@ -2,6 +2,7 @@

 from collections import OrderedDict
 from datetime import date, datetime, timedelta
+import time

 import numpy as np
 import pytest
@@ -38,18 +39,18 @@ def setup_method(self, method):
                                'E': np.random.randn(11),
                                'F': np.random.randn(11)})

-    def test_pivot_table(self):
+    def test_pivot_table(self, observed):
         index = ['A', 'B']
         columns = 'C'
         table = pivot_table(self.data, values='D',
-                            index=index, columns=columns)
+                            index=index, columns=columns, observed=observed)

         table2 = self.data.pivot_table(
-            values='D', index=index, columns=columns)
+            values='D', index=index, columns=columns, observed=observed)
         tm.assert_frame_equal(table, table2)

         # this works
-        pivot_table(self.data, values='D', index=index)
+        pivot_table(self.data, values='D', index=index, observed=observed)

         if len(index) > 1:
             assert table.index.names == tuple(index)
@@ -65,6 +66,48 @@ def test_pivot_table(self):
                 index + [columns])['D'].agg(np.mean).unstack()
         tm.assert_frame_equal(table, expected)

+    def test_pivot_table_categorical_observed_equal(self, observed):
+        # issue #24923
+        df = pd.DataFrame({'col1': list('abcde'),
+                           'col2': list('fghij'),
+                           'col3': [1, 2, 3, 4, 5]})
+
+        df.col1 = df.col1.astype('category')
+        df.col2 = df.col1.astype('category')
+
+        expected = df.pivot_table(index='col1', values='col3',
+                                  columns='col2', aggfunc=np.sum,
+                                  fill_value=0)
+
+        result = df.pivot_table(index='col1', values='col3',
+                                columns='col2', aggfunc=np.sum,
+                                fill_value=0, observed=observed)
+
+        tm.assert_frame_equal(result, expected)
+
+    def test_pivot_table_categorical_observed_speed(self):
+        # issue #24923
+        df = pd.DataFrame({'col1': list('abcde'),
+                           'col2': list('fghij'),
+                           'col3': [1, 2, 3, 4, 5]})
+
+        df.col1 = df.col1.astype('category')
+        df.col2 = df.col1.astype('category')
+
+        start_time_observed_false = time.time()
Contributor:
Why do you have time in here? These types of things are handled by the asv suite. I would remove this entire test.

Contributor Author:
I'm unsure how else to address the test requirements @WillAyd highlighted. The only measurable outcome that observed changes is the execution time. I have now removed the test.

Happy to listen to suggestions.

Contributor:
You have an asv result, yes? In other words, what do you get if you timeit under master and then under the PR?

Member:
@benjaminr right, I wasn't asking for a test on timing, as the ASV will handle that. I was asking for a test where the data makes a difference when this argument is supplied.

Right now the existing test gives the same result whether observed is True or False, so it doesn't explicitly test that the keyword changes the result. Can you come up with a test and data where the keyword would yield different results, and test that explicitly?
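
As an illustration of the request above, here is a minimal sketch of data for which the keyword changes the result. It assumes a pandas release that already includes the `observed` parameter. The frame and its unused categories are made up for illustration and are not taken from the PR. The string `"sum"` replaces the PR's `np.sum` only to avoid deprecation warnings on recent releases.

```python
import pandas as pd

# Illustrative data (not from the PR): each categorical column declares more
# categories than actually occur in the rows.
df = pd.DataFrame({
    "col1": pd.Categorical(["a", "b", "a"], categories=list("abcde")),
    "col2": pd.Categorical(["x", "y", "x"], categories=list("xyz")),
    "col3": [1, 2, 3],
})

# Shared arguments; only `observed` differs between the two calls.
kwargs = dict(index="col1", columns="col2", values="col3",
              aggfunc="sum", fill_value=0)

all_categories = df.pivot_table(observed=False, **kwargs)
observed_only = df.pivot_table(observed=True, **kwargs)

# Compare the shapes of the two tables.
print(all_categories.shape)
print(observed_only.shape)
```

With `observed=False` the table should carry every declared category, while `observed=True` should keep only the combinations present in the rows, so a test on data like this could compare the two results directly.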

Contributor Author:
Yes, I added an additional benchmark for this to asv_bench/benchmarks/reshape.py in 5c62063.

I've just added an additional benchmark where observed would default to False, with no kwarg being passed.

Master

[ 92.59%] ··· reshape.PivotTable.time_pivot_table_categorical            45.5±2ms
[ 93.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed     failed

Fails because the kwarg doesn't exist in the master branch yet.

feature/pivot_table_groupby_observed

[ 67.59%] ··· reshape.PivotTable.time_pivot_table_categorical            52.9±3ms
[ 68.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed   19.1±1ms

Speed increase is demonstrated here.
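
For a quick local check outside asv, the comparison can be approximated with `timeit`. This is only a sketch under stated assumptions: it needs a pandas build that includes the `observed` parameter, `run` is a hypothetical helper name, and `aggfunc="sum"` stands in for the benchmark's `np.sum` to avoid deprecation warnings on recent releases.

```python
import timeit

import pandas as pd

# Same small frame as the benchmark added to asv_bench/benchmarks/reshape.py.
df = pd.DataFrame({"col1": list("abcde"), "col2": list("fghij"),
                   "col3": [1, 2, 3, 4, 5]})
df["col1"] = df["col1"].astype("category")
df["col2"] = df["col2"].astype("category")


def run(observed):
    # Hypothetical helper: one pivot with the given `observed` setting.
    return df.pivot_table(index="col1", values="col3", columns="col2",
                          aggfunc="sum", fill_value=0, observed=observed)


# Time a few repetitions of each variant. Absolute numbers depend on the
# machine and pandas version, so none are quoted here.
t_false = timeit.timeit(lambda: run(False), number=20)
t_true = timeit.timeit(lambda: run(True), number=20)
print(f"observed=False: {t_false:.4f}s, observed=True: {t_true:.4f}s")
```

The asv figures quoted above remain the authoritative comparison; a one-off `timeit` run is noisier and is only useful as a sanity check.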

+        df.pivot_table(index='col1', values='col3',
+                       columns='col2', aggfunc=np.sum,
+                       fill_value=0, observed=False)
+        total_time_observed_false = time.time() - start_time_observed_false
+
+        start_time_observed_true = time.time()
+        df.pivot_table(index='col1', values='col3',
+                       columns='col2', aggfunc=np.sum,
+                       fill_value=0, observed=True)
+        total_time_observed_true = time.time() - start_time_observed_true
+
+        assert total_time_observed_true < total_time_observed_false
+
     def test_pivot_table_nocols(self):
         df = DataFrame({'rows': ['a', 'b', 'c'],
                         'cols': ['x', 'y', 'z'],