
pivot_table very slow on Categorical data; how about an observed keyword argument? (#24923) (pull request #24953)

Merged

Conversation

@benjaminr (Contributor) commented Jan 26, 2019

@jreback (Contributor) left a comment:

You will be able to use the observed fixture in some of the tests to validate the new behavior.

Review threads: doc/source/whatsnew/v0.24.0.rst (outdated, resolved); pandas/core/frame.py (resolved)
@codecov bot commented Jan 26, 2019

Codecov Report

Merging #24953 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master   #24953   +/-   ##
=======================================
  Coverage   92.38%   92.38%           
=======================================
  Files         166      166           
  Lines       52398    52398           
=======================================
  Hits        48406    48406           
  Misses       3992     3992
| Flag | Coverage Δ |
|------|------------|
| #multiple | 90.8% <100%> (ø) ⬆️ |
| #single | 42.89% <0%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
|----------------|------------|
| pandas/core/frame.py | 96.92% <ø> (ø) ⬆️ |
| pandas/core/reshape/pivot.py | 96.55% <100%> (ø) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 95f8dca...0662fa3.

@codecov bot commented Jan 26, 2019

Codecov Report

Merging #24953 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #24953      +/-   ##
==========================================
- Coverage   91.98%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52372    52372              
==========================================
- Hits        48176    48173       -3     
- Misses       4196     4199       +3
| Flag | Coverage Δ |
|------|------------|
| #multiple | 90.53% <100%> (ø) ⬆️ |
| #single | 40.7% <0%> (-0.16%) ⬇️ |

| Impacted Files | Coverage Δ |
|----------------|------------|
| pandas/core/frame.py | 96.9% <ø> (-0.12%) ⬇️ |
| pandas/core/reshape/pivot.py | 96.54% <100%> (ø) ⬆️ |
| pandas/io/gbq.py | 78.94% <0%> (-10.53%) ⬇️ |
| pandas/util/testing.py | 90.71% <0%> (+0.1%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d74901b...3c1720c.

Commit: …ge version to docstring and updated correct whatsnew.
@benjaminr (Contributor, Author):

> You will be able to use the observed fixture in some of the tests to validate the new behavior.

I've added the observed fixture to the existing pivot_table test - let me know if this is better placed elsewhere.
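
For context, a minimal self-contained sketch of what such a parametrized test could look like. This is illustrative only: the fixture is declared locally here (pandas' own test suite provides a shared `observed` fixture in conftest.py), and the data and aggregation below are made up rather than taken from the PR. It only checks that passing the new observed keyword leaves the result unchanged when every category is actually observed.

```python
import pandas as pd
import pytest
from pandas.testing import assert_frame_equal


@pytest.fixture(params=[True, False, None])
def observed(request):
    # Local stand-in for the shared fixture in pandas' test suite, which
    # parametrizes a test over the accepted values of `observed`.
    return request.param


def test_pivot_table_categorical_observed(observed):
    # Data where every category of 'col1' occurs, so the result should be
    # identical regardless of the `observed` setting.
    df = pd.DataFrame({'col1': list('abcde'),
                       'col3': [1, 2, 3, 4, 5]})
    df['col1'] = df['col1'].astype('category')

    expected = df.pivot_table(index='col1', values='col3', aggfunc='sum')
    result = df.pivot_table(index='col1', values='col3', aggfunc='sum',
                            observed=observed)
    assert_frame_equal(result, expected)
```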

@TomAugspurger (Contributor) commented Jan 27, 2019 via email

@jreback (Contributor) commented Jan 30, 2019

> How far are we from a dict-encoded extension array? That would remove the need for an observed keyword.

Looks easy, but I suspect a) this will be optional to start, and b) it will need quite a bit of testing.

@jreback added this to the 0.25.0 milestone on Jan 30, 2019
@jreback added the Performance (Memory or execution speed performance) and Categorical (Categorical Data Type) labels on Jan 30, 2019
Review threads: doc/source/whatsnew/v0.25.0.rst (outdated, resolved); pandas/tests/reshape/test_pivot.py (resolved)
@TomAugspurger (Contributor):

> Looks easy, but I suspect a) this will be optional to start, and b) it will need quite a bit of testing.

What do you mean by optional?

I'd just prefer to not add additional keywords that'll be made irrelevant soon (almost surely within a year I would think).

Commit: …est - this checks passing observed parameter remains equivalent to not passing.
@jreback (Contributor) commented Jan 31, 2019

> I'd just prefer to not add additional keywords that'll be made irrelevant soon (almost surely within a year I would think).

We already have this for groupby and it totally makes sense. Even if you could guarantee it within a year, I would still add this.

Review threads: pandas/core/frame.py (outdated, resolved); pandas/tests/reshape/test_pivot.py (resolved); pandas/tests/reshape/test_pivot.py (outdated, resolved); pandas/tests/reshape/test_pivot.py (resolved)
@jreback (Contributor) commented Jan 31, 2019

I don't think we have an ASV benchmark for pivoting with categorical data; can you add one representative of this example?
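
A rough sketch of what such a benchmark could look like, following the setup/time_* conventions used in asv_bench/benchmarks/reshape.py. The data sizes and column names below are made up for illustration and are not necessarily what ended up in commit 5c62063; only the method names are taken from the benchmark results shown later in this thread.

```python
import numpy as np
import pandas as pd


class PivotTable:
    # Sketch of the two categorical pivot benchmarks; the real class in
    # asv_bench/benchmarks/reshape.py contains additional benchmarks.
    def setup(self):
        n = 100000
        # Many rows over a modest number of categories, so the cost of
        # materialising unobserved category combinations is visible.
        self.df = pd.DataFrame({
            'col1': pd.Categorical(np.random.randint(0, 100, n).astype(str)),
            'col2': pd.Categorical(np.random.randint(0, 100, n).astype(str)),
            'col3': np.random.randn(n),
        })

    def time_pivot_table_categorical(self):
        # Default behaviour: groups over all category combinations.
        self.df.pivot_table(index='col1', columns='col2', values='col3',
                            aggfunc='sum')

    def time_pivot_table_categorical_observed(self):
        # The new keyword: only observed combinations are grouped.
        self.df.pivot_table(index='col1', columns='col2', values='col3',
                            aggfunc='sum', observed=True)
```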

@pep8speaks bot commented Jan 31, 2019

Hello @benjaminr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-23 15:21:45 UTC

@jreback (Contributor) commented Feb 2, 2019

Can you merge master? Ping on green.

@jreback (Contributor) commented Feb 8, 2019

Can you merge master again? Odd that it's still failing.

@benjaminr (Contributor, Author):

@jreback yeah, I'm not sure what's causing this either. It seems to have happened since I added the ASV benchmark.

@TomAugspurger (Contributor) commented Feb 8, 2019 via email

@WillAyd (Member) commented Mar 21, 2019 via email

@benjaminr (Contributor, Author):

> Right but the expected result doesn't change depending on the argument, so this tests that pivot_table can accept the argument but doesn't test that it actually ever makes a difference - does that make sense? What I'm asking for is a test where observed=True would produce a different result than observed=False, so we can be more explicit about this working.
>
> On Mar 21, 2019, at 12:47 AM, Benjamin Rowell wrote:
>
> > @benjaminr commented on this pull request, in pandas/tests/reshape/test_pivot.py:
> >
> >     @@ -65,6 +65,25 @@ def test_pivot_table(self):
> >                  index + [columns])['D'].agg(np.mean).unstack()
> >              tm.assert_frame_equal(table, expected)
> >
> >     +    def test_pivot_table_categorical_observed(self, observed):
> >     +        # issue #24923
> >     +        df = pd.DataFrame({'col1': list('abcde'),
> >     +                           'col2': list('fghij'),
> >     +                           'col3': [1, 2, 3, 4, 5]})
> >     +
> >     +        df.col1 = df.col1.astype('category')
> >     +        df.col2 = df.col1.astype('category')
> >     +
> >     +        expected = df.pivot_table(index='col1', values='col3',
> >
> > As far as I'm aware, it makes use of the pytest fixture for observed. When I execute the following test individually:
> >
> >     pytest -s -v pandas/tests/reshape/test_pivot.py
> >
> > you see the test pass with all three scenarios:
> >
> >     pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[True] PASSED
> >     pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[False] PASSED
> >     pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[None] PASSED

Yeah, I see where you're coming from. So we need a test that benchmarks the two against each other, as that's all observed should affect.

Commit: …eed faster than those which are set to False.
    If True: only show observed values for categorical groupers.
    If False: show all values for categorical groupers.

    .. versionchanged :: 0.25.0

Review comment (Contributor): use ``versionadded``, not ``versionchanged``.

    df.col1 = df.col1.astype('category')
    df.col2 = df.col1.astype('category')

    start_time_observed_false = time.time()

Review comment (Contributor): Why do you have time in here? These types of things are handled by the ASV suite. I would remove this entire test.

Reply from @benjaminr (Author): I'm unsure how to otherwise address the test requirements @WillAyd highlighted. The only measurable outcome that observed changes is the time of execution. I have now removed it.

Happy to listen to suggestions.

Reply (Contributor): You have an ASV result, yes? IOW, if you timeit under master and then under the PR?

Reply (Member): @benjaminr right, I wasn't asking for a test on timing, as the ASV will handle that. I was asking for a test where the data makes a difference when this argument is supplied.

Right now the existing test gives the same result whether or not observed is True or False, so it doesn't test that adding this actually does anything for the result explicitly. Can you not come up with a test and data where the keyword would yield different results and test that explicitly?

Reply from @benjaminr (Author): Yes, I added an additional benchmark for this to asv_bench/benchmarks/reshape.py in 5c62063.

I've just added an additional benchmark where observed would default to False, with no kwarg being passed.

Master

[ 92.59%] ··· reshape.PivotTable.time_pivot_table_categorical                                                                             45.5±2ms
[ 93.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed                                                                      failed

Fails because the kwarg doesn't exist in the master branch yet.

feature/pivot_table_groupby_observed

[ 67.59%] ··· reshape.PivotTable.time_pivot_table_categorical                                                                             52.9±3ms
[ 68.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed                                                                    19.1±1ms

Speed increase is demonstrated here.

@WillAyd (Member) commented Mar 23, 2019

@benjaminr I am not referring to an ASV. I am asking for an actual test where observed=False and observed=True would generate different results. Does that make sense or are we just not on the same page?

@benjaminr (Contributor, Author):

> @benjaminr I am not referring to an ASV. I am asking for an actual test where observed=False and observed=True would generate different results. Does that make sense or are we just not on the same page?

I know what you're saying, but I'm afraid I can't think of a test that would achieve what you want.

Happy for someone else to look at it.

@WillAyd (Member) commented Mar 24, 2019

@benjaminr ok, so I could have the wrong expectation here. I took this example from the groupby tests:

    cat1 = Categorical(["a", "a", "b", "b"],
                       categories=["a", "b", "z"], ordered=True)
    cat2 = Categorical(["c", "d", "c", "d"],
                       categories=["c", "d", "y"], ordered=True)
    df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
    df['C'] = ['foo', 'bar'] * 2

So pivoting off of that on master yields:

In [8]: df.pivot_table(index='A', columns='B', values='values', aggfunc=np.sum)
Out[8]:
B  c  d
A
a  1  2
b  3  4

This actually doesn't show any of the unobserved categories (in contrast to groupby behavior), so I don't think that, within the scope of this PR, you could actually do anything to fix that.
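
To make the contrast with groupby concrete, here is a small illustration using the same data as above. It is a sketch only: the exact representation of never-occurring combinations (and whether empty groups aggregate to 0 or NaN) depends on the aggregation and pandas version, so only the shape of the result index is the point.

```python
from pandas import Categorical, DataFrame

cat1 = Categorical(["a", "a", "b", "b"],
                   categories=["a", "b", "z"], ordered=True)
cat2 = Categorical(["c", "d", "c", "d"],
                   categories=["c", "d", "y"], ordered=True)
df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# groupby with observed=False keeps the unobserved categories ('z' and 'y'),
# producing the full 3 x 3 cartesian index of category combinations.
full = df.groupby(["A", "B"], observed=False)["values"].sum()

# groupby with observed=True keeps only the four combinations that occur,
# matching the 2 x 2 shape pivot_table produces on master for this data.
observed_only = df.groupby(["A", "B"], observed=True)["values"].sum()

assert len(full) == 9 and len(observed_only) == 4
```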

So perhaps my hesitation here is that, by accepting observed as a keyword argument to pivot, we might be unintentionally signaling to users that they could get the cartesian product by passing observed=False, which actually isn't the case.

I see two solutions here:

  1. Simply pass observed=True behind the scenes from pivot_table, removing its exposure as a parameter, OR
  2. Have a follow-up PR where observed=False would actually be supported through pivot_table

I'd prefer option 1 in this case, but @jreback I'm curious if you have any thoughts.
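
For illustration of option 1 above, a minimal sketch of the idea of hard-coding observed=True internally rather than exposing it. The helper name and structure are hypothetical and deliberately simplified compared with the real pandas/core/reshape/pivot.py code (single aggregation, no margins or dropna handling).

```python
def _pivot_table_sum(data, index, columns, values):
    # Hypothetical, simplified stand-in for pivot_table's internals. The
    # point is the groupby call: observed=True is fixed here instead of
    # being exposed as a pivot_table parameter, so unobserved categorical
    # combinations are never materialised and users never see the keyword.
    keys = list(index) + list(columns)
    grouped = data.groupby(keys, observed=True)
    agged = grouped[values].sum()
    # Move the column keys into the columns to obtain the pivoted shape.
    return agged.unstack(list(columns))
```

For example, `_pivot_table_sum(df, ["A"], ["B"], "values")` on the frame above would return the same 2 x 2 table that pivot_table produces on master.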

@jreback (Contributor) commented Apr 5, 2019

Can you merge master?

@WillAyd (Member) commented Apr 10, 2019

I think the test errors here are unrelated, but I just want to follow up on my previous comment. Can we simply not expose the observed argument at all to end users in this PR? I find it confusing from an API perspective.

@jreback (Contributor) commented Apr 20, 2019

Can you merge master?

@jreback (Contributor) commented Apr 23, 2019

This LGTM. Ping on green.

@benjaminr (Contributor, Author) commented Apr 23, 2019

@jreback All green.

@jreback merged commit 65466f0 into pandas-dev:master on Apr 26, 2019
@jreback (Contributor) commented Apr 26, 2019

Thanks for sticking with it @benjaminr.

Labels: Categorical (Categorical Data Type), Performance (Memory or execution speed performance)
Development: successfully merging this pull request may close issue #24923 (pivot_table very slow on Categorical data; how about an observed keyword argument?).
6 participants