
pivot_table very slow on Categorical data; how about an observed keyword argument? (#24923) (pull request #24953)

Merged

Conversation

@benjaminr (Contributor) commented Jan 26, 2019

@jreback (Contributor) left a comment:

You will be able to use the observed fixture in some of the tests to validate the new behavior.

Review threads: doc/source/whatsnew/v0.24.0.rst (outdated, resolved); pandas/core/frame.py (resolved)
@codecov bot commented Jan 26, 2019

Codecov Report

Merging #24953 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master   #24953   +/-   ##
=======================================
  Coverage   92.38%   92.38%           
=======================================
  Files         166      166           
  Lines       52398    52398           
=======================================
  Hits        48406    48406           
  Misses       3992     3992
| Flag | Coverage Δ |
|------|------------|
| #multiple | 90.8% <100%> (ø) ⬆️ |
| #single | 42.89% <0%> (ø) ⬆️ |

| Impacted Files | Coverage Δ |
|----------------|------------|
| pandas/core/frame.py | 96.92% <ø> (ø) ⬆️ |
| pandas/core/reshape/pivot.py | 96.55% <100%> (ø) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 95f8dca...0662fa3.

@codecov bot commented Jan 26, 2019

Codecov Report

Merging #24953 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #24953      +/-   ##
==========================================
- Coverage   91.98%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52372    52372              
==========================================
- Hits        48176    48173       -3     
- Misses       4196     4199       +3
| Flag | Coverage Δ |
|------|------------|
| #multiple | 90.53% <100%> (ø) ⬆️ |
| #single | 40.7% <0%> (-0.16%) ⬇️ |

| Impacted Files | Coverage Δ |
|----------------|------------|
| pandas/core/frame.py | 96.9% <ø> (-0.12%) ⬇️ |
| pandas/core/reshape/pivot.py | 96.54% <100%> (ø) ⬆️ |
| pandas/io/gbq.py | 78.94% <0%> (-10.53%) ⬇️ |
| pandas/util/testing.py | 90.71% <0%> (+0.1%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d74901b...3c1720c.

Commit: …ge version to docstring and updated correct whatsnew.
@benjaminr (Contributor, Author):

> You will be able to use the observed fixture in some of the tests to validate the new behavior.

I've added the observed fixture to the existing pivot_table test - let me know if this is better placed elsewhere.
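
For context, a minimal self-contained sketch of what such a parametrized test could look like. This is illustrative only: the fixture is declared locally here (pandas' own test suite provides a shared `observed` fixture in conftest.py), and the data and aggregation below are made up rather than taken from the PR. It only checks that passing the new observed keyword leaves the result unchanged when every category is actually observed.

```python
import pandas as pd
import pytest
from pandas.testing import assert_frame_equal


@pytest.fixture(params=[True, False, None])
def observed(request):
    # Local stand-in for the shared fixture in pandas' test suite, which
    # parametrizes a test over the accepted values of `observed`.
    return request.param


def test_pivot_table_categorical_observed(observed):
    # Data where every category of 'col1' occurs, so the result should be
    # identical regardless of the `observed` setting.
    df = pd.DataFrame({'col1': list('abcde'),
                       'col3': [1, 2, 3, 4, 5]})
    df['col1'] = df['col1'].astype('category')

    expected = df.pivot_table(index='col1', values='col3', aggfunc='sum')
    result = df.pivot_table(index='col1', values='col3', aggfunc='sum',
                            observed=observed)
    assert_frame_equal(result, expected)
```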

@TomAugspurger (Contributor) commented Jan 27, 2019 via email

@jreback (Contributor) commented Jan 30, 2019

> How far are we from a dict-encoded extension array? That would remove the need for an observed keyword.

Looks easy, but I suspect a) this will be optional to start, and b) it will need quite a bit of testing.

@jreback added this to the 0.25.0 milestone on Jan 30, 2019
@jreback added the Performance (Memory or execution speed performance) and Categorical (Categorical Data Type) labels on Jan 30, 2019
Review threads: doc/source/whatsnew/v0.25.0.rst (outdated, resolved); pandas/tests/reshape/test_pivot.py (resolved)
@TomAugspurger (Contributor):

> Looks easy, but I suspect a) this will be optional to start, and b) it will need quite a bit of testing.

What do you mean by optional?

I'd just prefer to not add additional keywords that'll be made irrelevant soon (almost surely within a year I would think).

Commit: …est - this checks passing observed parameter remains equivalent to not passing.
@jreback (Contributor) commented Jan 31, 2019

> I'd just prefer to not add additional keywords that'll be made irrelevant soon (almost surely within a year I would think).

We already have this for groupby and it totally makes sense. Even if you could guarantee it within a year, I would still add this.

Review threads: pandas/core/frame.py (outdated, resolved); pandas/tests/reshape/test_pivot.py (resolved); pandas/tests/reshape/test_pivot.py (outdated, resolved); pandas/tests/reshape/test_pivot.py (resolved)
@jreback (Contributor) commented Jan 31, 2019

I don't think we have an ASV benchmark for pivoting with categorical data; can you add one representative of this example?
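
A rough sketch of what such a benchmark could look like, following the setup/time_* conventions used in asv_bench/benchmarks/reshape.py. The data sizes and column names below are made up for illustration and are not necessarily what ended up in commit 5c62063; only the method names are taken from the benchmark results shown later in this thread.

```python
import numpy as np
import pandas as pd


class PivotTable:
    # Sketch of the two categorical pivot benchmarks; the real class in
    # asv_bench/benchmarks/reshape.py contains additional benchmarks.
    def setup(self):
        n = 100000
        # Many rows over a modest number of categories, so the cost of
        # materialising unobserved category combinations is visible.
        self.df = pd.DataFrame({
            'col1': pd.Categorical(np.random.randint(0, 100, n).astype(str)),
            'col2': pd.Categorical(np.random.randint(0, 100, n).astype(str)),
            'col3': np.random.randn(n),
        })

    def time_pivot_table_categorical(self):
        # Default behaviour: groups over all category combinations.
        self.df.pivot_table(index='col1', columns='col2', values='col3',
                            aggfunc='sum')

    def time_pivot_table_categorical_observed(self):
        # The new keyword: only observed combinations are grouped.
        self.df.pivot_table(index='col1', columns='col2', values='col3',
                            aggfunc='sum', observed=True)
```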

@pep8speaks bot commented Jan 31, 2019

Hello @benjaminr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-23 15:21:45 UTC

@jreback (Contributor) commented Feb 2, 2019

Can you merge master? Ping on green.

@jreback (Contributor) commented Feb 8, 2019

Can you merge master again? Odd that it's still failing.

@benjaminr (Contributor, Author):

@jreback yeah, I'm not sure what's causing this either. It seems to have happened since I added the ASV benchmark.

@TomAugspurger (Contributor) commented Feb 8, 2019 via email

@WillAyd (Member) commented Mar 21, 2019 via email

@benjaminr (Contributor, Author):

> Right but the expected result doesn't change depending on the argument, so this tests that pivot_table can accept the argument but doesn't test that it actually ever makes a difference - does that make sense? What I'm asking for is a test where observed=True would produce a different result than observed=False, so we can be more explicit about this working.
>
> On Mar 21, 2019, at 12:47 AM, Benjamin Rowell wrote:
>
> > @benjaminr commented on this pull request, in pandas/tests/reshape/test_pivot.py:
> >
> >     @@ -65,6 +65,25 @@ def test_pivot_table(self):
> >                  index + [columns])['D'].agg(np.mean).unstack()
> >              tm.assert_frame_equal(table, expected)
> >
> >     +    def test_pivot_table_categorical_observed(self, observed):
> >     +        # issue #24923
> >     +        df = pd.DataFrame({'col1': list('abcde'),
> >     +                           'col2': list('fghij'),
> >     +                           'col3': [1, 2, 3, 4, 5]})
> >     +
> >     +        df.col1 = df.col1.astype('category')
> >     +        df.col2 = df.col1.astype('category')
> >     +
> >     +        expected = df.pivot_table(index='col1', values='col3',
> >
> > As far as I'm aware, it makes use of the pytest fixture for observed. When I execute the following test individually:
> >
> >     pytest -s -v pandas/tests/reshape/test_pivot.py
> >
> > you see the test pass with all three scenarios:
> >
> >     pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[True] PASSED
> >     pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[False] PASSED
> >     pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[None] PASSED

Yeah, I see where you're coming from. So we need a test that benchmarks the two against each other, as that's all observed should affect.

Commit: …eed faster than those which are set to False.
    If True: only show observed values for categorical groupers.
    If False: show all values for categorical groupers.

    .. versionchanged :: 0.25.0

Review comment (Contributor): use ``versionadded``, not ``versionchanged``.

    df.col1 = df.col1.astype('category')
    df.col2 = df.col1.astype('category')

    start_time_observed_false = time.time()

Review comment (Contributor): Why do you have time in here? These types of things are handled by the ASV suite. I would remove this entire test.

Reply from @benjaminr (Author): I'm unsure how to otherwise address the test requirements @WillAyd highlighted. The only measurable outcome that observed changes is the time of execution. I have now removed it.

Happy to listen to suggestions.

Reply (Contributor): You have an ASV result, yes? IOW, if you timeit under master and then under the PR?

Reply (Member): @benjaminr right, I wasn't asking for a test on timing, as the ASV will handle that. I was asking for a test where the data makes a difference when this argument is supplied.

Right now the existing test gives the same result whether or not observed is True or False, so it doesn't test that adding this actually does anything for the result explicitly. Can you not come up with a test and data where the keyword would yield different results and test that explicitly?

Reply from @benjaminr (Author): Yes, I added an additional benchmark for this to asv_bench/benchmarks/reshape.py in 5c62063.

I've just added an additional benchmark where observed would default to False, with no kwarg being passed.

Master

[ 92.59%] ··· reshape.PivotTable.time_pivot_table_categorical                                                                             45.5±2ms
[ 93.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed                                                                      failed

Fails because the kwarg doesn't exist in the master branch yet.

feature/pivot_table_groupby_observed

[ 67.59%] ··· reshape.PivotTable.time_pivot_table_categorical                                                                             52.9±3ms
[ 68.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed                                                                    19.1±1ms

Speed increase is demonstrated here.

@WillAyd (Member) commented Mar 23, 2019

@benjaminr I am not referring to an ASV. I am asking for an actual test where observed=False and observed=True would generate different results. Does that make sense or are we just not on the same page?

@benjaminr (Contributor, Author):

> @benjaminr I am not referring to an ASV. I am asking for an actual test where observed=False and observed=True would generate different results. Does that make sense or are we just not on the same page?

I know what you're saying, but I'm afraid I can't think of a test that would achieve what you want.

Happy for someone else to look at it.

@WillAyd (Member) commented Mar 24, 2019

@benjaminr ok, so I could have the wrong expectation here. I took this example from the groupby tests:

    cat1 = Categorical(["a", "a", "b", "b"],
                       categories=["a", "b", "z"], ordered=True)
    cat2 = Categorical(["c", "d", "c", "d"],
                       categories=["c", "d", "y"], ordered=True)
    df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
    df['C'] = ['foo', 'bar'] * 2

So pivoting off of that on master yields:

In [8]: df.pivot_table(index='A', columns='B', values='values', aggfunc=np.sum)
Out[8]:
B  c  d
A
a  1  2
b  3  4

This actually doesn't show any of the unobserved categories (in contrast to groupby behavior), so I don't think that, within the scope of this PR, you could actually do anything to fix that.
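
To make the contrast with groupby concrete, here is a small illustration using the same data as above. It is a sketch only: the exact representation of never-occurring combinations (and whether empty groups aggregate to 0 or NaN) depends on the aggregation and pandas version, so only the shape of the result index is the point.

```python
from pandas import Categorical, DataFrame

cat1 = Categorical(["a", "a", "b", "b"],
                   categories=["a", "b", "z"], ordered=True)
cat2 = Categorical(["c", "d", "c", "d"],
                   categories=["c", "d", "y"], ordered=True)
df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# groupby with observed=False keeps the unobserved categories ('z' and 'y'),
# producing the full 3 x 3 cartesian index of category combinations.
full = df.groupby(["A", "B"], observed=False)["values"].sum()

# groupby with observed=True keeps only the four combinations that occur,
# matching the 2 x 2 shape pivot_table produces on master for this data.
observed_only = df.groupby(["A", "B"], observed=True)["values"].sum()

assert len(full) == 9 and len(observed_only) == 4
```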

So perhaps my hesitation here is that, by accepting observed as a keyword argument to pivot, we might be unintentionally signaling to users that they could get the cartesian product by passing observed=False, which actually isn't the case.

I see two solutions here:

  1. Simply pass observed=True behind the scenes from pivot_table, removing its exposure as a parameter, OR
  2. Have a follow-up PR where observed=False would actually be supported through pivot_table

I'd prefer option 1 in this case, but @jreback I'm curious if you have any thoughts.
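
For illustration of option 1 above, a minimal sketch of the idea of hard-coding observed=True internally rather than exposing it. The helper name and structure are hypothetical and deliberately simplified compared with the real pandas/core/reshape/pivot.py code (single aggregation, no margins or dropna handling).

```python
def _pivot_table_sum(data, index, columns, values):
    # Hypothetical, simplified stand-in for pivot_table's internals. The
    # point is the groupby call: observed=True is fixed here instead of
    # being exposed as a pivot_table parameter, so unobserved categorical
    # combinations are never materialised and users never see the keyword.
    keys = list(index) + list(columns)
    grouped = data.groupby(keys, observed=True)
    agged = grouped[values].sum()
    # Move the column keys into the columns to obtain the pivoted shape.
    return agged.unstack(list(columns))
```

For example, `_pivot_table_sum(df, ["A"], ["B"], "values")` on the frame above would return the same 2 x 2 table that pivot_table produces on master.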

@jreback (Contributor) commented Apr 5, 2019

Can you merge master?

@WillAyd (Member) commented Apr 10, 2019

I think the test errors here are unrelated, but I just want to follow up on my previous comment. Can we simply not expose the observed argument at all to end users in this PR? I find it confusing from an API perspective.

@jreback (Contributor) commented Apr 20, 2019

Can you merge master?

@jreback (Contributor) commented Apr 23, 2019

This LGTM. Ping on green.

@benjaminr (Contributor, Author) commented Apr 23, 2019

@jreback All green.

@jreback merged commit 65466f0 into pandas-dev:master on Apr 26, 2019
@jreback (Contributor) commented Apr 26, 2019

Thanks for sticking with it @benjaminr.

Labels: Categorical (Categorical Data Type), Performance (Memory or execution speed performance)
Development: successfully merging this pull request may close issue #24923 (pivot_table very slow on Categorical data; how about an observed keyword argument?).
6 participants