pivot_table very slow on Categorical data; how about an observed keyword argument? #24923 #24953
… calls for categorical data.
you will be able to use the observed fixture in some of the tests to validate the new behavior
Codecov Report
@@ Coverage Diff @@
## master #24953 +/- ##
=======================================
Coverage 92.38% 92.38%
=======================================
Files 166 166
Lines 52398 52398
=======================================
Hits 48406 48406
Misses 3992 3992
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24953 +/- ##
==========================================
- Coverage 91.98% 91.98% -0.01%
==========================================
Files 175 175
Lines 52372 52372
==========================================
- Hits 48176 48173 -3
- Misses 4196 4199 +3
Continue to review full report at Codecov.
|
…ge version to docstring and updated correct whatsnew.
I've added the observed fixture to the existing pivot_table test - let me know if this is better placed elsewhere. |
How far are we from a dict-encoded extension array? That would remove the need for an observed keyword.
…________________________________
From: Benjamin Rowell <notifications@github.com>
Sent: Saturday, January 26, 2019 4:45 PM
To: pandas-dev/pandas
Cc: Subscribed
Subject: Re: [pandas-dev/pandas] pivot_table very slow on Categorical data; how about an observed keyword argument? #24923 (#24953)
|
looks easy, but I suspect a) this will be optional to start, and b) need quite a bit of testing. |
What do you mean by optional? I'd just prefer to not add additional keywords that'll be made irrelevant soon (almost surely within a year I would think). |
…est - this checks passing observed parameter remains equivalent to not passing.
we already have this for groupby and this totally makes sense. Even if you could guarantee it within a year I would still add this. |
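For readers unfamiliar with the groupby keyword mentioned here, a minimal sketch of what `observed` does in `DataFrame.groupby` follows. The frame and the column names (`key`, `val`) are made up for illustration and do not come from this PR.

```python
import pandas as pd

# Hypothetical data: category "z" is declared but never appears in the column.
cat = pd.Categorical(["a", "a", "b"], categories=["a", "b", "z"])
df = pd.DataFrame({"key": cat, "val": [1, 2, 3]})

# observed=False keeps every declared category as a group, even empty ones.
all_groups = df.groupby("key", observed=False)["val"].sum()

# observed=True keeps only the categories that actually occur in the data.
seen_groups = df.groupby("key", observed=True)["val"].sum()

print(list(all_groups.index))   # ['a', 'b', 'z']
print(list(seen_groups.index))  # ['a', 'b']
```

With several categorical keys the unobserved combinations multiply, which is where the extra work comes from.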
i don't think we have an asv for pivoting with categorical data, can you add one representative of this example. |
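A benchmark of the kind requested might look roughly like the sketch below. asv calls `setup` and then times any method whose name starts with `time_`. The class name, data size, and column contents are assumptions, not the benchmark that was eventually committed; only the two method names echo the asv output quoted later in this thread.

```python
import numpy as np
import pandas as pd


class PivotTableCategorical:
    # Hypothetical benchmark class; asv needs no import here, it discovers
    # the class and its time_* methods by name.

    def setup(self):
        n = 200  # assumed size, small enough to run quickly
        keys = [f"k{i}" for i in range(n)]
        self.df = pd.DataFrame(
            {
                "col1": pd.Categorical(keys),
                "col2": pd.Categorical(keys[::-1]),
                "col3": np.arange(n),
            }
        )

    def time_pivot_table_categorical(self):
        # observed=False: the groupby underneath considers every
        # col1 x col2 category combination.
        self.df.pivot_table(
            index="col1", columns="col2", values="col3",
            aggfunc="sum", fill_value=0, observed=False,
        )

    def time_pivot_table_categorical_observed(self):
        # observed=True: only combinations present in the data are grouped.
        self.df.pivot_table(
            index="col1", columns="col2", values="col3",
            aggfunc="sum", fill_value=0, observed=True,
        )
```

`observed=False` is passed explicitly in the first method so the sketch does not depend on the default, which has changed between pandas versions.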
…es so more explicit in tm assertion call.
Hello @benjaminr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-04-23 15:21:45 UTC |
can you merge master. ping on green. |
can you merge master again - odd it's still failing |
…vot_table_groupby_observed
@jreback yeah I'm not sure what's causing this either. Seems to have happened since I added the asv. |
Is it the repr_html one that’s failing? I fixed that in another PR (on my phone so can’t link)
… On Feb 8, 2019, at 15:37, Benjamin Rowell ***@***.***> wrote:
@jreback yeah I'm not sure what's causing this either. Seems to have happened since I added the asv.
|
Right but the expected result doesn’t change depending on the argument, so this tests that pivot_table can accept the argument but doesn’t test that it actually ever makes a difference - does that make sense?
What I’m asking for is a test where observed=True would produce a different result than observed=False so we can be more explicit about this working.
On Mar 21, 2019, at 12:47 AM, Benjamin Rowell ***@***.***> wrote:
@benjaminr commented on this pull request.
In pandas/tests/reshape/test_pivot.py:
> @@ -65,6 +65,25 @@ def test_pivot_table(self):
index + [columns])['D'].agg(np.mean).unstack()
tm.assert_frame_equal(table, expected)
+ def test_pivot_table_categorical_observed(self, observed):
+ # issue #24923
+ df = pd.DataFrame({'col1': list('abcde'),
+ 'col2': list('fghij'),
+ 'col3': [1, 2, 3, 4, 5]})
+
+ df.col1 = df.col1.astype('category')
+ df.col2 = df.col1.astype('category')
+
+ expected = df.pivot_table(index='col1', values='col3',
As far as I'm aware, it makes use of the pytest fixture for observed.
When I execute the following test individually:
pytest -s -v pandas/tests/reshape/test_pivot.py
You see the test pass with all three scenarios.
pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[True] PASSED
pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[False] PASSED
pandas/tests/reshape/test_pivot.py::TestPivotTable::test_pivot_table_categorical_observed[None] PASSED
|
Yeah I see where you're coming from. So we need a test that benchmarks the two against each other, as that's all observed should affect. |
…eed faster than those which are set to False.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.

.. versionchanged :: 0.25.0
versionadded
pandas/tests/reshape/test_pivot.py
Outdated
df.col1 = df.col1.astype('category')
df.col2 = df.col1.astype('category')

start_time_observed_false = time.time()
why do you have time in here? these types of things are handled by the asv suite. i would remove this entire test.
I'm unsure how to otherwise address the test requirements @WillAyd highlighted. The only measurable outcome that observed changes is the time of execution. Have now removed.
Happy to listen to suggestions.
you have an asv result, yes? IOW if you timeit under master and then under the PR
@benjaminr right I wasn't asking for a test on timing as the ASV will handle that. I was asking for a test where the data makes a difference when this argument is supplied.
Right now the existing test gives the same result whether or not observed is True or False, so it doesn't test that adding this actually does anything for the result explicitly. Can you not come up with a test and data where the keyword would yield different results and test that explicitly?
Yes, I added an additional benchmark for this to asv_bench/benchmarks/reshape.py in 5c62063.
I've just added an additional benchmark where observed would default to False, with no kwarg being passed.
Master
[ 92.59%] ··· reshape.PivotTable.time_pivot_table_categorical 45.5±2ms
[ 93.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed failed
Fails because the kwarg doesn't exist in the master branch yet.
feature/pivot_table_groupby_observed
[ 67.59%] ··· reshape.PivotTable.time_pivot_table_categorical 52.9±3ms
[ 68.52%] ··· reshape.PivotTable.time_pivot_table_categorical_observed 19.1±1ms
Speed increase is demonstrated here.
…ata when observed is passed as True and when it defaults to False.
@benjaminr I am not referring to an ASV. I am asking for an actual test where observed=False and observed=True would generate different results. Does that make sense or are we just not on the same page? |
I know what you're saying, but I'm afraid I can't think of a test that would achieve what you want. Happy for someone else to look at it. |
@benjaminr ok so I could have the wrong expectation here. I took this example from the groupby tests:
cat1 = Categorical(["a", "a", "b", "b"],
                   categories=["a", "b", "z"], ordered=True)
cat2 = Categorical(["c", "d", "c", "d"],
                   categories=["c", "d", "y"], ordered=True)
df = DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})
df['C'] = ['foo', 'bar'] * 2
So pivoting off of that on master yields:
In [8]: df.pivot_table(index='A', columns='B', values='values', aggfunc=np.sum)
Out[8]:
B  c  d
A
a  1  2
b  3  4
Which actually doesn't show any of the unobserved categories (in contrast to groupby behavior), so I don't think within the scope of this PR you could actually do anything to fix that. So perhaps my hesitation here is that by accepting I see two solutions here:
I'd prefer option 1 in this case but @jreback curious if you have any thoughts |
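For contrast, the groupby behavior referred to in the comment above can be sketched on the same two categoricals. This is only an illustration of `groupby`, not of what `pivot_table` does or should do; the variable names `full` and `seen` are made up here.

```python
import pandas as pd

cat1 = pd.Categorical(["a", "a", "b", "b"],
                      categories=["a", "b", "z"], ordered=True)
cat2 = pd.Categorical(["c", "d", "c", "d"],
                      categories=["c", "d", "y"], ordered=True)
df = pd.DataFrame({"A": cat1, "B": cat2, "values": [1, 2, 3, 4]})

# observed=False: one group per combination of declared categories (3 x 3).
full = df.groupby(["A", "B"], observed=False)["values"].sum()

# observed=True: one group per combination that occurs in the data.
seen = df.groupby(["A", "B"], observed=True)["values"].sum()

print(len(full), len(seen))  # 9 4
```

A test that distinguishes the two settings needs data like this, where at least one declared category never occurs.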
can you merge master |
…vot_table_groupby_observed
…vot_table_groupby_observed
I think test errors here are unrelated but just want to follow up on my previous comment. Can we simply just not expose the |
can you merge master |
…vot_table_groupby_observed
…vot_table_groupby_observed
this lgtm. ping on green. |
@jreback All green. |
thanks for sticking with it @benjaminr |
git diff upstream/master -u -- "*.py" | flake8 --diff