API: Make describe changes backwards compatible #34798
Adds the new behavior as a feature flag / deprecation. Closes pandas-dev#33903
cc @david-cortes. |
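The feature-flag-plus-deprecation approach described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pandas implementation: `describe_datetimes` is a hypothetical helper, the warning text is invented for the example, and the legacy path is assumed to be a categorical-style summary. Only the keyword name `datetime_is_numeric` comes from this PR.

```python
import warnings
from datetime import datetime, timedelta


def describe_datetimes(values, datetime_is_numeric=False):
    """Hypothetical stand-in for a flag-gated describe(); not pandas code."""
    if not datetime_is_numeric:
        # Default path: keep the old output, but announce the upcoming change.
        warnings.warn(
            "Summarising datetimes as categorical is deprecated; pass "
            "datetime_is_numeric=True to adopt the future behaviour.",
            FutureWarning,
            stacklevel=2,
        )
        top = max(set(values), key=values.count)
        return {
            "count": len(values),
            "unique": len(set(values)),
            "top": top,
            "freq": values.count(top),
        }
    # Opt-in path: numeric-style summary, no warning.
    first = min(values)
    mean = first + sum((v - first for v in values), timedelta()) / len(values)
    return {"count": len(values), "mean": mean, "min": first, "max": max(values)}
```

Callers who do nothing keep the old result and see a `FutureWarning`; callers who pass the flag get the new result silently, which is the shape of migration path the PR description refers to.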
can you add to the deprecation removal list as well for 2.0
result = df.describe(include="all", datetime_is_numeric=True)
tm.assert_frame_equal(result, expected)

s1_ = s1.describe()
can you make as a separate test
@@ -98,3 +98,19 @@ def test_describe_with_tz(self, tz_naive_fixture):
index=["count", "mean", "min", "25%", "50%", "75%", "max"],
)
tm.assert_series_equal(result, expected)

with tm.assert_produces_warning(FutureWarning):
same
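As a rough illustration of the "separate test" request, the warning check and the flag check can each live in their own test. The sketch below uses only the standard library's `unittest` rather than pandas' `tm.assert_produces_warning`, and `describe_stub` is a hypothetical stand-in, not the function under review.

```python
import unittest
import warnings


def describe_stub(values, datetime_is_numeric=False):
    """Hypothetical stand-in for a flag-gated describe(); not pandas code."""
    if not datetime_is_numeric:
        warnings.warn("legacy summary is deprecated", FutureWarning, stacklevel=2)
        return {"count": len(values), "unique": len(set(values))}
    return {"count": len(values), "min": min(values), "max": max(values)}


class TestDescribeStub(unittest.TestCase):
    def test_default_warns(self):
        # The legacy (default) path must emit a FutureWarning.
        with self.assertWarns(FutureWarning):
            result = describe_stub([3, 1, 2])
        self.assertEqual(result, {"count": 3, "unique": 3})

    def test_flag_is_silent(self):
        # Opting in must not warn: escalate any warning to an error.
        with warnings.catch_warnings():
            warnings.simplefilter("error")
            result = describe_stub([3, 1, 2], datetime_is_numeric=True)
        self.assertEqual(result, {"count": 3, "min": 1, "max": 3})
```

Keeping the two cases apart means a failure points directly at either the deprecation path or the opt-in path. The tests can be run with `python -m unittest`.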
|
BTW, some other remarks on the new behaviour: should this also be enabled in DataFrame.describe by default? Also, I would include "std" as well for datetimes, but with a NaN entry. That way the keys are always the same for all numeric types, regardless of the exact dtype (and that also ensures the ordering is preserved when doing describe on a dataframe with both numeric and datetime columns) |
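The suggestion to keep "std" as a NaN entry for datetimes can be illustrated with a small plain-Python sketch. Everything here is hypothetical (the `KEYS` list, the helper names, the reduced set of statistics); it only shows why a shared key set keeps the row order identical across numeric and datetime columns, and it is not how pandas computes `describe`.

```python
import math
from datetime import datetime, timedelta

# Hypothetical fixed key order shared by every column summary.
KEYS = ["count", "mean", "std", "min", "max"]


def summarize_numbers(values):
    n = len(values)
    avg = sum(values) / n
    # Sample standard deviation (ddof=1); NaN when there is a single value.
    std = math.sqrt(sum((v - avg) ** 2 for v in values) / (n - 1)) if n > 1 else math.nan
    return {"count": n, "mean": avg, "std": std, "min": min(values), "max": max(values)}


def summarize_datetimes(values):
    n = len(values)
    first = min(values)
    avg = first + sum((v - first for v in values), timedelta()) / n
    # "std" stays in the result, filled with NaN, so the keys match the numeric case.
    return {"count": n, "mean": avg, "std": math.nan, "min": first, "max": max(values)}


def describe_columns(columns):
    """Summarise each column with the same keys, whatever its dtype."""
    summaries = {}
    for name, values in columns.items():
        if isinstance(values[0], datetime):
            summaries[name] = summarize_datetimes(values)
        else:
            summaries[name] = summarize_numbers(values)
    return summaries
```

Because both helpers emit the same keys in the same order, combining columns never reorders or drops a row; the datetime column simply carries NaN where a statistic is not defined.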
I was surprised to see that master didn't treat datetimes as numeric in |
I think we should fix it, because if we add a keyword to enable the future behaviour, we should ensure we have the proper future behaviour we want (which is now not yet the case, IMO). Unless we disallow the keyword for dataframe .. |
Fixing that will get a bit messy, since it overlaps with the |
In the spirit of practicality over purity, do we really need to do this? Outside of the Dask test, do we expect end users would be relying on the old behavior? |
For timedelta that is clear, I think, since that also returns the numeric describe output, while period does not? |
It seems timedelta is already included by default, on released version:
|
Hard to say. It's not hard to construct code that relies on the old behavior though. |
My point is that this seems like a lot of churn for potentially negligible value add. I can't think of a pipeline where the old behavior is actually useful, so I don't think it's worth adding a keyword that we subsequently plan to deprecate for the sake of maintaining compat |
There are a lot of things in pandas that I personally don't find particularly useful, but there are probably still a lot of people using those things. So I also find it hard to say whether it is important here. But it also doesn't seem difficult to actually do it with a deprecation. But whether we go through a deprecation or not, we still need to agree on what we think the new / future behaviour should be. |
Split the tests and fixed the merge conflicts. I'm happy with this as is since it restores the default 1.0 behavior. If we want additional changes then let's do them as followups. |
So I suppose this means you didn't change the dataframe behaviour (which is certainly fine to leave for a follow-up). But should we then maybe raise a NotImplementedError when specifying |
lgtm. @jorisvandenbossche comment here or in a followup ok |
@jorisvandenbossche can you remind me what the dataframe issue is? That datetimes are not included in this output while timedeltas are?
In [4]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
Out[4]:
         b
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0
In [6]: pd.DataFrame({'a': pd.timedelta_range("2012", periods=3), 'b': [1, 2, 3]}).describe()
Out[6]:
                               a    b
count                          3  3.0
mean   1 days 00:00:00.000002012  2.0
std              1 days 00:00:00  1.0
min    0 days 00:00:00.000002012  1.0
25%    0 days 12:00:00.000002012  1.5
50%    1 days 00:00:00.000002012  2.0
75%    1 days 12:00:00.000002012  2.5
max    2 days 00:00:00.000002012  3.0
Interestingly if you have datetime, numeric, and timedelta, then the timedelta is not included:
In [5]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3], 'c': pd.period_range('2000', periods=3)}).describe()
Out[5]:
         b
count  3.0
mean   2.0
std    1.0
min    1.0
25%    1.5
50%    2.0
75%    2.5
max    3.0
This all seems buggy, but I hope can be handled separately. |
Yes, that needs to be cleaned up and can indeed be handled separately. But the one thing that might be relevant for this PR:
is not actually doing what the keyword is expected to be doing (I suppose, didn't try with this branch). |
@jorisvandenbossche gotcha. Added a test and implemented that.
In [3]: pd.DataFrame({'a': pd.date_range("2012", periods=3), 'b': [1, 2, 3]}).describe(datetime_is_numeric=True)
Out[3]:
                         a    b
count                    3  3.0
mean   2012-01-02 00:00:00  2.0
min    2012-01-01 00:00:00  1.0
25%    2012-01-01 12:00:00  1.5
50%    2012-01-02 00:00:00  2.0
75%    2012-01-02 12:00:00  2.5
max    2012-01-03 00:00:00  3.0
std                    NaN  1.0 |
hopefully fixed the conflict correctly. merging on green. |
Thanks. All green. |
Err, wait, the whatsnew doesn't look right. Will fix. |
cool |
Adds the new behavior as a feature flag / deprecation.
Closes #33903
(Do we have a list of issues for deprecations introduced in 1.x?)