Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not cast .transform() output back to input dtype (closes #10972) #15256

Closed
wants to merge 1 commit into from
Closed

Do not cast .transform() output back to input dtype (closes #10972) #15256

wants to merge 1 commit into from

Conversation

nbonnotte
Copy link
Contributor

@nbonnotte nbonnotte commented Jan 29, 2017

Maybe need more tests, and check performance?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs perf (just run the asv for groupby) and see

@nbonnotte
Copy link
Contributor Author

Well, each time I run asv, I get different results, but after closing everything and leaving my computer alone, here is what I get:

   before     after       ratio
  [3853fe6d] [2bc47491]
+   19.84ms    22.47ms      1.13  groupby.GroupBySuite.time_diff('float', 100)
+   12.45ms    13.92ms      1.12  groupby.FirstLast.time_groupby_nth_none('float64')
-    1.97ms     1.78ms      0.90  groupby.GroupBySuite.time_size('int', 10000)
-   27.05ms    24.28ms      0.90  groupby.GroupBySuite.time_rank('float', 100)
-   19.79ms    17.21ms      0.87  groupby.FirstLast.time_groupby_nth_none('object')
-   18.12ms    15.64ms      0.86  groupby.FirstLast.time_groupby_nth_none('datetime')
-   10.96ms     9.37ms      0.86  groupby.FirstLast.time_groupby_last('datetime')

@jreback
Copy link
Contributor

jreback commented Jan 29, 2017

those are not the transform asv

@nbonnotte
Copy link
Contributor Author

I run asv continuous -f 1.1 -E virtualenv upstream/master HEAD -b ^groupby, but it's the first time I do perf, so I may have missed something :)

Is there something else I should do for this PR?

@nbonnotte
Copy link
Contributor Author

I think Appveyor is complaining about int being int32 on windows instead of int64. Is there a way to cast them to int64 instead?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change any tests; you have to maintain the existing unless their is a really good reason

tests breaking generally means something is wrong in your code

i suspect your changes are not propagating dtypes correctly

add the tests from the issue

@@ -1805,7 +1805,7 @@ def test_resample_median_bug_1688(self):
datetime(2012, 1, 1, 0, 5, 0)],
dtype=dtype)

result = df.resample("T").apply(lambda x: x.mean())
result = df.resample("T").mean()
exp = df.asfreq('T')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why r u changing tests like this?

@@ -5669,7 +5668,7 @@ def test_ops_general(self):
labels = np.random.randint(0, 50, size=1000).astype(float)

for op, targop in ops:
result = getattr(df.groupby(labels), op)().astype(float)
result = getattr(df.groupby(labels), op)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change tests

your changes are not robust

@codecov-io
Copy link

codecov-io commented Jan 29, 2017

Codecov Report

Merging #15256 into master will not impact coverage.

@@           Coverage Diff           @@
##           master   #15256   +/-   ##
=======================================
  Coverage   86.33%   86.33%           
=======================================
  Files         139      139           
  Lines       51149    51149           
=======================================
  Hits        44157    44157           
  Misses       6992     6992
Impacted Files Coverage Δ
pandas/core/groupby.py 95.15% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c26e5bb...ec52ab4. Read the comment docs.

@nbonnotte
Copy link
Contributor Author

@jreback Hum, this is more complex than I thought

The initial problem is: .transform cast output dtypes back to input dtype, so .size for instance produce timestamps if the initial data was timestamps

If I keep the output type as it is, then Series.mean, for instance, will produce float. So if the original data is float32, performing .groupby().apply(lambda g: g.mean()) might thus result in float64 output, but .groupby().mean() will stay in float32.

Am I missing something obvious? How to proceed from there?

@jreback
Copy link
Contributor

jreback commented Jan 29, 2017

you need to selectively cast

all lost every function should be cast except for things that produce ints (like size)

@nbonnotte
Copy link
Contributor Author

So .groupby().agg(lambda : float(len(x))) should keep producing timestamps if input is timestamps?

(That's how I stumbled on this issue when working on my PR #12607)

@jreback
Copy link
Contributor

jreback commented Jan 29, 2017

@nbonnotte of course not, that's the point of this issue!

you simply need to infer based on the return values, not automatically cast it.

you might be able to simply stick it into a Series and it will work, BUT, you will still need to cast if its a numeric type (IOW a float) to the original type flavor.

my points above is that you need to really look at the tests create code that DOESN't change anything except for specific tests (e.g. the test from the actual issue).

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Jan 29, 2017
@nbonnotte
Copy link
Contributor Author

Ok. I'm going to take it slow, starting with datetime input

@jreback
Copy link
Contributor

jreback commented Mar 2, 2017

this was fixed by 251826f

@jreback jreback closed this Mar 2, 2017
@jreback jreback added this to the 0.20.0 milestone Mar 2, 2017
@nbonnotte
Copy link
Contributor Author

It looks so simple once you see the solution... Thanks @jreback for your feedback and your patience, sorry I couldn't get that through

@nbonnotte nbonnotte deleted the 10972-wrong-cast branch March 3, 2017 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: output of a transform is cast to dtype of input
3 participants