ENH: Support mask in groupby cumprod #48138

phofl · 2022-08-18T21:14:43Z

closes ENH: support masked arrays in groupby cython algos #37493
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This raises a general issue. If the result overflows int64, we get garbage. Previously we worked in float64, which gave us back numbers, but they were incorrect. Now we keep full precision as long as the values fit into int64, which was not the case before because we cast to float64 beforehand. IMO this is more important.
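To make the precision side of this trade-off concrete, here is a small pure-Python sketch (not code from the PR; the chosen values are illustrative). float64 has a 53-bit significand, so many integers that fit comfortably into int64 do not survive a cast to float.

```python
# float64 has a 53-bit significand, so not every int64 value survives a
# round trip through float. The values below are illustrative only.
INT64_MAX = 2**63 - 1

small = 2**53 + 1   # smallest positive integer float64 cannot represent
big = 3**39         # an odd 62-bit integer, still below INT64_MAX

for exact in (small, big):
    assert exact <= INT64_MAX        # would fit into int64 unchanged
    roundtrip = int(float(exact))    # what a cast to float64 preserves
    print(exact, roundtrip, exact == roundtrip)
```

Both comparisons print False: the integer fits into int64, yet the float64 copy is a different number.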

cc @jorisvandenbossche

jorisvandenbossche · 2022-09-02T20:29:31Z

This raises a general issue. If the result overflows int64, we get garbage. Previously we worked in float64, which gave us back numbers, but they were incorrect. Now we keep full precision as long as the values fit into int64, which was not the case before because we cast to float64 beforehand. IMO this is more important.

Comparing to the plain (non-grouped) sum/prod, those currently also overflow:

In [35]: pd.Series([int(1e16)]*100).sum()
Out[35]: 1000000000000000000

In [36]: pd.Series([int(1e16)]*1000).sum()
Out[36]: -8446744073709551616

In [40]: pd.Series([2]*62).prod()
Out[40]: 4611686018427387904

In [41]: pd.Series([2]*63).prod()
Out[41]: -9223372036854775808

So it seems sensible that the groupby variants follow this as well. In general, we should probably document these constraints and expectations around overflow better (I am not sure whether this is currently documented anywhere).
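The values in the session above can be reproduced with plain modular arithmetic. The following is a sketch that assumes two's-complement wraparound at 64 bits; `wrap_int64` is a hypothetical helper written for illustration, not a pandas or numpy function.

```python
def wrap_int64(value: int) -> int:
    """Reduce an exact Python int to what a two's-complement int64 would hold."""
    return (value + 2**63) % 2**64 - 2**63


# Exact results of the sums and products from the session above,
# computed with Python's unbounded ints and then wrapped to 64 bits.
print(wrap_int64(int(1e16) * 100))    # still fits into int64
print(wrap_int64(int(1e16) * 1000))   # 10**19 exceeds 2**63 - 1
print(wrap_int64(2**62))              # still fits into int64
print(wrap_int64(2**63))              # one past the int64 maximum
```

The wrapped values match Out[35], Out[36], Out[40] and Out[41], which is why the overflowed results look like arbitrary negative numbers rather than raising an error.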

jorisvandenbossche · 2022-09-02T20:30:12Z

doc/source/whatsnew/v1.6.0.rst

@@ -100,6 +100,7 @@ Deprecations

 Performance improvements
 ~~~~~~~~~~~~~~~~~~~~~~~~
+- Performance improvement in :meth:`.GroupBy.cumprod` for extension array dtypes (:issue:`37493`)


This also now uses int64 instead of float64 for the numpy dtypes? So that also changes behaviour in those cases regarding overflow?

Yes, should we mention this in the whatsnew?

I think so, yes. Maybe as notable bug fix, as it has some behaviour change?

jorisvandenbossche · 2022-09-02T20:31:52Z

pandas/tests/groupby/test_function.py

@@ -641,10 +641,10 @@ def test_groupby_cumprod():
    tm.assert_series_equal(actual, expected)

    df = DataFrame({"key": ["b"] * 100, "value": 2})
+    df["value"] = df["value"].astype(float)


We can maybe keep this as int (or test both), so we have a test for the silent overflow behaviour?

Added a new test explicitly testing that overflow is consistent with numpy
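A test along these lines might look roughly like the sketch below. It is not necessarily the test added in this PR: the function name and the input values are assumptions, and it presumes a pandas version that includes this change (2.0 or later), where the grouped cumulative product stays in int64.

```python
import numpy as np
import pandas as pd


def check_groupby_cumprod_overflow_matches_numpy() -> pd.Series:
    # 100_000 ** 4 == 10 ** 20, which no longer fits into int64.
    df = pd.DataFrame({"key": ["b"] * 4, "value": 100_000})
    actual = df.groupby("key")["value"].cumprod()

    # numpy wraps around silently for int64 arrays; the grouped result
    # is expected to do exactly the same.
    expected = pd.Series(np.cumprod(df["value"].to_numpy()), name="value")
    pd.testing.assert_series_equal(actual, expected)
    return actual


result = check_groupby_cumprod_overflow_matches_numpy()
print(result)
```

Comparing against `np.cumprod` rather than hard-coded numbers keeps the check focused on consistency with numpy, which is the property discussed in this thread.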

jorisvandenbossche · 2022-09-02T20:32:00Z

pandas/tests/groupby/test_function.py

@@ -641,10 +641,10 @@ def test_groupby_cumprod():
    tm.assert_series_equal(actual, expected)

    df = DataFrame({"key": ["b"] * 100, "value": 2})
+    df["value"] = df["value"].astype(float)
    actual = df.groupby("key")["value"].cumprod()
    # if overflows, groupby product casts to float
    # while numpy passes back invalid values


This comment can probably be updated

phofl · 2022-09-12T18:05:52Z

So this is the last one of the groupby algos. We can start refactoring the groupby ops code paths after this is through
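For readers following along, this is roughly what mask support looks like from the user's side. It is a minimal sketch assuming pandas 2.0 or later and the default `skipna=True`; the column names and values are made up for illustration.

```python
import pandas as pd

# A nullable Int64 column stores its missing values in a separate mask.
df = pd.DataFrame(
    {
        "key": ["a", "a", "a", "a", "b", "b"],
        "value": pd.array([1, 2, None, 4, 3, 5], dtype="Int64"),
    }
)

# With the default skipna=True the missing entry stays missing, and the
# running product of each group continues after it.
result = df.groupby("key")["value"].cumprod()
print(result)
```

The result keeps the nullable dtype instead of falling back to float64 with NaN.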

jorisvandenbossche · 2022-09-12T19:23:02Z

doc/source/whatsnew/v1.6.0.rst

+
+In previous versions we cast to float when applying ``cumsum`` and ``cumprod`` which
+lead to incorrect results even if the result could be hold by ``int64`` dtype.
+Additionally, the aggregation overflows consistent with numpy when the limit of


I would maybe mention that it is making it consistent with the DataFrame method as well? (without groupby)

Added a reference to the methods

jorisvandenbossche · 2022-09-12T20:39:39Z

I am still a bit uneasy about this change, since it silently changes the actual results that you get: a previously somewhat correct result (an inexact float) could silently become a completely incorrect one (an overflowed int).
So it would be good to get some input from others.

To what extent would it be possible to split the overflow behaviour change from the mask introduction, so we could for example leave that behaviour change for 2.0? (not sure myself whether this is worth it, just wondering)

phofl · 2022-09-12T21:01:14Z

It is a bit unfortunate, that is true. But we can now preserve precision where possible, which was buggy before, and since the behaviour is aligned with numpy and with the regular DataFrame behaviour, this should be ok imo. In the end it probably does not matter how far off your values are if they are off.

We could cast to float before calling the algos. This would keep the current behaviour but would lose the performance gains and the precision fixes (it would also affect cumsum, which is already merged).

Since we intend to do 2.0 as the next release anyway, would it be ok to merge this and revert to casting to float before passing the array to the cython algos if we do an unexpected 1.6 next?
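The cast discussed here would live inside pandas, but a user who prefers the float behaviour can get a similar effect by casting explicitly before grouping. A minimal sketch, reusing the data from the test touched in this PR:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b"] * 100, "value": 2})

# Explicit cast: the cumulative product is computed in float64, so it grows
# past the int64 range instead of wrapping around.
as_float = df["value"].astype("float64").groupby(df["key"]).cumprod()
print(as_float.dtype, as_float.iloc[-1])
```

Powers of two are exact in float64, so this particular example loses nothing; for general inputs the float result is only approximate once values exceed 2**53.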

jorisvandenbossche · 2022-09-14T20:05:35Z

Since we intend to do 2.0 as the next release anyway, would it be ok to merge this and revert to casting to float before passing the array to the cython algos if we do an unexpected 1.6 next?

Sounds good!

phofl · 2022-09-14T20:33:17Z

Great! Thanks.

@mroeschke Would you mind having a look before merging?

mroeschke · 2022-09-19T23:18:21Z

Thanks @phofl

* ENH: Support mask in groupby cumprod
* Add whatsnew
* Move whatsnew
* Adress review
* Fix example
* Clarify
* Change dtype access

phofl added 2 commits August 18, 2022 23:06

ENH: Support mask in groupby cumprod

5452698

Add whatsnew

af0c539

phofl added Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Aug 18, 2022

phofl added 4 commits August 20, 2022 14:47

Merge remote-tracking branch 'upstream/main' into groupby_cumprod_mask

cd4396d

Merge remote-tracking branch 'upstream/main' into groupby_cumprod_mask

3bfe8c3

# Conflicts: # pandas/core/groupby/ops.py # pandas/tests/groupby/test_groupby.py

Move whatsnew

e931b93

Merge remote-tracking branch 'upstream/main' into groupby_cumprod_mask

c82ed6b

# Conflicts: # doc/source/whatsnew/v1.6.0.rst

jorisvandenbossche reviewed Sep 2, 2022

phofl added 4 commits September 3, 2022 00:10

Adress review

781c678

Fix example

2476651

Merge remote-tracking branch 'upstream/main' into groupby_cumprod_mask

36a2edc

# Conflicts: # doc/source/whatsnew/v1.6.0.rst # pandas/_libs/groupby.pyx

Merge remote-tracking branch 'upstream/main' into groupby_cumprod_mask

39d5858

# Conflicts: # doc/source/whatsnew/v1.6.0.rst

jorisvandenbossche reviewed Sep 12, 2022

Clarify

fdfbf22

mroeschke reviewed Sep 15, 2022: doc/source/whatsnew/v1.6.0.rst (outdated, resolved)

mroeschke reviewed Sep 15, 2022: pandas/_libs/groupby.pyx (outdated, resolved)

phofl added 2 commits September 15, 2022 22:47

Change dtype access

c6fc53c

Merge remote-tracking branch 'upstream/main' into groupby_cumprod_mask

459d225

# Conflicts: # doc/source/whatsnew/v1.6.0.rst

mroeschke added this to the 1.6 milestone Sep 19, 2022

mroeschke approved these changes Sep 19, 2022

mroeschke merged commit f19aeaf into pandas-dev:main Sep 19, 2022

phofl deleted the groupby_cumprod_mask branch September 20, 2022 08:22

jorisvandenbossche mentioned this pull request Sep 26, 2022

ENH: support masked arrays in groupby cython algos #37493

Closed

10 tasks

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

ENH: Support mask in groupby cumprod (pandas-dev#48138)

72e6d44
