Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: fix case all-NaN/numeric object column in groupby #39655

Merged
merged 9 commits into from
Feb 8, 2021

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Feb 7, 2021

Fixes #39329

This basically reverts #35841, although the code itself has changed quite a bit since then, so it's simpler here as the original PR² (not dealing with blocks anymore)

The original code comment that was removed in #35841 "We've split an object block!" was correct, as it can actually happen in those example cases.

@jorisvandenbossche jorisvandenbossche added Groupby Regression Functionality that used to work in a prior pandas version labels Feb 7, 2021
@jorisvandenbossche jorisvandenbossche changed the title Regr groupby object REGR: fix case all-NaN object column in groupby Feb 7, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.2.2 milestone Feb 7, 2021
@jorisvandenbossche jorisvandenbossche changed the title REGR: fix case all-NaN object column in groupby REGR: fix case all-NaN/numeric object column in groupby Feb 7, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will have a look but let's leave for 1.2.3


# unwrap DataFrame to get array
result = mgr.blocks[0].values
return result
mgr = result._mgr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this can you now just define as_array properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of simply consolidate first?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already consolidated above, but if you have multiple dtypes, that doesn't help

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, as the diff seems to be larger (because I remove the assert above and moved the comment), but basically the only code addition is this:

            if len(mgr.blocks) != 1:
                return mgr.as_array()

to handle the case of multiple blocks that need to be converted into an array (while now this raised an AssertionError because of assert len(mgr.blocks) == 1

@jreback jreback modified the milestones: 1.2.2, 1.2.3 Feb 7, 2021
@jorisvandenbossche jorisvandenbossche modified the milestones: 1.2.3, 1.2.2 Feb 7, 2021
@jorisvandenbossche
Copy link
Member Author

This is a quite straightforward revert of #35841, a clean-up PR in which was said "we need to identify a test case in which this doesnt work". And so I have added a few tests that actually do result in multiple bocks in that code path.
Also, you can only end up in the new code for cases that would otherwise have led to an AssertionError, so I think this is safe to include for 1.2.2

@jreback
Copy link
Contributor

jreback commented Feb 7, 2021

we had this discussion before
about late PRs

appreciate them but it's too late
and 1.2.3 is appropriate

@jreback jreback modified the milestones: 1.2.2, 1.2.3 Feb 7, 2021
@jorisvandenbossche jorisvandenbossche modified the milestones: 1.2.3, 1.2.2 Feb 7, 2021
@jorisvandenbossche
Copy link
Member Author

@simonjayhawkins is not releasing today, so there is still a bit of time. Other PRs have been merged today as well. If it's not ready by tomorrow, then fine that it's for 1.2.3, but until then this can be assumed for 1.2.2 IMO (not as a blocker, but as "include if merged by releasing").

@jorisvandenbossche
Copy link
Member Author

cc @jbrockmendel the tests added here are example cases of how you can end up with multiple blocks in this code path. Basically if you have multiple object dtype columns (a single block originally), but some column is entirely numeric, after the aggregation we infer the numeric dtype, and so you can end up with columns of different dtypes (and thus multiple blocks).

dates = pd.date_range("2020-01-01", periods=15, freq="D")
df1 = DataFrame({"key": "A", "date": dates, "col1": range(15), "col_object": "val"})
df2 = DataFrame({"key": "B", "date": dates, "col1": range(15)})
df = pd.concat([df1, df2], ignore_index=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the consolidation status of df matter here? does it matter that df is not constructed directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter in this case. The dataframe that gets resampled (after the first groupby) seems to be consolidated anyway (I assume that creating the sub-dataframe to resample might incur consolidation?). But parametrized the test to cover both cases to be sure.

@jbrockmendel
Copy link
Member

but some column is entirely numeric

lets make sure to add this to the running list fo value-dependent behaviors

@jorisvandenbossche
Copy link
Member Author

Since multiple people reviewed this, I think this can be merged?

(and to repeat: this has no impact whatsoever on code that already worked before)

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback jreback merged commit e58a193 into pandas-dev:master Feb 8, 2021
@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

thanks @jorisvandenbossche

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

@meeseeksdev backport 1.2.x

@lumberbot-app

This comment has been minimized.

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

cc @simonjayhawkins @jorisvandenbossche for manual backport

@jorisvandenbossche jorisvandenbossche deleted the regr-groupby-object branch February 8, 2021 13:45
@jorisvandenbossche
Copy link
Member Author

but some column is entirely numeric

lets make sure to add this to the running list fo value-dependent behaviors

@jbrockmendel I think this is more a case of "we liberally try to infer object dtype" (although of course those are strictly also a kind of value dependent behaviour, but for object dtype we have plenty of that)

@simonjayhawkins
Copy link
Member

@jorisvandenbossche the conflict is

<<<<<<< HEAD
            assert len(result._mgr.blocks) == 1

            # unwrap DataFrame to get array
            result = result._mgr.blocks[0].values
            return result
=======
            mgr = result._mgr
            assert isinstance(mgr, BlockManager)

            # unwrap DataFrame to get array
            if len(mgr.blocks) != 1:
                # We've split an object block! Everything we've assumed
                # about a single block input returning a single block output
                # is a lie. See eg GH-39329
                return mgr.as_array()
            else:
                result = mgr.blocks[0].values
                return result
>>>>>>> e58a193408... REGR: fix case all-NaN/numeric object column in groupby  (#39655)

are you happy to accept the incoming changes here. will also include changes from #36010, specifically the assert isinstance(mgr, BlockManager)

@jorisvandenbossche
Copy link
Member Author

@simonjayhawkins I don't directly see the difference with the diff of this PR, the above seems exactly what I changed, so accepting the incoming changes in that conflict should be fine

@simonjayhawkins
Copy link
Member

I don't directly see the difference with the diff of this PR

should be more visible on #39677, on 1.2.x we don't have the mgr = result._mgr, but I think this is worth having to avoid the duplication, but also comes with an extra assert that was not on 1.2.x

jorisvandenbossche added a commit that referenced this pull request Feb 8, 2021
…roupby (#39677)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: Assertion error in 1.2.1 but not 1.0.5 with groupby, resample, min
4 participants