Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: revert behaviour change for concat with empty/all-NaN data #47372

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jun 15, 2022

Closes #45637
Closes #47284

Initially I tried it with a more selective revert of only the changes in #43577 and #43507, but then I ran into some other failures in existing tests. So in the end tried it with reverting all subsequent clean-up PRs as well. I assume quite some of those changes could be re-applied after this, but for now just ensuring the tests are passing.

@jorisvandenbossche jorisvandenbossche added this to the 1.4.3 milestone Jun 15, 2022
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review June 17, 2022 13:51
@jbrockmendel
Copy link
Member

The change we agreed upon reverting was for empty-object-dtype, not all-NA

@simonjayhawkins
Copy link
Member

The change we agreed upon reverting was for empty-object-dtype, not all-NA

does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one)

@simonjayhawkins simonjayhawkins added Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 17, 2022
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 17, 2022

The change we agreed upon reverting was for empty-object-dtype, not all-NA

In my head, the discussion has always been about both. While the reported issue itself was with an example using an empty dataframe, the discussion in #45637 has been about both cases (quoting a question from you: "would you only ignore the empty/all-NaN cases when they are object/float64, respectively?", on which I answered yes). EDIT: #46922 was also closed as a duplicate of #45637, and was about the all-NA case.
Unfortunately we don't have notes from the meeting in March where we discussed this, but at the time Simon concluded on the issue "We are agreed that we want to revert #43507 for 1.4.x" (#45637 (comment)). And the PR was about both empty and all-NA.

Both the empty and all-NA case are causing regressions, and both can be done with a deprecation instead if we want.

@@ -755,3 +756,49 @@ def test_concat_retain_attrs(data):
df2.attrs = {1: 1}
df = concat([df1, df2])
assert df.attrs[1] == 1


@td.skip_array_manager_invalid_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this invalid for AM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the AM has different behaviour, see last sentence in #40893 (AFAIR a consequence of using concat_compat, which is basically the Series behaviour, which is also known to be inconsistent, xref #39122).
The revert of the original PR also introduced some other related test changes, see eg https://github.com/pandas-dev/pandas/pull/47372/files#diff-50f6426546495aad672032deb56b5b222b35697637f5a7f0353ec8ff33bd8ca5R187. And see also the last part of the comment at #43507 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be array_manager_not_implemented then? or just specify a different expected?

@jbrockmendel
Copy link
Member

does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one)

i believe so, yes

@jbrockmendel
Copy link
Member

Unfortunately we don't have notes from the meeting in March where we discussed this

We specifically discussed object-empty. Performance considerations in avoiding object dtype were the main reason we agreed on allowing this value-dependent behavior.

@simonjayhawkins
Copy link
Member

does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one)

i believe so, yes

so slightly off-topic for the PR here but the discussion in #45637 (comment) there was the suggestion that only the all-NA case with float64 dtype would be special cased which would not catch the pd.NA series with object dtype.

@simonjayhawkins
Copy link
Member

Thanks @jorisvandenbossche lgtm except failing Data Manager test and missing release note (probably needs more than a one-liner and leave out any mention of future behavior until agreed).

Initially I tried it with a more selective revert of only the changes in #43577 and #43507, but then I ran into some other failures in existing tests. So in the end tried it with reverting all subsequent clean-up PRs as well. I assume quite some of those changes could be re-applied after this, but for now just ensuring the tests are passing.

I also tried to be selective in an attempt to do this #45637 (comment) and found it not so straightforward so agree that this is probably the best way of going about this.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jun 17, 2022

We specifically discussed object-empty.

@jbrockmendel then your recollection of it is different than mine. In any case, I think the discussion in #45637 in days leading up to the meeting was about both.

Performance considerations in avoiding object dtype were the main reason we agreed on allowing this value-dependent behavior.

Also the all-NaN case can result in object dtype (for non-numerical dtypes).

the discussion in #45637 (comment) there was the suggestion that only the all-NA case with float64 dtype would be special cased which would not catch the pd.NA series with object dtype.

@simonjayhawkins if you are talking about the third paragraph in that comment, then I think it is the other way around: I mentioned eventually only ignoring object-dtype all-NA. But with emphasis on "eventually": that's what I think might be a good idea in the future (a next major release) if we have better defaults for empty data that no longer uses float64 dtype for this.

@jbrockmendel
Copy link
Member

@jorisvandenbossche regardless of our differing recollections about what we had consensus on, it is clear that now we only have consensus on the empty-object case.

The other thing we agreed on was looking to achieve consistency between DataFrame/Series/Index/EA behavior, which this actively moves away from.

Note also: https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.4.0.rst#ignoring-dtypes-in-concat-with-empty-or-all-na-columns

Please do not self-merge.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Jun 17, 2022

Note also: https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.4.0.rst#ignoring-dtypes-in-concat-with-empty-or-all-na-columns

in the past we have added an update to the previous release note (but only for the current released docs) now we have the version switcher for the docs, people maybe viewing older docs more.

@jorisvandenbossche
Copy link
Member Author

now we have the version switcher for the docs, people maybe viewing older docs more.

I think that should be fine, since we only show a single (the latest) version per release branch in the dropdown. So once 1.4.3 is released, the "1.4" docs would show the docs of that version, and the versions for 1.4.0-1.4.2 are not accessible from the dropdown.

tm.assert_frame_equal(result, expected)


@td.skip_array_manager_invalid_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

tm.assert_frame_equal(result, expected)


@td.skip_array_manager_invalid_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche can you respond here and potentially follow up

@simonjayhawkins
Copy link
Member

@jorisvandenbossche regardless of our differing recollections about what we had consensus on, it is clear that now we only have consensus on the empty-object case.

Both are behavior changes and not strictly bug fixes so as it's a grey area. I think in cases where it is not clearly a bug fix and there is disagreement we should not make the changes without deprecation.

So if we can agree that the status quo should have taken priority, rather than needing consensus, we need to get back to the old behavior and therefore have to revert both at this time?

@jbrockmendel
Copy link
Member

Both are behavior changes and not strictly bug fixes so as it's a grey area. I think in cases where it is not clearly a bug fix and there is disagreement we should not make the changes without deprecation.

@jreback and I agreed at the time to consider it a bugfix. im not wild about the precedent of second-guessing that, especially after the release.

That said, this isn't a hill worth dying on. Go ahead and we can re-fix it for 2.0.

@simonjayhawkins
Copy link
Member

@jreback and I agreed at the time to consider it a bugfix. im not wild about the precedent of second-guessing that, especially after the release.

I understand but i'm sure I've seen @jorisvandenbossche mention somewhere that the behavior was intentional. So that implies to me it was not a bug.

@jbrockmendel
Copy link
Member

AFAICT it was "intentional" in that it was needed to make DataFrame.append work bc DataFrame.append used to do a .reindex that was made unnecessary. But again, not a hill.

@simonjayhawkins simonjayhawkins mentioned this pull request Jun 21, 2022
@simonjayhawkins
Copy link
Member

This passed locally when resolving the conflicts. (another merge of main since then). So should be good to merge if green(ish) (typing failures are on other PRs)

@simonjayhawkins
Copy link
Member

The other failing tests were failing when #47377 was merged. so to make sure we don't miss anything will merge main once #47462 is merged.

@simonjayhawkins simonjayhawkins merged commit d43d6e2 into pandas-dev:main Jun 22, 2022
@lumberbot-app

This comment was marked as resolved.

@simonjayhawkins
Copy link
Member

Thanks @jorisvandenbossche

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Jun 22, 2022
simonjayhawkins added a commit that referenced this pull request Jun 22, 2022
… concat with empty/all-NaN data) (#47472)

Backport PR #47372: REGR: revert behaviour change for concat with empty/all-NaN data

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche deleted the regr-concat-empty-2 branch December 29, 2022 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
3 participants