-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: revert behaviour change for concat with empty/all-NaN data #47372
REGR: revert behaviour change for concat with empty/all-NaN data #47372
Conversation
d481c92
to
1c718bb
Compare
The change we agreed upon reverting was for empty-object-dtype, not all-NA |
does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one) |
In my head, the discussion has always been about both. While the reported issue itself was with an example using an empty dataframe, the discussion in #45637 has been about both cases (quoting a question from you: "would you only ignore the empty/all-NaN cases when they are object/float64, respectively?", on which I answered yes). EDIT: #46922 was also closed as a duplicate of #45637, and was about the all-NA case. Both the empty and all-NA case are causing regressions, and both can be done with a deprecation instead if we want. |
@@ -755,3 +756,49 @@ def test_concat_retain_attrs(data): | |||
df2.attrs = {1: 1} | |||
df = concat([df1, df2]) | |||
assert df.attrs[1] == 1 | |||
|
|||
|
|||
@td.skip_array_manager_invalid_test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this invalid for AM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the AM has different behaviour, see last sentence in #40893 (AFAIR a consequence of using concat_compat
, which is basically the Series behaviour, which is also known to be inconsistent, xref #39122).
The revert of the original PR also introduced some other related test changes, see eg https://github.com/pandas-dev/pandas/pull/47372/files#diff-50f6426546495aad672032deb56b5b222b35697637f5a7f0353ec8ff33bd8ca5R187. And see also the last part of the comment at #43507 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be array_manager_not_implemented then? or just specify a different expected
?
i believe so, yes |
We specifically discussed object-empty. Performance considerations in avoiding object dtype were the main reason we agreed on allowing this value-dependent behavior. |
so slightly off-topic for the PR here but the discussion in #45637 (comment) there was the suggestion that only the all-NA case with float64 dtype would be special cased which would not catch the pd.NA series with object dtype. |
Thanks @jorisvandenbossche lgtm except failing Data Manager test and missing release note (probably needs more than a one-liner and leave out any mention of future behavior until agreed).
I also tried to be selective in an attempt to do this #45637 (comment) and found it not so straightforward so agree that this is probably the best way of going about this. |
@jbrockmendel then your recollection of it is different than mine. In any case, I think the discussion in #45637 in days leading up to the meeting was about both.
Also the all-NaN case can result in object dtype (for non-numerical dtypes).
@simonjayhawkins if you are talking about the third paragraph in that comment, then I think it is the other way around: I mentioned eventually only ignoring object-dtype all-NA. But with emphasis on "eventually": that's what I think might be a good idea in the future (a next major release) if we have better defaults for empty data that no longer uses float64 dtype for this. |
@jorisvandenbossche regardless of our differing recollections about what we had consensus on, it is clear that now we only have consensus on the empty-object case. The other thing we agreed on was looking to achieve consistency between DataFrame/Series/Index/EA behavior, which this actively moves away from. Please do not self-merge. |
in the past we have added an update to the previous release note (but only for the current released docs) now we have the version switcher for the docs, people maybe viewing older docs more. |
I think that should be fine, since we only show a single (the latest) version per release branch in the dropdown. So once 1.4.3 is released, the "1.4" docs would show the docs of that version, and the versions for 1.4.0-1.4.2 are not accessible from the dropdown. |
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
@td.skip_array_manager_invalid_test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here?
tm.assert_frame_equal(result, expected) | ||
|
||
|
||
@td.skip_array_manager_invalid_test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche can you respond here and potentially follow up
Both are behavior changes and not strictly bug fixes so as it's a grey area. I think in cases where it is not clearly a bug fix and there is disagreement we should not make the changes without deprecation. So if we can agree that the status quo should have taken priority, rather than needing consensus, we need to get back to the old behavior and therefore have to revert both at this time? |
@jreback and I agreed at the time to consider it a bugfix. im not wild about the precedent of second-guessing that, especially after the release. That said, this isn't a hill worth dying on. Go ahead and we can re-fix it for 2.0. |
I understand but i'm sure I've seen @jorisvandenbossche mention somewhere that the behavior was intentional. So that implies to me it was not a bug. |
AFAICT it was "intentional" in that it was needed to make DataFrame.append work bc DataFrame.append used to do a .reindex that was made unnecessary. But again, not a hill. |
This passed locally when resolving the conflicts. (another merge of main since then). So should be good to merge if green(ish) (typing failures are on other PRs) |
This comment was marked as resolved.
This comment was marked as resolved.
Thanks @jorisvandenbossche |
…t with empty/all-NaN data
Closes #45637
Closes #47284
Initially I tried it with a more selective revert of only the changes in #43577 and #43507, but then I ran into some other failures in existing tests. So in the end tried it with reverting all subsequent clean-up PRs as well. I assume quite some of those changes could be re-applied after this, but for now just ensuring the tests are passing.