REGR: revert behaviour change for concat with empty/all-NaN data #47372

jorisvandenbossche · 2022-06-15T18:56:28Z

Initially I tried a more selective revert of only the changes in #43577 and #43507, but then I ran into some other failures in existing tests. So in the end I reverted all subsequent clean-up PRs as well. I assume quite a few of those changes could be re-applied after this, but for now I am just ensuring the tests are passing.

This reverts commit bb9a985.

This reverts commit 0de6f8b.

This reverts commit 7036de3.

This reverts commit 4bb4b52.

This reverts commit eb643d7.

This reverts commit 95eb153.

…ns (pandas-dev#43507)" This reverts commit 084c543.

jbrockmendel · 2022-06-17T15:08:50Z

The change we agreed upon reverting was for empty-object-dtype, not all-NA

simonjayhawkins · 2022-06-17T15:23:39Z

> The change we agreed upon reverting was for empty-object-dtype, not all-NA

does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one)

jorisvandenbossche · 2022-06-17T15:27:45Z

> The change we agreed upon reverting was for empty-object-dtype, not all-NA

In my head, the discussion has always been about both. While the reported issue itself used an example with an empty dataframe, the discussion in #45637 has been about both cases (quoting a question from you: "would you only ignore the empty/all-NaN cases when they are object/float64, respectively?", to which I answered yes). EDIT: #46922 was also closed as a duplicate of #45637, and was about the all-NA case.
Unfortunately we don't have notes from the meeting in March where we discussed this, but at the time Simon concluded on the issue "We are agreed that we want to revert #43507 for 1.4.x" (#45637 (comment)). And that PR was about both the empty and the all-NA case.

Both the empty and the all-NA case are causing regressions, and both changes can be made through a deprecation instead if we want.
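For readers who want to see which behaviour their installed pandas has, a minimal probe of the two cases might look like the sketch below. The frame and column names are made up for illustration and are not taken from the tests in this PR. No particular result dtype is claimed, since that is precisely what differs between pandas versions.

```python
import warnings

import numpy as np
import pandas as pd

# Illustrative inputs (hypothetical names, not from the pandas test suite).
df_dates = pd.DataFrame({"a": pd.to_datetime(["2022-06-15"])})
df_empty = pd.DataFrame({"a": pd.Series([], dtype=object)})  # empty case
df_all_na = pd.DataFrame({"a": pd.Series([np.nan], dtype="float64")})  # all-NA case

with warnings.catch_warnings():
    # Some pandas versions emit a FutureWarning for these inputs.
    warnings.simplefilter("ignore", FutureWarning)
    with_empty = pd.concat([df_dates, df_empty], ignore_index=True)
    with_all_na = pd.concat([df_dates, df_all_na], ignore_index=True)

# The resulting dtype is the behaviour under discussion, so only report it.
print("empty case: ", with_empty["a"].dtype)
print("all-NA case:", with_all_na["a"].dtype)
```

If the empty or all-NA piece is ignored when resolving the dtype, the column stays datetime-like; if it is taken into account, the column is upcast to object.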

jbrockmendel · 2022-06-17T16:00:29Z

pandas/tests/reshape/concat/test_concat.py

@@ -755,3 +756,49 @@ def test_concat_retain_attrs(data):
    df2.attrs = {1: 1}
    df = concat([df1, df2])
    assert df.attrs[1] == 1
+
+
+@td.skip_array_manager_invalid_test


Why is this invalid for AM?

Because the AM has different behaviour, see the last sentence in #40893 (AFAIR a consequence of using concat_compat, which is basically the Series behaviour, which is also known to be inconsistent, xref #39122).
The revert of the original PR also introduced some other related test changes, see e.g. https://github.com/pandas-dev/pandas/pull/47372/files#diff-50f6426546495aad672032deb56b5b222b35697637f5a7f0353ec8ff33bd8ca5R187. See also the last part of the comment at #43507 (comment).

Should it be array_manager_not_implemented then? Or just specify a different expected?

jbrockmendel · 2022-06-17T16:02:56Z

> does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one)

i believe so, yes

jbrockmendel · 2022-06-17T16:05:48Z

> Unfortunately we don't have notes from the meeting in March where we discussed this

We specifically discussed object-empty. Performance considerations in avoiding object dtype were the main reason we agreed on allowing this value-dependent behavior.
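To make "value-dependent" concrete, here is a toy model of the two policies. It is a sketch only, not the pandas implementation: `is_empty_or_all_na` and `candidate_dtype` are hypothetical helpers, and `np.result_type` stands in for pandas' own dtype resolution, so it only covers plain NumPy dtypes.

```python
import numpy as np
import pandas as pd


def is_empty_or_all_na(values: pd.Series) -> bool:
    """Hypothetical helper: True for a zero-length or entirely-missing piece."""
    return len(values) == 0 or bool(values.isna().all())


def candidate_dtype(pieces, ignore_empty_or_all_na: bool) -> np.dtype:
    """Toy dtype resolution for the pieces of one column (not pandas' code)."""
    dtypes = [
        piece.dtype
        for piece in pieces
        if not (ignore_empty_or_all_na and is_empty_or_all_na(piece))
    ]
    if not dtypes:
        # Every piece was ignored: fall back to considering all of them.
        dtypes = [piece.dtype for piece in pieces]
    return np.result_type(*dtypes)


dates = pd.Series(pd.to_datetime(["2022-06-15"]))
empty_obj = pd.Series([], dtype=object)
ints = pd.Series([1, 2], dtype="int64")
all_nan = pd.Series([np.nan, np.nan], dtype="float64")

# Value-dependent policy: the empty / all-NaN piece does not affect the dtype.
print(candidate_dtype([dates, empty_obj], ignore_empty_or_all_na=True))
print(candidate_dtype([ints, all_nan], ignore_empty_or_all_na=True))

# Value-independent policy: only the dtypes matter, so the result is upcast.
print(candidate_dtype([dates, empty_obj], ignore_empty_or_all_na=False))
print(candidate_dtype([ints, all_nan], ignore_empty_or_all_na=False))
```

Under the first policy the result depends on the values in each piece (whether it happens to be empty or all-NaN), which is the trade-off being weighed against ending up with object dtype.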

simonjayhawkins · 2022-06-17T16:08:28Z

> > does #47284 fall under the all-NA case? (you have proposed an alternative fix for that one)
>
> i believe so, yes

So, slightly off-topic for the PR here, but in the discussion in #45637 (comment) there was the suggestion that only the all-NA case with float64 dtype would be special-cased, which would not catch the pd.NA series with object dtype.

simonjayhawkins · 2022-06-17T16:23:19Z

Thanks @jorisvandenbossche, LGTM except for the failing Data Manager test and the missing release note (it probably needs more than a one-liner and should leave out any mention of future behavior until agreed).

> Initially I tried a more selective revert of only the changes in #43577 and #43507, but then I ran into some other failures in existing tests. So in the end I reverted all subsequent clean-up PRs as well. I assume quite a few of those changes could be re-applied after this, but for now I am just ensuring the tests are passing.

I also tried to be selective in an attempt to do this (#45637 (comment)) and found it not so straightforward, so I agree that this is probably the best way to go about it.

jorisvandenbossche · 2022-06-17T16:26:48Z

> We specifically discussed object-empty.

@jbrockmendel then your recollection of it is different from mine. In any case, I think the discussion in #45637 in the days leading up to the meeting was about both.

> Performance considerations in avoiding object dtype were the main reason we agreed on allowing this value-dependent behavior.

Also the all-NaN case can result in object dtype (for non-numerical dtypes).

> in the discussion in #45637 (comment) there was the suggestion that only the all-NA case with float64 dtype would be special-cased, which would not catch the pd.NA series with object dtype.

@simonjayhawkins if you are talking about the third paragraph in that comment, then I think it is the other way around: I mentioned eventually only ignoring object-dtype all-NA. But with emphasis on "eventually": that's what I think might be a good idea in the future (a next major release) if we have better defaults for empty data that no longer use float64 dtype for this.
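The difference between the two readings (special-casing all-NA float64 versus all-NA object) can be sketched with a hypothetical predicate. `is_ignorable` and its `special_cased_dtypes` parameter are invented here for illustration and do not exist in pandas.

```python
import numpy as np
import pandas as pd


def is_ignorable(values: pd.Series, special_cased_dtypes) -> bool:
    """Hypothetical rule: ignore a piece for dtype resolution only if it is
    non-empty, entirely missing, and has one of the special-cased dtypes."""
    all_na = len(values) > 0 and bool(values.isna().all())
    return all_na and any(values.dtype == dt for dt in special_cased_dtypes)


float_nans = pd.Series([np.nan, np.nan], dtype="float64")
object_nas = pd.Series([pd.NA, pd.NA], dtype=object)

only_float64 = [np.dtype("float64")]
only_object = [np.dtype(object)]

# Special-casing only float64 catches the NaN series but not the pd.NA one.
print(is_ignorable(float_nans, only_float64))  # True
print(is_ignorable(object_nas, only_float64))  # False

# Special-casing only object dtype flips both answers.
print(is_ignorable(float_nans, only_object))  # False
print(is_ignorable(object_nas, only_object))  # True
```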

jbrockmendel · 2022-06-17T16:39:09Z

@jorisvandenbossche regardless of our differing recollections about what we had consensus on, it is clear that now we only have consensus on the empty-object case.

The other thing we agreed on was looking to achieve consistency between DataFrame/Series/Index/EA behavior, which this actively moves away from.

Note also: https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.4.0.rst#ignoring-dtypes-in-concat-with-empty-or-all-na-columns

Please do not self-merge.

simonjayhawkins · 2022-06-17T16:49:35Z

> Note also: https://github.com/pandas-dev/pandas/blob/main/doc/source/whatsnew/v1.4.0.rst#ignoring-dtypes-in-concat-with-empty-or-all-na-columns

In the past we have added an update to the previous release note (but only for the currently released docs). Now that we have the version switcher for the docs, people may be viewing older docs more.

jorisvandenbossche · 2022-06-17T16:55:27Z

> now we have the version switcher for the docs, people may be viewing older docs more.

I think that should be fine, since we only show a single (the latest) version per release branch in the dropdown. So once 1.4.3 is released, the "1.4" docs would show the docs of that version, and the versions for 1.4.0-1.4.2 are not accessible from the dropdown.

jbrockmendel · 2022-06-17T17:01:44Z

pandas/tests/reshape/concat/test_concat.py

+    tm.assert_frame_equal(result, expected)
+
+
+@td.skip_array_manager_invalid_test


jbrockmendel · 2022-06-17T17:01:56Z

pandas/tests/reshape/concat/test_concat.py

+    tm.assert_frame_equal(result, expected)
+
+
+@td.skip_array_manager_invalid_test


@jorisvandenbossche can you respond here and potentially follow up

simonjayhawkins · 2022-06-20T13:07:50Z

> @jorisvandenbossche regardless of our differing recollections about what we had consensus on, it is clear that now we only have consensus on the empty-object case.

Both are behavior changes and not strictly bug fixes, so it's a grey area. I think in cases where it is not clearly a bug fix and there is disagreement, we should not make the changes without deprecation.

So if we can agree that the status quo should have taken priority, rather than needing consensus, we need to get back to the old behavior and therefore have to revert both at this time?

jbrockmendel · 2022-06-20T13:51:58Z

> Both are behavior changes and not strictly bug fixes, so it's a grey area. I think in cases where it is not clearly a bug fix and there is disagreement, we should not make the changes without deprecation.

@jreback and I agreed at the time to consider it a bugfix. I'm not wild about the precedent of second-guessing that, especially after the release.

That said, this isn't a hill worth dying on. Go ahead and we can re-fix it for 2.0.

simonjayhawkins · 2022-06-20T14:02:28Z

> @jreback and I agreed at the time to consider it a bugfix. I'm not wild about the precedent of second-guessing that, especially after the release.

I understand, but I'm sure I've seen @jorisvandenbossche mention somewhere that the behavior was intentional. To me that implies it was not a bug.

jbrockmendel · 2022-06-20T14:55:06Z

AFAICT it was "intentional" in that it was needed to make DataFrame.append work, because DataFrame.append used to do a .reindex that was made unnecessary. But again, not a hill.

simonjayhawkins · 2022-06-22T13:06:09Z

This passed locally when resolving the conflicts (there has been another merge of main since then), so it should be good to merge if green(ish). The typing failures also occur on other PRs.

simonjayhawkins · 2022-06-22T13:43:08Z

The other failing tests were already failing when #47377 was merged, so to make sure we don't miss anything I will merge main once #47462 is merged.

simonjayhawkins · 2022-06-22T20:41:40Z

Thanks @jorisvandenbossche

… concat with empty/all-NaN data) (#47472) Backport PR #47372: REGR: revert behaviour change for concat with empty/all-NaN data Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

jorisvandenbossche added 8 commits June 15, 2022 20:03

Revert "REF: remove JoinUnit.shape (pandas-dev#43651)"

90e966e

This reverts commit bb9a985.

Revert "REF: concat on bm_axis==0 (pandas-dev#43626)"

b0231a6

This reverts commit 0de6f8b.

Revert "REF: pre-compute JoinUnit.needs_filling (pandas-dev#43590)"

e785ff6

This reverts commit 7036de3.

Revert "REF: implement make_na_array (pandas-dev#43606)"

b0fe1f0

This reverts commit 4bb4b52.

Revert "REF: avoid having 0 in JoinUnit.indexers (pandas-dev#43592)"

88fb277

This reverts commit eb643d7.

Revert "CLN: remove unused concat code (pandas-dev#43577)"

978c1eb

This reverts commit 95eb153.

Partial Revert "BUG/API: concat with empty DataFrames or all-NA colum…

cf095e1

…ns (pandas-dev#43507)" This reverts commit 084c543.

add test

f02bdb1

jorisvandenbossche added this to the 1.4.3 milestone Jun 15, 2022

jorisvandenbossche added 3 commits June 15, 2022 23:44

skip new tests for array manager

170931b

fix typing

5ed7dad

Merge remote-tracking branch 'upstream/main' into regr-concat-empty-2

577b329

jorisvandenbossche marked this pull request as ready for review June 17, 2022 13:51

add more tests

1c718bb

jorisvandenbossche force-pushed the regr-concat-empty-2 branch from d481c92 to 1c718bb Compare June 17, 2022 14:46

jorisvandenbossche mentioned this pull request Jun 17, 2022

Revert "REF: remove JoinUnit.shape (#43651)" #47406

Merged

simonjayhawkins added the Indexing, Missing-data, and Reshaping labels Jun 17, 2022

jbrockmendel reviewed Jun 17, 2022

add whatsnew

421ec5d

jbrockmendel reviewed Jun 17, 2022

simonjayhawkins mentioned this pull request Jun 21, 2022

RLS: 1.4.3 #46610

Closed

simonjayhawkins added 2 commits June 22, 2022 12:22

Merge remote-tracking branch 'upstream/main' into regr-concat-empty-2

584c15e

Merge branch 'main' into regr-concat-empty-2

2532a46

Merge branch 'main' into regr-concat-empty-2

33631bb

simonjayhawkins merged commit d43d6e2 into pandas-dev:main Jun 22, 2022

lumberbot-app bot added the Still Needs Manual Backport label Jun 22, 2022

jbrockmendel mentioned this pull request Jun 22, 2022

REGR: assignment of pd.NA with enlargement gives object dtype with IntegerArray #47284

Closed

3 tasks

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Jun 22, 2022

Backport PR pandas-dev#47372: REGR: revert behaviour change for conca…

45a2421

…t with empty/all-NaN data

simonjayhawkins mentioned this pull request Jun 22, 2022

Backport PR #47372 on branch 1.4.x (REGR: revert behaviour change for concat with empty/all-NaN data) #47472

Merged

simonjayhawkins removed the Still Needs Manual Backport label Jun 22, 2022

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

REGR: revert behaviour change for concat with empty/all-NaN data (pan…

a5747fd

…das-dev#47372)

jorisvandenbossche mentioned this pull request Jul 17, 2022

REGR: preserve reindexed array object (instead of creating new array) for concat with all-NA array #47762

Merged

This was referenced Dec 22, 2022

REF: restore _concat_managers_axis0 #50401

Merged

API: concatting of Series/DataFrame - handling (not skipping) of empty objects #39122

Closed

jorisvandenbossche deleted the regr-concat-empty-2 branch December 29, 2022 19:14

pdemarti mentioned this pull request Mar 23, 2023

PERF: concat slow, manual concat through reindexing enhances performance #50652

Closed

3 tasks
