Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: DataFrame.agg with axis=1, EA dtype, and duplicate index #42449

Merged
merged 3 commits into from
Jul 9, 2021

Conversation

rhshadrach
Copy link
Member

The issue in transpose was a bug that was introduced in 1.0.0, but induced a regression in 1.3.0 due to its use in DataFrame.agg.

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version Apply Apply, Aggregate, Transform, Map labels Jul 8, 2021
@rhshadrach rhshadrach added this to the 1.3.1 milestone Jul 8, 2021
result = self._constructor(
dict(zip(self.index, new_values)), index=self.columns
values.T, index=self.columns, columns=self.index, dtype=dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values.T is going to cast to ndarray, which often means object

i think could use DataFrame.from_arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a MultiIndex.from_arrays, but not DataFrame. Here are some timings:

size = 1000
df = pd.DataFrame({'a': size * [1, 2, 3]}, dtype='category')
df_T = df.T

%timeit df.T
%timeit df_T.T

PR:

249 ms ± 6.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
92.3 ms ± 6.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Master:

237 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
89.7 ms ± 6.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so certainly a slight regression. Will continue looking to see if avoiding object is possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, found it - _from_arrays.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New timing with the update

234 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
85 ms ± 4.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@jreback jreback merged commit 3fb6d21 into pandas-dev:master Jul 9, 2021
@jreback
Copy link
Contributor

jreback commented Jul 9, 2021

thanks @rhshadrach

@jreback
Copy link
Contributor

jreback commented Jul 9, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Jul 9, 2021

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Jul 9, 2021
…plicate index (#42460)

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
@rhshadrach rhshadrach deleted the transpose_regression branch July 9, 2021 17:12
@simonjayhawkins
Copy link
Member

The issue in transpose was a bug that was introduced in 1.0.0

in pandas 0.25.3 the returned DataFrame had object dtype.

>>> pd.__version__
'0.25.3'
>>> 
>>> df = pd.DataFrame(
...     {"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"), dtype="category"
... )
>>> 
>>> df.T.dtypes
a    object
a    object
b    object
b    object
c    object
dtype: object
>>> 

on master, we now have value dependent behaviour

>>> pd.__version__
'1.4.0.dev0+193.g87d785599e'
>>> 
>>> df = pd.DataFrame(
...     {"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"), dtype="category"
... )
>>> 
>>> df.T.dtypes
a    category
a    category
b    category
b    category
c    category
dtype: object
>>> 
>>> df = pd.DataFrame(
...     {"a": list("abcde"), "b": list("fghij")}, index=list("aabbc"), dtype="category"
... )
>>> 
>>> df.T.dtypes
a    object
a    object
b    object
b    object
c    object
dtype: object
>>> 

@rhshadrach
Copy link
Member Author

@simonjayhawkins I think the column dtypes, not values, are what determines the resulting dtype. As such, I would not describe this as value-dependent behavior. The result looks like the correct one to me, but can open an issue if you would like this discussed further.

@simonjayhawkins
Copy link
Member

The result looks like the correct one to me

The result is now certainly consistent with the result for a unique index. (which also changed behavior between 0.25.3 and 1.0.0)

but can open an issue if you would like this discussed further.

this has been backported for a patch release. and we could release anytime dependent on number and severity of reported regression.

My preference is to not having an open issue, but to keep the 1.3.x ready for release.

Any reason not just to revert #40428 instead?

@rhshadrach
Copy link
Member Author

this has been backported for a patch release. and we could release anytime dependent on number and severity of reported regression.

My preference is to not having an open issue, but to keep the 1.3.x ready for release.

Are you saying we should not fix the transpose regression from 0.25.3 -> 1.0.0 in 1.3.1? If that's the case, then I would support reverting #40428, assuming it does fix the issue in agg. However, the code prior to #40428 used transpose as well, so I don't understand how it could fix it (but might be missing something).

@simonjayhawkins
Copy link
Member

Are you saying we should not fix the transpose regression from 0.25.3 -> 1.0.0 in 1.3.1?

I am not saying that. I basically just wanted clarification of why #40428 was not reverted. I assume that because the root cause was determined as the change in #30091, that was fixed instead?

However, that change does not revert to the behavior in 0.25.3 because of the dtype change. (but that is also reasonable as the new behavior matches that of a unique index)

If that's the case, then I would support reverting #40428, assuming it does fix the issue in agg.

it is ok to fix prior regressions or bug fixes in patch releases, so that alone is not a reason to prefer the revert option.

However, the code prior to #40428 used transpose as well, so I don't understand how it could fix it (but might be missing something).

that is odd.

@rhshadrach
Copy link
Member Author

Thanks @simonjayhawkins, your assessment is correct - the root cause of the agg regression was the transpose regression. Both are tested in this PR.

However, that change does not revert to the behavior in 0.25.3 because of the dtype change. (but that is also reasonable as the new behavior matches that of a unique index)

The behavior in 0.25.3 is undesirable. When a DataFrame has a single EA dtype, 0.25.3's transpose would convert it to object whereas it should remain the EA dtype. #30091 fixed this, but at the expense of silently dropping data when the index has duplicates. This PR maintains the desired behavior of #30091, but in a way that does not drop data when the index has duplicates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: 1.3.0 DataFrame.agg over categorical columns with non-unique index returns wrong size result
4 participants