
BUG: .transform(...) with "first" and "last" fail when axis=1 #46074

Merged: 2 commits into pandas-dev:main on Feb 26, 2022

Conversation

rhshadrach (Member):

Part of #45986 (first/last)

This removes the warnings about dropping nuisance columns when using e.g. .transform("mean"), making transform consistent with .agg; in #46072 both transform and agg will then warn about the default switching from numeric_only=False to numeric_only=True. Doing this first will make #46072 slightly easier.

Not falling back to _transform_item_by_item is also better for performance. For the benchmark below, I disabled the warning emitted on main to make sure it wasn't skewing the results. A small sketch of the resulting nuisance-column behavior follows the benchmark.

import numpy as np
import pandas as pd

size = 1_000_000
df = pd.DataFrame(
    {
        "A": size * ["foo", "bar"],
        "B": "one",
        "C": np.random.randn(2 * size),
        "D": np.random.randn(2 * size),
    }
)
%timeit df.groupby("A").transform("mean")

# This PR
108 ms ± 966 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# main, but with size set to 100_000 (size=1_000_000 was taking far too long)
920 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
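
For illustration, here is a minimal sketch of the behavioral change described above (not code from the PR; the frame and column names are made up). Non-numeric columns are simply excluded from the transform result, with no nuisance-column warning, matching what .agg already does.

import pandas as pd

small = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "num": [1.0, 2.0, 3.0],
        "txt": ["x", "y", "z"],  # non-numeric "nuisance" column
    }
)

# "txt" cannot be averaged, so it is dropped from the result. On main this
# emitted a FutureWarning about dropping nuisance columns; with this PR the
# column is excluded silently, the same as small.groupby("key").agg("mean").
result = small.groupby("key").transform("mean")
print(result)
#    num
# 0  1.5
# 1  1.5
# 2  3.0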

rhshadrach added the Bug, Groupby, Performance, Clean, and Nuisance Columns labels on Feb 19, 2022
@@ -418,45 +413,36 @@ def test_transform_select_columns(df):
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("duplicates", [True, False])
def test_transform_exclude_nuisance(df, duplicates):
rhshadrach (Member, Author):

In order to still hit the warning this was testing for, I needed to switch from a duplicated float column to a duplicated string column, as we no longer warn in the case of the duplicated float column. But using a string column would always fail with SeriesGroupBy, which is why duplicates=False was removed.
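
For concreteness, a rough sketch of that idea (not the actual test in this PR; the frame, the warning class, and the exact assertion are assumptions): duplicate a string column so the frame still contains a nuisance column that must be dropped, keeping the warning reachable for DataFrameGroupBy.

import pandas as pd
import pandas._testing as tm


def test_transform_exclude_nuisance_sketch():
    # Hypothetical frame: "B" is a string column that we duplicate on purpose.
    df = pd.DataFrame(
        {
            "A": ["foo", "bar", "foo", "bar"],
            "B": ["x", "y", "z", "w"],
            "C": [1.0, 2.0, 3.0, 4.0],
        }
    )
    df = pd.concat([df, df["B"]], axis=1)  # columns: A, B, C, B

    gb = df.groupby("A")
    # The duplicated string column still has to be dropped, so the
    # nuisance-column deprecation warning is still emitted here.
    with tm.assert_produces_warning(FutureWarning):
        result = gb.transform("mean")
    expected = df.groupby("A")[["C"]].transform("mean")
    tm.assert_frame_equal(result, expected)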

@jreback jreback added this to the 1.5 milestone Feb 26, 2022
@jreback jreback merged commit e932ec9 into pandas-dev:main Feb 26, 2022
jreback (Contributor) commented Feb 26, 2022:

thanks @rhshadrach

@rhshadrach rhshadrach deleted the transform_wrap_results branch February 27, 2022 15:30
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022