Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Performance improvement on dataframe.update. #47407

Closed

Conversation

warm200
Copy link

@warm200 warm200 commented Jun 17, 2022

the setitem under the hood is doing df copy per #46267
therefore, the performance is compromised, per the suggestion, using loc for assignment instead
see the detail example in the issue link for metrics.

Tests added and passed](https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#writing-tests) if fixing a bug or adding a new feature

Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

…opy under the hood and it's not updated in place
@warm200
Copy link
Author

warm200 commented Jun 17, 2022

@phofl would you mind taking look at this PR

@@ -8000,7 +8000,7 @@ def update(
if mask.all():
continue

self[col] = expressions.where(mask, this, that)
self.loc[:, col] = expressions.where(mask, this, that)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this writes into the underlying array, this is a change in behavior. Not sure if this is desireable.

Copy link
Author

@warm200 warm200 Jun 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl Thanks for replying. per docstring under update that's what it's doing.
Modify in place using non-NA values from another DataFrame. Aligns on indices. There is no return value.
it's just not doing in an efficient manner, since after 1.4.0 the setitem is being used for assignment, this needs to be optimized.
do you mind tagging other core contributor who you may think relavant to this code to take a look at this.
I really think this should be the way doing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl is right
this is an api change and cannot be back ported
nor is the right way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeated updates are not idiomatic

Copy link
Author

@warm200 warm200 Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for replying, I don't want to just jump in and change the API fundamentally but I want to understand you. the thing I am trying to fix is to speed up the operation. but didn't mean to have other impacts.
to me, df[col] = "a" vs df.loc[:, col]="a" doesn't have much difference but the latter is much faster. this is only thing I meant to change. as regards to repeated updates are not idiomatic, i think this is what the existing code is doing(in a for loop), and since docstring says it is to update it in place, isn't the change just right on its purpose? but the df[col] = "a" is doing copy under the hood and it also slow it down.

Would like to hear you thoughts.

@simonjayhawkins
Copy link
Member

@warm200 Thanks for the PR. #47327 has been merged which includes the change in this PR so closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: DataFrame.update of pandas version is slower than the older version
4 participants