-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Performance improvement on dataframe.update. #47407
Conversation
…opy under the hood and it's not updated in place
@phofl would you mind taking look at this PR |
@@ -8000,7 +8000,7 @@ def update( | |||
if mask.all(): | |||
continue | |||
|
|||
self[col] = expressions.where(mask, this, that) | |||
self.loc[:, col] = expressions.where(mask, this, that) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this writes into the underlying array, this is a change in behavior. Not sure if this is desireable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl Thanks for replying. per docstring under update
that's what it's doing.
Modify in place using non-NA values from another DataFrame. Aligns on indices. There is no return value.
it's just not doing in an efficient manner, since after 1.4.0 the setitem is being used for assignment, this needs to be optimized.
do you mind tagging other core contributor who you may think relavant to this code to take a look at this.
I really think this should be the way doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phofl is right
this is an api change and cannot be back ported
nor is the right way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repeated updates are not idiomatic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thanks for replying, I don't want to just jump in and change the API fundamentally but I want to understand you. the thing I am trying to fix is to speed up the operation. but didn't mean to have other impacts.
to me, df[col] = "a"
vs df.loc[:, col]="a"
doesn't have much difference but the latter is much faster. this is only thing I meant to change. as regards to repeated updates are not idiomatic
, i think this is what the existing code is doing(in a for loop), and since docstring says it is to update it in place, isn't the change just right on its purpose? but the df[col] = "a"
is doing copy under the hood and it also slow it down.
Would like to hear you thoughts.
the setitem under the hood is doing df copy per #46267
therefore, the performance is compromised, per the suggestion, using loc for assignment instead
see the detail example in the issue link for metrics.
Tests added and passed](https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#writing-tests) if fixing a bug or adding a new featureAdded type annotations to new arguments/methods/functions.Added an entry in the latestdoc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.