Fix/pandas Performance Warning Issue due to multiple frame.insert
#4246
Columns are repeating. Will close for now.
Issue fixed.
Tested in Python 3.11.3 on Windows.
@nicolaskruchten CI is green. What do you think about merging it?
@@ -1064,14 +1062,14 @@ def _escape_col_name(df_input, col_name, extra):
     return col_name


-def to_unindexed_series(x):
+def to_unindexed_series(x, name=None):
I'm curious why this change is required?
Originally, it created the series without a name. I defaulted `name` to None so that any external callers of this function keep working.
When we create a dataframe from a dict, it's safer to have named series. It also makes debugging easier, since you can see which series was created and therefore which column caused the issue.
The change might not be necessary, but I like the explicitness. It can be useful during debugging.
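To make the thread above concrete, here is a minimal sketch of what a named variant of the helper could look like. The function body is an assumption (only the signature appears in the diff), and the sample data is made up for illustration.

```python
import pandas as pd


def to_unindexed_series(x, name=None):
    # Hypothetical body: only the signature is visible in the diff above.
    # Build a Series, optionally named, and drop whatever index it had.
    return pd.Series(x, name=name).reset_index(drop=True)


# Illustrative data with a non-default index.
source = pd.Series([10, 20], index=[5, 7])
named = to_unindexed_series(source, name="y")

# When a DataFrame is built from a dict, the dict key decides the column
# label; the Series name is mainly useful when inspecting values in a debugger.
frame = pd.DataFrame({"col": named})
```

With `name=None` as the default, existing callers that pass a single argument are unaffected.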
Any update on this PR, @nicolaskruchten?
I'm unavailable to review PRs at the moment, I'm sorry. I defer to @alexcjohnson. Can we get a clearer sense of the performance improvement here? Things go from 6 to 5.4 seconds when plotting one thousand columns in wide mode? This seems like a smallish improvement in a very rare case...
The main thing is that it avoids the PerformanceWarning raised by pandas.
Any update on this, @alexcjohnson?
@legendof-selda this looks great! No comments on your code edits but can you please just add a test, perhaps in test_px_wide.py, that verifies no warnings are emitted when running the code in your example?
Sure, I can do that.
At the bottom of plotly.py/packages/python/plotly/plotly/tests/test_optional/test_px/test_px_wide.py Lines 835 to 849 in da860db
that look conceptually similar to your code above ("To reproduce this warning run the following") - so I think you can just put exactly that same code there but wrapped in an appropriate
Thanks for the guidance, @alexcjohnson.
CI is failing for Python 3.6 due to a chromedriver issue? I don't think these changes affect that.
performance_warnings = [
    warn
    for warn in warn_list
    if issubclass(warn.category, pd.errors.PerformanceWarning)
Does that mean there are other warnings emitted during this px.bar call?
There might be, but this test checks for this warning only. We can change it to look out for any pandas warning.
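The filtering approach discussed here can be sketched with the standard `warnings` module. This is an illustration, not the test from the PR: `warnings_of_category` and `noisy_call` are hypothetical names, and `noisy_call` merely stands in for the `px.bar` call so the sketch does not depend on plotly.

```python
import warnings

import pandas as pd


def warnings_of_category(func, category):
    """Run func and return the recorded warnings that match category."""
    with warnings.catch_warnings(record=True) as warn_list:
        # Record every warning, even ones normally shown only once.
        warnings.simplefilter("always")
        func()
    return [w for w in warn_list if issubclass(w.category, category)]


def noisy_call():
    # Hypothetical stand-in for the plotting call: it emits an unrelated
    # warning but no PerformanceWarning.
    warnings.warn("unrelated notice", UserWarning)


performance_warnings = warnings_of_category(
    noisy_call, pd.errors.PerformanceWarning
)
all_warnings = warnings_of_category(noisy_call, Warning)
```

Filtering by `pd.errors.PerformanceWarning` keeps the test focused on the regression in question. Passing `Warning` instead would make the check stricter, at the cost of failing on unrelated warnings.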
💃 excellent!
The chromedriver issue is fixed on master, I'll update the branch now.
Code PR
plotly.graph_objects, my modifications concern the codegen files and not generated files.
modified existing tests.
new tutorial notebook (please see the doc checklist as well).
Description
When the dataframe is huge, creating a new dataframe by inserting one column at a time slows down pandas. When this happens, pandas issues a PerformanceWarning, as shown here.
Instead of creating or updating the columns in a loop, building a dict, updating it, and finally creating the dataframe from that dict is much faster, and the PerformanceWarning no longer appears, as shown here.
To reproduce this warning run the following
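The original reproduction snippet is not preserved in this text. As a rough, pandas-only sketch of the two approaches described above (not the author's code and not plotly's internals), the following compares column-by-column insertion with building from a dict. The helper names, the column count, and the row count are all assumptions chosen for illustration.

```python
import warnings

import numpy as np
import pandas as pd

# Illustrative sizes: a few hundred columns is assumed to be enough for
# pandas to consider a column-by-column frame "highly fragmented".
N_ROWS, N_COLS = 10, 300
columns = {f"col_{i}": np.full(N_ROWS, i) for i in range(N_COLS)}


def build_by_insert(cols):
    # Slow path: each assignment adds one more column to an existing frame.
    df = pd.DataFrame(index=pd.RangeIndex(N_ROWS))
    for name, values in cols.items():
        df[name] = values
    return df


def build_from_dict(cols):
    # Fast path: collect everything first, construct the frame once.
    return pd.DataFrame(dict(cols))


def count_performance_warnings(build, cols):
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame = build(cols)
    hits = [
        w for w in caught
        if issubclass(w.category, pd.errors.PerformanceWarning)
    ]
    return frame, len(hits)


inserted, n_insert = count_performance_warnings(build_by_insert, columns)
built, n_dict = count_performance_warnings(build_from_dict, columns)
print("insert loop:", n_insert, "PerformanceWarning(s)")
print("from dict:  ", n_dict, "PerformanceWarning(s)")
```

Both paths produce the same data. The difference is that the dict path constructs the frame in a single step, which is why it is expected to avoid the fragmentation warning.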