Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/pandas Performance Warning Issue due to multiple frame.insert #4246

Merged
merged 8 commits into from
Jul 25, 2023

Conversation

legendof-selda
Copy link
Contributor

@legendof-selda legendof-selda commented Jun 12, 2023

Code PR

  • I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
  • I have added tests (if submitting a new feature or correcting a bug) or
    modified existing tests.
  • For a new feature, I have added documentation examples in an existing or
    new tutorial notebook (please see the doc checklist as well).
  • I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
  • For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

Description

When the dataframe is huge create a new dataframe by inserting a column at a time slows down performance in pandas. When this happens pandas issues a PerformanceWarning. as show here
org_version

instead of creating/updating the columns in a loop, creating a dict and updating it and finally creating a dataframe from that dict is so much faster and the performance warning doesn't appear as shown here.
new_version

To reproduce this warning run the following

import pandas as pd
import numpy as np
import plotly.express as px

n_cols = 1000
n_rows = 1000

columns = list(f"col_{c}" for c in range(n_cols))
index = list(f"i_{r}" for r in range(n_rows))

df = pd.DataFrame(np.random.uniform(size=(n_rows, n_cols)), index=index, columns=columns)

fig = px.bar(
    df,
    x=df.index,
    y=df.columns[:-2],
    labels=df.columns[:-2],
)

@legendof-selda
Copy link
Contributor Author

columns are repeating. will close for now

@legendof-selda
Copy link
Contributor Author

issue fixed.

@legendof-selda
Copy link
Contributor Author

tested in python3.11.3 windows

@legendof-selda
Copy link
Contributor Author

@nicolaskruchten CI is green, may I know what do you think about merging it?

@@ -1064,14 +1062,14 @@ def _escape_col_name(df_input, col_name, extra):
return col_name


def to_unindexed_series(x):
def to_unindexed_series(x, name=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this change is required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, it was creating series without a name. I set it as None in case this function was used externally to avoid breaking compatibility.
When we create a dataframe from a dict, it's safer to have named series. Also, it would be easier to debug in line to see which series was created in order to know the column that caused the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change might not be necessary, but i like the explicitness. Can be useful during debugging.

@legendof-selda
Copy link
Contributor Author

any update on this PR @nicolaskruchten

@nicolaskruchten
Copy link
Contributor

nicolaskruchten commented Jun 29, 2023

I'm unavailable to review PRs at the moment, I'm sorry. I defer to @alexcjohnson.

Can we get a clearer sense of the performance improvement here? Things go from 6 to 5.4 seconds when plotting one thousand columns in wide mode? This seems like a smallish improvement in a very rare case...

@legendof-selda
Copy link
Contributor Author

I'm unavailable to review PRs at the moment, I'm sorry. I defer to @alexcjohnson.

Can we get a clearer sense of the performance improvement here? Things go from 6 to 5.4 seconds when plotting one thousand columns in wide mode? This seems like a smallish improvement in a very rare case...

The main thing is, it avoids the PerformanceWarning raised by pandas

@legendof-selda
Copy link
Contributor Author

Any update on this @alexcjohnson

@alexcjohnson
Copy link
Collaborator

@legendof-selda this looks great! No comments on your code edits but can you please just add a test, perhaps in test_px_wide.py, that verifies no warnings are emitted when running the code in your example?

@legendof-selda
Copy link
Contributor Author

legendof-selda commented Jul 21, 2023

@legendof-selda this looks great! No comments on your code edits but can you please just add a test, perhaps in test_px_wide.py, that verifies no warnings are emitted when running the code in your example?

Sure I can do that
@alexcjohnson can you suggest where the test should be placed. I haven't fully understood the structure of the tests dir.

@alexcjohnson
Copy link
Collaborator

At the bottom of test_px_wide.py there are some tests like this:

def test_line_group():
df = pd.DataFrame(
data={
"who": ["a", "a", "b", "b"],
"x": [0, 1, 0, 1],
"score": [1.0, 2, 3, 4],
"miss": [3.2, 2.5, 1.3, 1.5],
}
)
fig = px.line(df, x="x", y=["miss", "score"])
assert len(fig.data) == 2
fig = px.line(df, x="x", y=["miss", "score"], color="who")
assert len(fig.data) == 4
fig = px.scatter(df, x="x", y=["miss", "score"], color="who")
assert len(fig.data) == 2

that look conceptually similar to your code above ("To reproduce this warning run the following") - so I think you can just put exactly that same code there but wrapped in an appropriate with catch_warnings and associated tests showing that no warnings are emitted, as described in https://docs.python.org/3/library/warnings.html#testing-warnings.

@legendof-selda
Copy link
Contributor Author

Thanks for the guidance @alexcjohnson
I have done the changes you have asked for

@legendof-selda
Copy link
Contributor Author

CI is failing for python3.6 due to a chrome driver issue? I dont think these changes affect that

performance_warnings = [
warn
for warn in warn_list
if issubclass(warn.category, pd.errors.PerformanceWarning)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that mean there are other warnings emitted during this px.bar call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there might be. but this test is for checking this warning only. we can change it look out for any pandas warning

Copy link
Collaborator

@alexcjohnson alexcjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💃 excellent!

The chromedriver issue is fixed on master, I'll update the branch now.

@alexcjohnson alexcjohnson merged commit e670c4b into plotly:master Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants