Fix/pandas Performance Warning Issue due to multiple `frame.insert` #4246

legendof-selda · 2023-06-12T20:30:44Z

Code PR

I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
I have added tests (if submitting a new feature or correcting a bug) or modified existing tests.
For a new feature, I have added documentation examples in an existing or new tutorial notebook (please see the doc checklist as well).
I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

Description

When the dataframe is huge, creating a new dataframe by inserting one column at a time slows down pandas. When this happens, pandas issues a PerformanceWarning, as shown here.

Instead of creating or updating the columns in a loop, building a dict, updating it, and finally creating the dataframe from that dict is much faster, and the performance warning no longer appears, as shown here.
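
As a rough illustration of the two approaches described above (not the actual code in `_core.py`), the sketch below builds the same frame once by adding columns in a loop and once from a dict. The helper names `build_by_insert` and `build_from_dict` are made up for this example, and the column count is kept small so it runs quietly.

```python
import numpy as np
import pandas as pd


def build_by_insert(columns, n_rows):
    """Add one column at a time; pandas stores each new column separately."""
    df = pd.DataFrame(index=range(n_rows))
    for name, values in columns.items():
        df[name] = values  # with hundreds of columns this fragments the frame
    return df


def build_from_dict(columns):
    """Collect every column first, then construct the frame in a single call."""
    return pd.DataFrame(dict(columns))


rng = np.random.default_rng(0)
columns = {f"col_{c}": rng.uniform(size=20) for c in range(50)}

slow = build_by_insert(columns, n_rows=20)
fast = build_from_dict(columns)

# Both routes should hold the same values under the same column names.
print(slow.shape, fast.shape)
print(np.array_equal(slow.to_numpy(), fast.to_numpy()))
```

The result is the same either way; the difference is only in how many intermediate insertions pandas has to perform.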

To reproduce this warning run the following

import pandas as pd
import numpy as np
import plotly.express as px

n_cols = 1000
n_rows = 1000

columns = list(f"col_{c}" for c in range(n_cols))
index = list(f"i_{r}" for r in range(n_rows))

df = pd.DataFrame(np.random.uniform(size=(n_rows, n_cols)), index=index, columns=columns)

fig = px.bar(
    df,
    x=df.index,
    y=df.columns[:-2],
    labels=df.columns[:-2],
)

legendof-selda · 2023-06-12T21:05:37Z

Columns are repeating. Will close for now.

legendof-selda · 2023-06-12T21:59:59Z

issue fixed.

legendof-selda · 2023-06-12T22:01:31Z

Tested with Python 3.11.3 on Windows.

legendof-selda · 2023-06-19T08:41:14Z

@nicolaskruchten CI is green. May I know what you think about merging it?

nicolaskruchten · 2023-06-19T12:34:25Z

packages/python/plotly/plotly/express/_core.py

@@ -1064,14 +1062,14 @@ def _escape_col_name(df_input, col_name, extra):
    return col_name


-def to_unindexed_series(x):
+def to_unindexed_series(x, name=None):


I'm curious why this change is required?

Originally, it was creating series without a name. I set it as None in case this function was used externally to avoid breaking compatibility.
When we create a dataframe from a dict, it's safer to have named series. Also, it would be easier to debug in line to see which series was created in order to know the column that caused the issue.

The change might not be necessary, but I like the explicitness. It can be useful during debugging.
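
To make the discussion concrete, here is a minimal sketch of what a helper with this signature could look like. Only the signature `to_unindexed_series(x, name=None)` comes from the diff; the function body is an assumption for illustration and may differ from the real implementation.

```python
import pandas as pd


def to_unindexed_series(x, name=None):
    # Assumed body: wrap the input in a Series, optionally name it,
    # and replace whatever index it had with a fresh 0..n-1 index.
    return pd.Series(x, name=name).reset_index(drop=True)


indexed = pd.Series([10, 20, 30], index=["a", "b", "c"])

unnamed = to_unindexed_series(indexed)               # old call style still works
named = to_unindexed_series(indexed, name="score")   # new optional argument

print(unnamed.name, named.name)
print(list(named.index))
```

Because `name` defaults to `None`, existing callers that pass a single argument would see no change in behaviour under this sketch.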

legendof-selda · 2023-06-29T11:49:43Z

Any update on this PR, @nicolaskruchten?

nicolaskruchten · 2023-06-29T12:11:41Z

I'm unavailable to review PRs at the moment, I'm sorry. I defer to @alexcjohnson.

Can we get a clearer sense of the performance improvement here? Things go from 6 to 5.4 seconds when plotting one thousand columns in wide mode? This seems like a smallish improvement in a very rare case...
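
One way to get a rough number for this question is a small `timeit` harness. The sketch below times bare pandas construction only, not the full `px.bar` call discussed above, so its figures are not comparable to the 6 vs. 5.4 seconds quoted; the sizes and function names are arbitrary choices for illustration.

```python
import timeit
import warnings

import numpy as np
import pandas as pd

N_ROWS, N_COLS = 200, 500
DATA = np.random.default_rng(0).uniform(size=(N_ROWS, N_COLS))


def insert_loop():
    """Build the frame one column at a time."""
    df = pd.DataFrame(index=range(N_ROWS))
    with warnings.catch_warnings():
        # Silence the fragmentation warning so only the timing is visible.
        warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
        for c in range(N_COLS):
            df[f"col_{c}"] = DATA[:, c]
    return df


def from_dict():
    """Build the frame from a dict of columns in one call."""
    return pd.DataFrame({f"col_{c}": DATA[:, c] for c in range(N_COLS)})


def best_of(func, repeat=3):
    """Return the fastest of `repeat` single runs, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


timings = {"insert loop": best_of(insert_loop), "from dict": best_of(from_dict)}
for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

The absolute numbers depend on the machine and the pandas version, so they are best read as a relative comparison.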

legendof-selda · 2023-07-03T12:29:48Z

> I'm unavailable to review PRs at the moment, I'm sorry. I defer to @alexcjohnson.
>
> Can we get a clearer sense of the performance improvement here? Things go from 6 to 5.4 seconds when plotting one thousand columns in wide mode? This seems like a smallish improvement in a very rare case...

The main thing is that it avoids the PerformanceWarning raised by pandas.

legendof-selda · 2023-07-07T08:21:16Z

Any update on this, @alexcjohnson?

alexcjohnson · 2023-07-19T19:45:44Z

@legendof-selda this looks great! No comments on your code edits but can you please just add a test, perhaps in test_px_wide.py, that verifies no warnings are emitted when running the code in your example?

legendof-selda · 2023-07-21T06:46:48Z

> @legendof-selda this looks great! No comments on your code edits but can you please just add a test, perhaps in test_px_wide.py, that verifies no warnings are emitted when running the code in your example?

Sure, I can do that.
@alexcjohnson can you suggest where the test should be placed? I haven't fully understood the structure of the tests dir.

alexcjohnson · 2023-07-21T21:39:15Z

At the bottom of test_px_wide.py there are some tests like this:

plotly.py/packages/python/plotly/plotly/tests/test_optional/test_px/test_px_wide.py

Lines 835 to 849 in da860db

def test_line_group():
    df = pd.DataFrame(
        data={
            "who": ["a", "a", "b", "b"],
            "x": [0, 1, 0, 1],
            "score": [1.0, 2, 3, 4],
            "miss": [3.2, 2.5, 1.3, 1.5],
        }
    )
    fig = px.line(df, x="x", y=["miss", "score"])
    assert len(fig.data) == 2
    fig = px.line(df, x="x", y=["miss", "score"], color="who")
    assert len(fig.data) == 4
    fig = px.scatter(df, x="x", y=["miss", "score"], color="who")
    assert len(fig.data) == 2

that look conceptually similar to your code above ("To reproduce this warning run the following") - so I think you can just put exactly that same code there but wrapped in an appropriate with catch_warnings and associated tests showing that no warnings are emitted, as described in https://docs.python.org/3/library/warnings.html#testing-warnings.
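
The pattern suggested here can be sketched roughly as follows. To keep the example runnable without plotly, the `px.bar` call from the reproduction script is replaced by a hypothetical stand-in, `build_wide_frame`; the helper `collect_performance_warnings` is likewise an illustrative name, not part of the test suite.

```python
import warnings

import numpy as np
import pandas as pd


def collect_performance_warnings(func):
    """Call func() and return the pandas PerformanceWarnings it emitted."""
    with warnings.catch_warnings(record=True) as warn_list:
        # Record every warning, including ones normally shown only once.
        warnings.simplefilter("always")
        func()
    return [
        warn
        for warn in warn_list
        if issubclass(warn.category, pd.errors.PerformanceWarning)
    ]


def build_wide_frame():
    # Hypothetical stand-in for the px.bar call in the reproduction script.
    n_rows, n_cols = 100, 300
    data = np.random.uniform(size=(n_rows, n_cols))
    return pd.DataFrame({f"col_{c}": data[:, c] for c in range(n_cols)})


def test_no_performance_warning():
    assert collect_performance_warnings(build_wide_frame) == []


test_no_performance_warning()
```

In the real test the stand-in would be swapped for the plotting call, with the same assertion on the filtered list.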

legendof-selda · 2023-07-25T07:35:16Z

Thanks for the guidance @alexcjohnson.
I have made the changes you asked for.

legendof-selda · 2023-07-25T10:34:13Z

CI is failing for Python 3.6 due to a chromedriver issue? I don't think these changes affect that.

alexcjohnson · 2023-07-25T15:20:09Z

packages/python/plotly/plotly/tests/test_optional/test_px/test_px_wide.py

+    performance_warnings = [
+        warn
+        for warn in warn_list
+        if issubclass(warn.category, pd.errors.PerformanceWarning)


does that mean there are other warnings emitted during this px.bar call?

There might be, but this test checks for this warning only. We can change it to look out for any pandas warning.
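
The trade-off between checking one warning category and checking all of them can be sketched with the standard `warnings` filters. The helper `run_strict` below is a hypothetical name used only for this illustration; it turns the chosen category into an exception and tolerates everything else.

```python
import warnings

import pandas as pd


def run_strict(func, category=Warning):
    """Run func(), raising on warnings of `category` and ignoring the rest."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")                    # tolerate other warnings
        warnings.simplefilter("error", category=category)  # added last, checked first
        return func()


def emits_user_warning():
    # Stand-in for code that emits an unrelated warning.
    warnings.warn("unrelated", UserWarning)
    return "done"


# Narrow check: only PerformanceWarning is fatal, so this call succeeds.
print(run_strict(emits_user_warning, category=pd.errors.PerformanceWarning))

# Broad check: any warning is fatal, so the same call now raises.
try:
    run_strict(emits_user_warning)
except UserWarning as exc:
    print("raised:", exc)
```

The narrow form keeps the test focused on the regression being fixed, while the broad form would also fail on unrelated warnings from other libraries.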

alexcjohnson

💃 excellent!

The chromedriver issue is fixed on master, I'll update the branch now.

legendof-selda added 3 commits June 13, 2023 01:44

perf: fix pandas PerformanceWarning caused due to frame.insert

413d41e

chore: fix flake8 and black maxlen not match

fbfd4a8

Revert "perf: fix pandas PerformanceWarning caused due to `frame.inse…

3a4b466

…rt`" This reverts commit 413d41e.

legendof-selda closed this Jun 12, 2023

perf: fix pandas PerformanceWarning caused due to frame.insert

f028571

legendof-selda reopened this Jun 12, 2023

refactor: reuse to_unindexed_series

d4955b1

nicolaskruchten reviewed Jun 19, 2023

View reviewed changes

legendof-selda requested a review from nicolaskruchten June 20, 2023 09:09

Merge branch 'master' into fix/pd_perf_issue

940f25b

alexcjohnson mentioned this pull request Jul 19, 2023

PerformanceWarning: DataFrame is highly fragmented. for Plotly v5.15.0 #4287

Closed

ci: add test for perf warning

97b214e

alexcjohnson reviewed Jul 25, 2023

View reviewed changes

alexcjohnson approved these changes Jul 25, 2023

View reviewed changes

Merge branch 'master' into fix/pd_perf_issue

f1ed3d7

alexcjohnson merged commit e670c4b into plotly:master Jul 25, 2023
