Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Address inadvertent quadratic behaviour in expand_columns #19469

Merged

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Oct 26, 2024

Closes #19456.

Note: while the Issue mentions selectors, the problem was more generic; selectors were just an easy way to trigger it.

Tracked it down to the nesting of replace_columns_with_column inside the expand_columns loop, which introduced quadratic behaviour in column name comparisons. Refactored the function so the expensive "names" comparison could be moved outside of the loop, and streamlined all of the surrounding code.

New code is half the size and O(n) ✌️

(Did a minor cleanup of replace_selector_inner since I was looking at it, but no performance impact there).

Before/After

Test case

from codetiming import Timer
import polars.selectors as cs
import polars as pl
import random

for n_cols in (
    500,
    1_000,
    2_500,
    5_000,
    10_000,
    20_000,
    30_000,
    40_000,
    50_000,
    75_000,
    100_000,
):
    df = pl.DataFrame({f"col{i}": [0] for i in range(n_cols)})
    omit_cols = set(random.sample(df.columns, int(n_cols * 0.05)))
    
    with Timer():
        df.select(~cs.by_name(omit_cols))

Timings

Time taken vs number of columns:

timings
n_cols before after speedup
500 0.0015 0.0006 2.5x
1,000 0.0048 0.0008 6.0x
2,500 0.0263 0.0017 15.5x
5,000 0.1052 0.0033 31.9x
10,000 0.4088 0.0066 61.9x
20,000 1.2663 0.0140 90.5x
30,000 2.5687 0.0205 125.3x
40,000 4.3345 0.0285 152.1x
50,000 6.5638 0.0373 176.0x
75,000 14.5153 0.0651 223.0x
100,000 24.7464 0.0897 275.9x

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Oct 26, 2024
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label Oct 26, 2024
Copy link

codecov bot commented Oct 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.93%. Comparing base (48194d0) to head (ff589f7).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19469      +/-   ##
==========================================
- Coverage   79.94%   79.93%   -0.01%     
==========================================
  Files        1534     1534              
  Lines      211097   211162      +65     
  Branches     2444     2444              
==========================================
+ Hits       168766   168798      +32     
- Misses      41776    41809      +33     
  Partials      555      555              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46 ritchie46 merged commit 60d0721 into pola-rs:main Oct 27, 2024
40 checks passed
@alexander-beedie alexander-beedie deleted the fix-quadratic-column-expansion branch October 27, 2024 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Selectors Subtraction performance scales poorly with the number of columns
2 participants