Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selectors Subtraction performance scales poorly with the number of columns #19456

Closed
2 tasks done
etrotta opened this issue Oct 25, 2024 · 2 comments · Fixed by #19469
Closed
2 tasks done

Selectors Subtraction performance scales poorly with the number of columns #19456

etrotta opened this issue Oct 25, 2024 · 2 comments · Fixed by #19469
Labels
A-selectors Area: column selectors bug Something isn't working performance Performance issues or improvements python Related to Python Polars

Comments

@etrotta
Copy link
Contributor

etrotta commented Oct 25, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import polars.selectors as cs
import random

N_COLS = 100_000

df = pl.DataFrame({f'field_{i}': [0] for i in range(N_COLS)})
to_remove = set(random.sample(df.columns, 100))

# Some fast examples with the same output
df.select([col for col in df.columns if col not in to_remove])
df.select(pl.exclude(to_remove))
df.drop(to_remove)

# selectors: Extremely slow
df.select(cs.all() - cs.by_name(to_remove))
df.select(cs.exclude(to_remove))
df.select(~cs.by_name(to_remove))

Log output

No response

Issue description

Seems like the actual operation is implemented in

Selector::Sub(lhs, rhs) => {
replace_selector_inner(*lhs, members, scratch, schema, keys)?;
let mut rhs_members = Default::default();
replace_selector_inner(*rhs, &mut rhs_members, scratch, schema, keys)?;
let mut new_members = PlIndexSet::with_capacity(members.len());
for e in members.drain(..) {
if !rhs_members.contains(&e) {
new_members.insert(e);
}
}
*members = new_members;
},

Expected behavior

For Selectors to work as fast as the alternatives mentioned above

Installed versions

--------Version info---------
Polars:              1.11.0
Index type:          UInt32
Platform:            Windows-11-10.0.22631-SP0
Python:              3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.9.0
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.8.2
nest_asyncio         <not installed>
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.2
pyarrow              15.0.0
pydantic             2.5.2
pyiceberg            <not installed>
sqlalchemy           2.0.29
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@etrotta etrotta added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Oct 25, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 26, 2024

It seems to be a more generic issue than subtract - a quick test shows that any selector combination ("union", "subtract", "xor", and even "not") results in a rapid performance drop as the number of column becomes large; we may have some quadratic behaviour lurking somewhere.

Looking further, I don't think the issue is specific to selector expansion 🤔

@alexander-beedie alexander-beedie added performance Performance issues or improvements A-selectors Area: column selectors labels Oct 26, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 26, 2024

Got it; thanks for the clean test case 👌
PR/fix incoming...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-selectors Area: column selectors bug Something isn't working performance Performance issues or improvements python Related to Python Polars
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants