Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): 10% speedup for to_dicts method #6415

Merged
merged 1 commit into from
Jan 25, 2023

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Jan 24, 2023

Note that we will be able to redirect to iterrows(named=True) once 0.16.0 is released, as that will build dictionaries using the same optimisations.

Until then, here's a simple ~10% speedup on the current approach (as tested on a -release build).

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Jan 24, 2023
@gab23r
Copy link
Contributor

gab23r commented Jan 24, 2023

@alexander-beedie Just for my understanding, what is the purpose of doing dict_ = dict ?

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually shouldn't we use rows instead of iterrows?

I think we might be able to remove this method, as df.rows(named=True) offers the same functionality after 0.16.0?

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 25, 2023

Actually shouldn't we use rows instead of iterrows?

As the size of the frame gets very large it's actually more efficient and, perhaps surprisingly, slightly faster to use iterrows (probably because you allocate per slice, not for the whole result set). In local benchmarking there's a transition somewhere between around 20m elements where the crossover happens:

df = pl.DataFrame(
    {
        "x": range(10_000_000),
        "y": date(2023,1,1),
        "z": "abcdefghijklm",
    }
)
with Timer():
    _ = df.rows()
# Elapsed time: 16.1816 seconds

with Timer():
    _ = list(df.iterrows())
# Elapsed time: 15.5870 seconds

I think we might be able to remove this method, as df.rows(named=True) offers the same functionality after 0.16.0?

I'd be in favour of even going one step beyond that and removing rows in favour of just iterrows, given that it's just a list call away, as above 😅 (Should probably better-characterise the domains where rows is faster first though, so we always take the faster of the two options, as it's still a bit faster on smaller data).

(I think I can further speed-up both calls too; will experiment...)

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 25, 2023

@alexander-beedie Just for my understanding, what is the purpose of doing dict_ = dict ?

It results in a python bytecode optimisation because we have such a simple/obvious hot loop; when the function gets run it'll use calls to LOAD_FAST (or probably LOAD_DEREF in this case, inside the list comprehension) when resolving dict/zip instead of calls to the slightly more expensive LOAD_GLOBAL. Localising self.columns does much the same thing, optimising away a LOAD_ATTR. It's a trick rarely worth using if you have much else going on in your loop, but here it gets us about an 8% speedup vs not doing it, so...

@ritchie46 ritchie46 merged commit fe22509 into pola-rs:master Jan 25, 2023
@alexander-beedie alexander-beedie deleted the speedup-to-dicts branch January 25, 2023 09:19
@gab23r
Copy link
Contributor

gab23r commented Jan 25, 2023

Thanks @alexander-beedie for the explanation !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants