-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(python): Use primitive constructors to create a Series of lists when dtype is provided #15002
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15002 +/- ##
==========================================
- Coverage 80.98% 80.98% -0.01%
==========================================
Files 1333 1333
Lines 173174 173176 +2
Branches 2458 2460 +2
==========================================
- Hits 140252 140240 -12
- Misses 32454 32468 +14
Partials 468 468 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good, but why were some of the test cases changed? These shouldn't be affected, right?
If you could revert those changes, I'll merge this. (the newly added test cases can stay of course).
EDIT: Nevermind, the test changes are valid 👍
Co-authored-by: Stijn de Gooijer <stijndegooijer@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super happy that we're doing a full loop over the values on the Python side, but the logic in the PyO3 AnyValue code for this is much worse. So this should definitely be an improvement.
For clarity and future reference, one example of an edited test case is the following: Previously, the following was used in a test case: s = pl.Series(
values=[[10e10, 10e15], [10e12, 10e13], [10e10, 10e15]],
dtype=pl.List(pl.Int64),
) With AnyValue this would silently cast to integers. But a similar non-nested case like the following: s = pl.Series(
values=[10e10, 10e15],
dtype=pl.Int64,
) would raise with: Other edits are similar. |
100% agree. I think this PR was a quick win but it looks like there's more room for improvement here. Thanks for the review! |
When a Series of lists is constructed, it uses the AnyValue interface even when the dtype is provided. Besides the performance implications, this often leads to incorrect results because even though the type is provided, it is inferred anyhow and then, if the inferred type differs from the provided type, the result is cast to the latter.
Here are some examples of incorrect cases:
*This case has been reported in #14277.
This PR changes the way a Series of lists is constructed when the dtype is provided, while still falling back to AnyValue when it is not. It does so by recursively applying
sequence_to_pyseries
to every inner sublist and eventually calling the primitive constructor, e.g.,PySeries.new_opt_u64
, that corresponds to the given dtype.This results in:
Note that I had to edit a couple of faulty test cases that emerged due to this change.
Partially closes #14277. The pl.Array case can be addressed in a separate PR.