feat: Improved list arithmetic support #19162

Merged (3 commits) on Oct 14, 2024

Collaborator

@nameexhaustion nameexhaustion commented Oct 9, 2024

Introduces numeric list kernels that operate directly on the list offsets and leaf arrays.

  • Support for
    • List<->List operations where both lists must only have 1 level of nesting
    • List<->Numeric operations, where the list can have any level of nesting
  • Adds support for floor divide
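
As a rough illustration of what "operating directly on the list offsets and leaf arrays" can look like for the single-level List<->List case, here is a minimal pure-Python sketch. The function name `list_list_op`, the plain-list representation of offsets and leaves, and the error raised on mismatched row lengths are all assumptions made for illustration; this is not the Rust kernel added in this PR, and it ignores nulls entirely.

```python
import operator


def list_list_op(lhs_offsets, lhs_leaves, rhs_offsets, rhs_leaves, op):
    """Element-wise op on two single-level list columns stored as offsets + leaves.

    Hypothetical helper: row i of a column is leaves[offsets[i]:offsets[i + 1]].
    """
    if len(lhs_offsets) != len(rhs_offsets):
        raise ValueError("columns have different numbers of rows")
    out_offsets, out_leaves = [0], []
    for row in range(len(lhs_offsets) - 1):
        l0, l1 = lhs_offsets[row], lhs_offsets[row + 1]
        r0, r1 = rhs_offsets[row], rhs_offsets[row + 1]
        # Assumption: paired rows must contain the same number of elements.
        if l1 - l0 != r1 - r0:
            raise ValueError(f"list lengths differ at row {row}")
        out_leaves.extend(
            op(a, b) for a, b in zip(lhs_leaves[l0:l1], rhs_leaves[r0:r1])
        )
        out_offsets.append(len(out_leaves))
    return out_offsets, out_leaves


# [[7, 9], [4]] // [[2, 4], [3]]
result = list_list_op([0, 2, 3], [7, 9, 4], [0, 2, 3], [2, 4, 3], operator.floordiv)
print(result)  # ([0, 2, 3], [3, 2, 1])
```

The point of the sketch is that no per-row list objects are built: the operation walks the two flat leaf arrays and only uses the offsets to find row boundaries.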

Notes

  • I haven't added dedicated codepaths for when the outer validity is NULL to reduce codegen - I've instead forced the outer validity to be materialized
  • Have not added feature gates yet
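
A small sketch of the trade-off in the first note, assuming plain Python lists of booleans stand in for validity bitmaps. In Arrow, an absent validity buffer means every row is valid, so it can be materialized as an all-true mask and fed through a single code path. The helper names `materialize_validity` and `combine_validities` are hypothetical and are not the functions in this PR.

```python
def materialize_validity(validity, n_rows):
    """Turn an absent validity (None) into an explicit all-valid mask."""
    return [True] * n_rows if validity is None else list(validity)


def combine_validities(lhs_validity, rhs_validity, n_rows):
    """A row of the output is valid only if it is valid on both sides."""
    lhs = materialize_validity(lhs_validity, n_rows)
    rhs = materialize_validity(rhs_validity, n_rows)
    return [a and b for a, b in zip(lhs, rhs)]


print(combine_validities(None, [True, False, True], 3))  # [True, False, True]
print(combine_validities(None, None, 2))  # [True, True]
```

Materializing costs an extra allocation when neither side has nulls, but it avoids writing (and compiling) a separate branch for each combination of present and absent masks.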

Fixes #19010
Fixes #19025

@github-actions github-actions bot added the enhancement, python, and rust labels Oct 9, 2024
let mut new_left_dtype = left_dtype.cast_leaf(leaf_super_dtype.clone());
let mut new_right_dtype = right_dtype.cast_leaf(leaf_super_dtype);

// Cast List<->Array to List
Collaborator Author

We already cast List<->Array to List as the supertype since #12016.

@@ -327,6 +328,26 @@ pub trait SeriesTrait:
/// Aggregate all chunks to a contiguous array of memory.
fn rechunk(&self) -> Series;

fn rechunk_validity(&self) -> Option<Bitmap> {
Collaborator Author

Copied from impl<T> ChunkedArray<T> so that it can be used without downcasting

@@ -47,55 +47,6 @@ fn is_cat_str_binary(type_left: &DataType, type_right: &DataType) -> bool {
}
}

fn process_list_arithmetic(
Collaborator Author

@nameexhaustion nameexhaustion Oct 9, 2024

Removed the IR checking / modifying logic from type_coercion. I think this should already be handled by get_arithmetic_field, get_truediv_field, etc.

@@ -1085,7 +1087,8 @@ def _div(self, other: Any, *, floordiv: bool) -> DataFrame:
     int_casts = [
         col(column).cast(tp)
         for i, (column, tp) in enumerate(self.schema.items())
-        if tp.is_integer() and orig_dtypes[i].is_integer()
+        if tp.is_integer()
+        and (orig_dtypes[i].is_integer() or orig_dtypes[i] == Null)
Collaborator Author

This fixes a test case that one of my changes broke:

assert_frame_equal(df_expected, op(df, None))
assert_series_equal(s_expected, op(s, None))

I have no idea how it was working before, though.

as_float = self._recursive_cast_to_dtype(Float64())

return as_float._arithmetic(other, "div", "div_<>")
return self._arithmetic(other, "div", "div_<>")

),
],
)
def test_list_arithmetic_same_size(
Collaborator Author

@nameexhaustion nameexhaustion Oct 9, 2024

I will check these tests back in later to reduce the size of this PR

Contributor

Maybe just add comment saying "no need to review these functions, they're cut/pasted from test_arithmetic.py"?

Collaborator Author

@nameexhaustion nameexhaustion Oct 10, 2024

Thanks for the tip - I realized I can actually leave those tests here for now instead of removing them

BROADCAST_SERIES_COMBINATIONS,
)
@pytest.mark.parametrize("exec_op", EXEC_OP_COMBINATIONS)
def test_list_arithmetic_values(
Collaborator Author

@nameexhaustion nameexhaustion Oct 9, 2024

Magical test case parametrization that runs every codepath in BinaryListNumericOpHelper 😄

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 86.90745% with 116 lines in your changes missing coverage. Please review.

Project coverage is 79.72%. Comparing base (ff10b38) to head (8e7554b).
Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...polars-core/src/series/arithmetic/list_borrowed.rs | 93.64% | 39 Missing ⚠️ |
| crates/polars-compute/src/arithmetic/pl_num.rs | 61.61% | 38 Missing ⚠️ |
| crates/polars-plan/src/plans/aexpr/schema.rs | 71.01% | 20 Missing ⚠️ |
| crates/polars-core/src/series/series_trait.rs | 22.22% | 14 Missing ⚠️ |
| ...ates/polars-core/src/series/arithmetic/borrowed.rs | 82.14% | 5 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19162      +/-   ##
==========================================
+ Coverage   79.67%   79.72%   +0.04%     
==========================================
  Files        1532     1533       +1     
  Lines      209200   209915     +715     
  Branches     2417     2415       -2     
==========================================
+ Hits       166687   167357     +670     
- Misses      41965    42010      +45     
  Partials      548      548              

@nameexhaustion nameexhaustion marked this pull request as ready for review October 9, 2024 13:56
@nameexhaustion nameexhaustion marked this pull request as draft October 10, 2024 04:30
@nameexhaustion nameexhaustion force-pushed the list-arith branch 2 times, most recently from 3bf866d to 09b0d7f Compare October 10, 2024 04:58
@nameexhaustion nameexhaustion marked this pull request as ready for review October 10, 2024 05:05
for o in others {
let slc = o.as_slice();
l = slc[l].to_usize();
r = slc[r].to_usize();
Collaborator

I am not 100% sure, but I feel like this should be r + 1. Might be completely wrong though.

Collaborator Author

It should be correct as it is - Arrow list offsets are defined as

1st row : offsets[0]..offsets[1]
2nd row : offsets[1]..offsets[2]
..and so on
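
To make the offsets convention concrete, here is a small pure-Python sketch. `row_slice` and `leaf_range` are hypothetical names, and starting from the half-open row range `[row, row + 1)` is an assumption for illustration. Because each offsets buffer has one more entry than it has rows, the exclusive end index `r` is itself a valid index into the next buffer, so no `r + 1` is needed when descending a level.

```python
def row_slice(offsets, values, i):
    """Row i of a list column is values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]]


def leaf_range(offsets_per_level, row):
    """Map one outer row to its half-open range [l, r) in the leaf array."""
    l, r = row, row + 1  # half-open range covering a single outer row
    for offsets in offsets_per_level:
        # Both ends are looked up the same way: offsets[r] is where the
        # element at position r would start, i.e. the exclusive end.
        l, r = offsets[l], offsets[r]
    return l, r


# [[[1, 2], [3]], [[4]]] as two offsets buffers plus leaves [1, 2, 3, 4]
levels = [[0, 2, 3], [0, 2, 3, 4]]
print(row_slice(levels[0], ["a", "b", "c"], 0))  # ['a', 'b']
print(leaf_range(levels, 0))  # (0, 3)
print(leaf_range(levels, 1))  # (3, 4)
```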

for o in &offsets[1..] {
let slc = o.as_slice();
l = slc[l].to_usize();
r = slc[r].to_usize();
Collaborator

idem.

| (_, Time)
| (_, Date)
| (_, Datetime(_, _)) => polars_bail!(opq = div, self.dtype(), rhs.dtype()),
_ => match (self.dtype(), rhs.dtype()) {
Collaborator

nit: why is this a nested match?

Collaborator Author

Updated to remove the nested match.

// list lengths.
let mut mismatch_pos = 0;

with_match_numeric_list_op!(&self.op, self.swapped, |$OP| {
Collaborator

I kinda feel like this should not be here. Instead, this should be in polars-compute.

Collaborator Author

@nameexhaustion nameexhaustion Oct 11, 2024

I tried to move the entire file, but I ran into some difficulty as polars-compute doesn't have access to the Series struct - I think we can leave it here for now?

I also don't want to split out parts of the logic in this file as it isn't really used anywhere else


/// Reduce monomorphization
#[inline(never)]
fn combine_validities_list_to_list_no_broadcast(
Collaborator

idem

Collaborator Author

I think I want to keep this here for now - it's a very specialized function that's only used in this file. If we need to use it somewhere else later I can move it then

@@ -56,6 +57,25 @@ impl Series {
}
}

/// TODO: Move this somewhere else?
pub fn list_offsets_and_validities_recursive(
Collaborator

This always outputs Vecs with 1 element.

Collaborator Author

Not necessarily: it returns Vecs with one element per nesting level. For example, for a List(List(Float64)):

print((pl.Series([[[1]]]) / pl.Series([1])).dtype)
List(List(Float64))

[crates/polars-core/src/series/arithmetic/list_borrowed.rs:93:17] lhs.list_offsets_and_validities_recursive().0 = [
    OffsetsBuffer(
        [
            0,
            1,
        ],
    ),
    OffsetsBuffer(
        [
            0,
            1,
        ],
    ),
]
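
The same "one offsets buffer per nesting level" shape can be reproduced from nested Python lists with a short sketch. The function name `offsets_per_level` and the explicit `depth` argument are assumptions for illustration, and nulls are not handled; this is not how `list_offsets_and_validities_recursive` is implemented.

```python
def offsets_per_level(rows, depth):
    """Return (offsets buffers, outermost first; flat leaf values)."""
    levels = []
    current = rows  # the sequence of lists at the current nesting level
    for _ in range(depth):
        offsets, flattened = [0], []
        for item in current:
            flattened.extend(item)
            offsets.append(len(flattened))
        levels.append(offsets)
        current = flattened  # descend one level
    return levels, current


print(offsets_per_level([[[1]]], depth=2))
# ([[0, 1], [0, 1]], [1])
print(offsets_per_level([[[1, 2], [3]], [[4]]], depth=2))
# ([[0, 2, 3], [0, 2, 3, 4]], [1, 2, 3, 4])
```

For `[[[1]]]` this gives two buffers of `[0, 1]`, matching the debug output above.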

@nameexhaustion nameexhaustion marked this pull request as ready for review October 11, 2024 05:03
@itamarst
Contributor

Thanks for doing this!
