feat: support 'col IN (a, b, c)' type expressions #652

roeap · 2025-01-18T15:48:35Z

What changes are proposed in this pull request?

Currently, evaluation of expressions of type col IN (a, b, c) is missing an implementation. While this might be the exact case @scovich cautioned us about, where the rhs might get significant in size and should really be handled as EngineData, I hope that we at least do not make things worse here. Unfortunately delta-rs already has support for these types of expressions, so the main intent right now is to retain feature parity over there while migrating.

How was this change tested?

Additional tests for specific expression flavor.

codecov · 2025-01-18T15:52:56Z

Codecov Report

Attention: Patch coverage is 91.84953% with 52 lines in your changes missing coverage. Please review.

Project coverage is 84.50%. Comparing base (68f4790) to head (41030fc).

Files with missing lines	Patch %	Lines
kernel/src/engine/arrow_expression/in_list.rs	87.93%	15 Missing and 33 partials ⚠️
kernel/src/engine/arrow_expression.rs	98.32%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #652      +/-   ##
==========================================
+ Coverage   84.08%   84.50%   +0.42%     
==========================================
  Files          77       78       +1     
  Lines       17777    18340     +563     
  Branches    17777    18340     +563     
==========================================
+ Hits        14948    15499     +551     
+ Misses       2115     2097      -18     
- Partials      714      744      +30


Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap · 2025-01-18T16:31:25Z

kernel/src/engine/arrow_conversion.rs

@@ -208,7 +208,7 @@ impl TryFrom<&ArrowDataType> for DataType {
            ArrowDataType::Date64 => Ok(DataType::DATE),
            ArrowDataType::Timestamp(TimeUnit::Microsecond, None) => Ok(DataType::TIMESTAMP_NTZ),
            ArrowDataType::Timestamp(TimeUnit::Microsecond, Some(tz))
-                if tz.eq_ignore_ascii_case("utc") =>
+                if tz.eq_ignore_ascii_case("utc") || tz.eq_ignore_ascii_case("+00:00") =>


The data in arrow arrays should always represent a timestamp in UTC, so is this check even necessary?

https://github.com/apache/arrow-rs/blob/af777cd53e56f8382382137b6e08af249c475397/arrow-schema/src/datatype.rs#L179-L182

Did we ever get an answer for this?

OIC -- we probably shouldn't have been testing for even "utc" case, because both Delta and arrow store physically UTC timestamps. The TZ of an arrow timestamp is only a hint about which TZ should be used to display the timestamp.

kernel/src/engine/arrow_expression.rs

scovich · 2025-01-20T20:04:37Z

kernel/src/engine/arrow_expression.rs

+                    ad: &ArrayData,
+                ) -> BooleanArray {
+                    #[allow(deprecated)]
+                    let res = col.map(|val| val.map(|v| ad.array_elements().contains(&v.into())));


I don't think this handles NULL values correctly? See e.g. https://spark.apache.org/docs/3.5.1/sql-ref-null-semantics.html#innot-in-subquery-:

TRUE is returned when the non-NULL value in question is found in the list

FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values

UNKNOWN is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

I think, instead of calling contains, you could borrow the code from PredicateEvaluatorDefaults::finish_eval_variadic, with true as the "dominator" value.

Actually, I think you could just invoke that method directly, with a properly crafted iterator?

// `v IN (k1, ..., kN)` is logically equivalent to `v = k1 OR ... OR v = kN`, so evaluate
// it as such, ensuring correct handling of NULL inputs (including `Scalar::Null`).
col.map(|v| {
    PredicateEvaluatorDefaults::finish_eval_variadic(
        VariadicOperator::Or,
        inlist.iter().map(Some(Scalar::partial_cmp(v?, k?)? == Ordering::Equal)),
        false,
    )
})
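To make the OR-based evaluation concrete, here is a minimal standard-library sketch. It is not the kernel's `finish_eval_variadic`: `kleene_or` and `sql_in` are hypothetical names, `Option<bool>` stands in for TRUE/FALSE/NULL, and plain `i64` values replace `Scalar`.

```rust
/// Three-valued OR: TRUE is the dominator and short-circuits; otherwise any
/// NULL input makes the result NULL; otherwise the result is FALSE.
fn kleene_or(inputs: impl IntoIterator<Item = Option<bool>>) -> Option<bool> {
    let mut saw_null = false;
    for input in inputs {
        match input {
            Some(true) => return Some(true),
            Some(false) => {}
            None => saw_null = true,
        }
    }
    if saw_null {
        None
    } else {
        Some(false)
    }
}

/// `v IN (k1, ..., kN)` evaluated as `v = k1 OR ... OR v = kN`.
/// A NULL on either side of a pairwise comparison yields NULL for that pair.
fn sql_in(v: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    kleene_or(list.iter().map(|k| Some(v? == (*k)?)))
}
```

With this shape, `sql_in(Some(1), &[Some(2), None])` evaluates to `None` rather than `Some(false)`, which is the case a plain `contains` call gets wrong.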

Was I correct in thinking that None - no dominant value, but found Null - should just be false in this case?

scovich · 2025-01-20T20:12:42Z

kernel/src/engine/arrow_expression.rs

+                    ad: &ArrayData,
+                ) -> BooleanArray {
+                    #[allow(deprecated)]
+                    let res = col.map(|val| val.map(|v| ad.array_elements().contains(&v.into())));


Aside: We actually have a lurking bug -- Scalar derives PartialEq which will allow two Scalar::Null to compare equal. But SQL semantics dictate that NULL doesn't compare equal to anything -- not even itself.

Our manual impl of PartialOrd for Scalar does this correctly, but it breaks the rules for PartialEq:

If PartialOrd or Ord are also implemented for Self and Rhs, their methods must also be consistent with PartialEq (see the documentation of those traits for the exact requirements). It’s easy to accidentally make them disagree by deriving some of the traits and manually implementing others.

Looks like we'll need to define a manual impl PartialEq for Scalar that follows the same approach.

This is indeed not covered. Added an implementation for PartialEq that mirrors PartialOrd.
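For illustration, a minimal sketch of a manual `PartialEq` that mirrors `PartialOrd`. It assumes a toy two-variant `Scalar`; the real enum has many more variants and its `Null` variant carries a data type.

```rust
use std::cmp::Ordering;

/// Toy stand-in for the kernel's `Scalar` (hypothetical, two variants only).
#[derive(Debug, Clone)]
enum Scalar {
    Integer(i32),
    Null,
}

impl PartialOrd for Scalar {
    fn partial_cmp(&self, other: &Scalar) -> Option<Ordering> {
        match (self, other) {
            (Scalar::Integer(a), Scalar::Integer(b)) => a.partial_cmp(b),
            // NULL is incomparable with everything, including another NULL.
            _ => None,
        }
    }
}

impl PartialEq for Scalar {
    fn eq(&self, other: &Scalar) -> bool {
        // Delegating to `partial_cmp` keeps the two traits consistent.
        self.partial_cmp(other) == Some(Ordering::Equal)
    }
}
```

Because `eq` delegates to `partial_cmp`, two `Scalar::Null` values no longer compare equal, which is what a derived `PartialEq` would have allowed.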

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

kernel/src/engine/arrow_expression.rs

scovich · 2025-01-25T04:29:02Z

kernel/src/engine/arrow_expression.rs

+                            Some(
+                                PredicateEvaluatorDefaults::finish_eval_variadic(
+                                    VariadicOperator::Or,
+                                    inlist.iter().map(|k| v.as_ref().map(|vv| vv == k)),


This isn't correct -- we need comparisons against Scalar::Null to return None. That's why I had previously recommended using Scalar::partial_cmp instead of ==.

Also, can we not use ? to unwrap the various options here?

Suggested change

inlist.iter().map(|k| v.as_ref().map(|vv| vv == k)),

inlist.iter().map(Some(Scalar::partial_cmp(v?, k?)? == Ordering::Equal)),

Unpacking that -- if the value we search for is NULL, or if the inlist entry is NULL, or if the two values are incomparable, then return None for that pair. Otherwise, return Some boolean indicating whether the values compared equal or not. That automatically covers the various required cases, and also makes us robust to any type mismatches that might creep in.

Note: If we wanted to be a tad more efficient, we could also unpack v outside the inner loop:

values.map(|v| { let v = v?; PredicateEvaluatorDefaults::finish_eval_variadic(...) })

Hmm -- empty in-lists pose a corner case with respect to unpacking v:

NULL IN ()

Operator OR with zero inputs normally produces FALSE (which is correct if you stop to think about it) -- but unpacking a NULL v first makes the operator return NULL instead (which is also correct if you squint, because NULL input always produces NULL output).

Unfortunately, the only clear docs I could find -- https://spark.apache.org/docs/3.5.1/sql-ref-null-semantics.html#innot-in-subquery- -- are also ambiguous:

Conceptually a IN expression is semantically equivalent to a set of equality condition separated by a disjunctive operator (OR).

... suggests FALSE while

UNKNOWN is returned when the value is NULL

... suggests NULL

The difference matters for NOT IN, because NULL NOT IN () would either return TRUE (keep rows) or NULL (drop row).

NOTE: SQL engines normally forbid statically empty in-list but do not forbid subqueries from producing an empty result.

I tried the following expression on three engines (sqlite, mysql, postgres):

SELECT 1 WHERE NULL NOT IN (SELECT 1 WHERE FALSE)

And all three returned 1. So OR semantics prevail, and we must NOT unpack v outside the loop.

Hoping I now considered all your comments, which essentially means going with your original version.
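The empty in-list corner case can be reproduced in a few lines. This is a sketch under the same simplifications as the discussion above (`Option<bool>` as three-valued logic, `i64` instead of `Scalar`); all function names are hypothetical.

```rust
/// Three-valued OR over the inputs; zero inputs produce FALSE.
fn kleene_or(inputs: impl IntoIterator<Item = Option<bool>>) -> Option<bool> {
    let mut saw_null = false;
    for input in inputs {
        match input {
            Some(true) => return Some(true),
            Some(false) => {}
            None => saw_null = true,
        }
    }
    if saw_null {
        None
    } else {
        Some(false)
    }
}

/// Unwraps `v` per pair, inside the loop: `NULL IN ()` is FALSE.
fn in_list_per_pair(v: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    kleene_or(list.iter().map(|k| Some(v? == (*k)?)))
}

/// Unwraps `v` once, before the loop: `NULL IN ()` becomes NULL.
fn in_list_unpack_first(v: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    let v = v?;
    kleene_or(list.iter().map(|k| Some(v == (*k)?)))
}
```

The two variants agree on every non-empty list and differ only for a NULL value searched in an empty list, which is exactly the case that decides whether `NULL NOT IN ()` keeps or drops the row.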

kernel/src/engine/arrow_expression.rs

scovich · 2025-01-25T04:41:45Z

kernel/src/engine/arrow_expression.rs

+                    (ArrowDataType::Utf8, PrimitiveType::String) => op_in(inlist, str_op(column.as_string::<i32>().iter())),
+                    (ArrowDataType::LargeUtf8, PrimitiveType::String) => op_in(inlist, str_op(column.as_string::<i64>().iter())),
+                    (ArrowDataType::Utf8View, PrimitiveType::String) => op_in(inlist, str_op(column.as_string_view().iter())),
+                    (ArrowDataType::Int8, PrimitiveType::Byte) => op_in(inlist,op::<Int8Type>( column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Int16, PrimitiveType::Short) => op_in(inlist,op::<Int16Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Int32, PrimitiveType::Integer) => op_in(inlist,op::<Int32Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Int64, PrimitiveType::Long) => op_in(inlist,op::<Int64Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Float32, PrimitiveType::Float) => op_in(inlist,op::<Float32Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Float64, PrimitiveType::Double) => op_in(inlist,op::<Float64Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Date32, PrimitiveType::Date) => op_in(inlist,op::<Date32Type>(column.as_ref(), Scalar::Date)),


These are all a lot longer than 100 chars... why doesn't the fmt check blow up??

scovich · 2025-01-25T04:48:13Z

kernel/src/engine/arrow_expression.rs

@@ -280,6 +281,84 @@ fn evaluate_expression(
                    (ArrowDataType::Decimal256(_, _), Decimal256Type)
                }
            }
+            (Column(name), Literal(Scalar::Array(ad))) => {
+                fn op<T: ArrowPrimitiveType>(
+                    values: &dyn Array,


Suggested change

values: &dyn Array,

values: ArrayRef,

(avoids the need for .as_ref() at the call site)

i think the main thing was that we need this to be a reference, otherwise the compiler starts complaining about lifetimes. I did shorten the code at the call-site a bit, hope that works as well.

Why would an ArrayRef (= Arc<dyn Array>) give lifetime problems, sorry?
We can always call as_ref() on it to get a reference that lives at least as long as the arc?

not the parameter itself, but the as_primitive cast inside the functions returns a ref, which we then iterate over. This then causes issues with the iterator referencing data owned by the function.
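The lifetime issue can be reproduced without arrow. A minimal sketch, assuming a plain slice in place of the primitive array and an `Arc<[_]>` in place of `ArrayRef`; `widen` and `demo` are hypothetical names.

```rust
use std::sync::Arc;

/// Borrowing the input lets the returned iterator borrow from data that the
/// caller owns, so it can outlive this function call.
fn widen<'a>(values: &'a [Option<i32>]) -> impl Iterator<Item = Option<i64>> + 'a {
    values.iter().map(|&v| v.map(i64::from))
}

// Taking the Arc by value does not work for a lazily evaluated iterator:
// the function would own `values`, and the iterator would borrow from a
// parameter that is dropped when the function returns.
//
// fn widen_owned(values: Arc<[Option<i32>]>) -> impl Iterator<Item = Option<i64>> {
//     values.iter().map(|&v| v.map(i64::from)) // rejected by the borrow checker
// }

/// The caller keeps the Arc alive and passes a reference into `widen`.
fn demo() -> Vec<Option<i64>> {
    let column: Arc<[Option<i32>]> = Arc::from(vec![Some(1), None, Some(3)]);
    widen(&column).collect()
}
```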

kernel/src/expressions/scalars.rs

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

scovich

I think we're in good shape now, just need missing tests.

scovich · 2025-01-30T19:31:07Z

kernel/src/expressions/scalars.rs

-        }
+impl PartialEq for Scalar {
+    fn eq(&self, other: &Scalar) -> bool {
+        self.partial_cmp(other) == Some(Ordering::Equal)


aside: PartialOrd requires Self: PartialEq, but we can still invoke the former from the latter?
(feels somehow like a circular dependency, but I guess if it compiles it works)

kernel/src/engine/arrow_expression.rs

scovich · 2025-01-30T19:43:02Z

kernel/src/engine/arrow_expression.rs

@@ -692,6 +771,85 @@ mod tests {
        assert_eq!(in_result.as_ref(), &in_expected);
    }

+    #[test]
+    fn test_column_in_array() {


Relating to #652 (comment) and #652 (comment) -- we don't seem to have any tests that cover NULL value semantics?

1 IN (1, NULL) -- TRUE
1 IN (2, NULL) -- NULL
NULL IN (1, 2) -- NULL
1 NOT IN (1, NULL) -- FALSE (**)
1 NOT IN (2, NULL) -- NULL
NULL NOT IN (1, 2) -- NULL

(**) NOTE from https://spark.apache.org/docs/3.5.1/sql-ref-null-semantics.html#innot-in-subquery-:

NOT IN always returns UNKNOWN when the list contains NULL, regardless of the input value. This is because IN returns UNKNOWN if the value is not in the list containing NULL, and because NOT UNKNOWN is again UNKNOWN.

IMO, that explanation is confusing and factually incorrect. If we explain it in terms of NOT(OR):

1 NOT IN (1, NULL)
  = NOT(1 IN (1, NULL))
  = NOT(1 = 1 OR 1 = NULL)
  = NOT(1 = 1) AND NOT(1 = NULL)
  = 1 != 1 AND 1 != NULL
  = FALSE AND NULL
  = FALSE

As additional support for my claim: sqlite, postgres, and mysql all return FALSE (not NULL) for that expression.
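The six cases listed above can be checked mechanically. A small sketch, assuming `Option<bool>` for three-valued results and `i64` values; `sql_in` and `sql_not_in` are hypothetical names, and NOT is modeled as mapping over the option so that NULL stays NULL.

```rust
/// `v IN (list)` with SQL NULL semantics.
fn sql_in(v: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    let mut saw_null = false;
    for k in list {
        match (v, *k) {
            (Some(a), Some(b)) if a == b => return Some(true),
            (Some(_), Some(_)) => {}
            // A NULL on either side makes this pair's comparison NULL.
            _ => saw_null = true,
        }
    }
    if saw_null {
        None
    } else {
        Some(false)
    }
}

/// `v NOT IN (list)` is `NOT (v IN (list))`; NOT NULL is NULL.
fn sql_not_in(v: Option<i64>, list: &[Option<i64>]) -> Option<bool> {
    sql_in(v, list).map(|found| !found)
}
```

In particular, `sql_not_in(Some(1), &[Some(1), None])` is `Some(false)`: the match on `1 = 1` dominates the OR before the NULL entry matters.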

Added some tests according to the cases mentioned above, hopefully covering all of them. This uncovered some cases where we were not handling NULLs correctly in the other in-list branches, mainly b/c the arrow kernels don't seem to adhere to the SQL NULL semantics.

In addition to the engines above, I also tried duckdb and datafusion, which also support @scovich's claim.

also, is this something worth upstreaming to arrow-rs similar to the *_kleene variants for other kernels?

is this something worth upstreaming to arrow-rs similar to the *_kleene variants for other kernels?

Potentially, yes? A benefit of upstreaming is that it should be able to embed the null tests directly (like we did with the finish_eval_variadic call), instead of requiring a second pass for the null_count like you had to do here?

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap · 2025-02-02T17:17:15Z

kernel/src/engine/arrow_expression.rs

+                        .map(|k| Some(lit.partial_cmp(k)? == Ordering::Equal)),
+                    false,
+                );
+                Ok(Arc::new(BooleanArray::from(vec![exists; batch.num_rows()])))


aside from the null handling, I think we should return the result for each row in the input batch?

That sounds right? It's a literal evaluation, so the result should be the same for all rows.

Also checked with other engines, they do the same.

kernel/src/expressions/scalars.rs

scovich

Can we split out the PartialEq fix and the Decimal support as two changes of their own? AFAIK, the three changes are orthogonal, and colocated here only because delta-rs needs them all?

Meanwhile, the PartialEq should have easily been able to merge before now, had it been separate, and it's looking like our decimal woes could unnecessarily delay in-list support that is otherwise pretty close to merge-ready.

scovich · 2025-02-04T04:15:20Z

kernel/src/engine/arrow_expression.rs

-            (Literal(_), Column(_)) => {
+            (Literal(lit), Column(_)) => {
+                if lit.is_null() {
+                    return Ok(Arc::new(BooleanArray::from(vec![None; batch.num_rows()])));


Suggested change

return Ok(Arc::new(BooleanArray::from(vec![None; batch.num_rows()])));

return Ok(Arc::new(BooleanArray::new_null(batch.num_rows())));

kernel/src/engine/arrow_expression.rs

scovich · 2025-02-04T04:22:59Z

kernel/src/engine/arrow_expression.rs

+                        let in_list_result =
+                            in_list_utf8(string_arr, right_arr).map_err(Error::generic_err)?;
+                        return Ok(wrap_comparison_result(
+                            in_list_result
+                                .iter()
+                                .zip(right_arr.iter())
+                                .map(|(res, arr)| match (res, arr) {
+                                    (Some(false), Some(arr)) if arr.null_count() > 0 => None,
+                                    _ => res,
+                                })
+                                .collect(),
+                        ));


Wouldn't it be cleaner to chain all that up at once, and then wrap up the result?

Suggested change

let in_list_result =
    in_list_utf8(string_arr, right_arr).map_err(Error::generic_err)?;
return Ok(wrap_comparison_result(
    in_list_result
        .iter()
        .zip(right_arr.iter())
        .map(|(res, arr)| match (res, arr) {
            (Some(false), Some(arr)) if arr.null_count() > 0 => None,
            _ => res,
        })
        .collect(),
));

let in_list_result = in_list_utf8(string_arr, right_arr)
    .map_err(Error::generic_err)?
    .iter()
    .zip(right_arr)
    .map(|(res, arr)| match (res, arr) {
        (Some(false), Some(arr)) if arr.null_count() > 0 => None,
        _ => res,
    })
    .collect();
return Ok(wrap_comparison_result(in_list_result));

scovich · 2025-02-04T04:23:48Z

kernel/src/engine/arrow_expression.rs

+                        return Ok(wrap_comparison_result(
+                            in_list_result
+                                .iter()
+                                .zip(right_arr.iter())


note: zip takes IntoIterator, so we don't need to call .iter() here

scovich · 2025-02-04T04:24:50Z

kernel/src/engine/arrow_expression.rs

+                        .map(|k| Some(lit.partial_cmp(k)? == Ordering::Equal)),
+                    false,
+                );
+                Ok(Arc::new(BooleanArray::from(vec![exists; batch.num_rows()])))


That sounds right? It's a literal evaluation, so the result should be the same for all rows.

scovich · 2025-02-04T04:28:31Z

kernel/src/engine/arrow_expression.rs

+        let in_op = Expression::binary(BinaryOperator::In, "one", column_expr!("item"));
+        let in_result =
+            evaluate_expression(&in_op, &batch, Some(&DeltaDataTypes::BOOLEAN)).unwrap();
+        let in_expected = BooleanArray::from(vec![Some(true), None, Some(true)]);


This test cannot ever produce Some(false) because every inlist contains a NULL?

added cases to cover that case.

scovich · 2025-02-04T04:30:39Z

kernel/src/engine/arrow_expression.rs

@@ -692,6 +771,85 @@ mod tests {
        assert_eq!(in_result.as_ref(), &in_expected);
    }

+    #[test]
+    fn test_column_in_array() {


is this something worth upstreaming to arrow-rs similar to the *_kleene variants for other kernels?

Potentially, yes? A benefit of upstreaming is that it should be able to embed the null tests directly (like we did with the finish_eval_variadic call), instead of requiring a second pass for the null_count like you had to do here?

scovich · 2025-02-04T04:33:43Z

kernel/src/engine/arrow_utils.rs

+                let in_list_result = arrow_ord::comparison::in_list(prim_array, list_array).map_err(Error::generic_err)?;
+                Ok(wrap_comparison_result(
+                    in_list_result
+                        .iter()
+                        .zip(list_array.iter())
+                        .map(|(res, arr)| match (res, arr) {
+                            (Some(false), Some(arr)) if arr.null_count() > 0 => None,
+                            _ => res,
+                        })
+                        .collect(),
+                ))


See other comment about formatting. But also, should we factor this logic out as a helper function instead of duplicating it?
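A factored-out helper might look roughly like the following sketch. It uses plain `Vec`s instead of arrow arrays so it can run on its own; the name `fix_in_list_result` matches the helper that appears later in this thread, but the signature here is an assumption for illustration only.

```rust
/// Post-processes the output of a two-valued "contains" kernel so that it
/// follows SQL NULL semantics: a row that reported FALSE against an in-list
/// containing at least one NULL becomes NULL (UNKNOWN) instead.
///
/// `results[i]` is the kernel result for row i, and `lists[i]` is that row's
/// in-list (`None` means the list itself is NULL).
fn fix_in_list_result(
    results: impl IntoIterator<Item = Option<bool>>,
    lists: impl IntoIterator<Item = Option<Vec<Option<i32>>>>,
) -> Vec<Option<bool>> {
    results
        .into_iter()
        .zip(lists)
        .map(|(res, list)| match (res, list) {
            (Some(false), Some(list)) if list.iter().any(Option::is_none) => None,
            (res, _) => res,
        })
        .collect()
}
```

Both call sites could then pass their kernel result and list column to the same function, which also gives the unit tests a single place to verify the true/false/null behavior.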

kernel/src/expressions/scalars.rs

scovich · 2025-02-04T05:08:23Z

kernel/src/engine/arrow_expression.rs

+                            PredicateEvaluatorDefaults::finish_eval_variadic(
+                                VariadicOperator::Or,
+                                inlist
+                                    .iter()
+                                    .map(|k| Some(v.as_ref()?.partial_cmp(k)? == Ordering::Equal)),
+                                false,


With a second use site now (below), we should probably factor this out as a helper function?

We just need to figure out how to handle v.as_ref()? here vs. plain lit below, without causing an unintentional (and incorrect) early-out here. Maybe the helper just needs to take Option<&Scalar>; then this call site here would pass v.as_ref() and the other would pass Some(lit)? Compiler should optimize away the useless ? in the latter case.

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap · 2025-02-04T15:44:07Z

kernel/src/engine/arrow_utils.rs

-macro_rules! prim_array_cmp {
-    ( $left_arr: ident, $right_arr: ident, $(($data_ty: pat, $prim_ty: ty)),+ ) => {
-
-        return match $left_arr.data_type() {
-        $(
-            $data_ty => {
-                let prim_array = $left_arr.as_primitive_opt::<$prim_ty>()
-                        .ok_or(Error::invalid_expression(
-                            format!("Cannot cast to primitive array: {}", $left_arr.data_type()))
-                        )?;
-                    let list_array = $right_arr.as_list_opt::<i32>()
-                        .ok_or(Error::invalid_expression(
-                            format!("Cannot cast to list array: {}", $right_arr.data_type()))
-                        )?;
-                arrow_ord::comparison::in_list(prim_array, list_array).map(wrap_comparison_result)
-            }
-        )+
-            _ => Err(ArrowError::CastError(
-                        format!("Bad Comparison between: {:?} and {:?}",
-                            $left_arr.data_type(),
-                            $right_arr.data_type())
-                        )
-                )
-        }.map_err(Error::generic_err);
-    };
-}
-
-pub(crate) use prim_array_cmp;


this macro relies on functions defined elsewhere and is only useful for the inlist implementation so I moved it over there.

roeap · 2025-02-04T15:48:04Z

kernel/src/engine/arrow_expression.rs

+                if let Some(right_arr) = right_arr.as_list_opt::<i32>() {
+                    return Ok(fix_in_list_result(
+                        in_list_utf8(string_arr, right_arr).map_err(Error::generic_err)?,
+                        right_arr.iter(),


Somehow I could not get it to work w/o calling iter.

That function takes an IntoIterator<Item = Option<ArrayRef>>, which would consume right_arr while still borrowed by the previous function arg. I suspect you could satisfy the borrow checker by passing &right_arr (whose IntoIter should be the same as calling .iter()), or by factoring out the in_list_utf8 call so the borrow dies before you start building args for the second call:

let result = in_list_utf8(string_arr, right_arr).map_err(Error::generic_err)?;
return Ok(fix_in_list_result(result, right_arr));

(the latter approach has the nice side effect of simplifying the code as well, even if cargo fmt will likely put the map_err call on its own line)

scovich · 2025-02-04T14:53:17Z

kernel/src/engine/arrow_expression.rs

+                        ad, str_op(column.as_string_view())
+                    ),
+                    (ArrowDataType::Int8, PrimitiveType::Byte) =>  is_in_list(
+                        ad,op::<Int8Type>( &column, Scalar::from)


aside: I'm surprised cargo fmt didn't adjust this?

Suggested change

ad,op::<Int8Type>( &column, Scalar::from)

ad, op::<Int8Type>(&column, Scalar::from)

(but I guess it doesn't run reliably on this code in the first place)

scovich · 2025-02-04T15:46:16Z

kernel/src/engine/arrow_expression.rs

+        }
+        (Literal(lit), Literal(Scalar::Array(ad))) => {
+            let res = is_in_list(ad, Some(Some(lit.clone())));
+            let exists = res.is_valid(0).then(|| res.value(0));


aside: I'm a bit surprised arrow doesn't provide an optional getter, but nothing obvious turned up when I looked.
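As an illustration of the `is_valid(0).then(|| value(0))` idiom, here is a sketch against a toy nullable column. `NullableBools`, `get`, and `replicate` are hypothetical and are not arrow APIs; only `bool::then` is a real standard-library method.

```rust
/// Toy nullable boolean column: `validity[i] == false` means row i is NULL.
struct NullableBools {
    values: Vec<bool>,
    validity: Vec<bool>,
}

impl NullableBools {
    fn is_valid(&self, i: usize) -> bool {
        self.validity[i]
    }

    fn value(&self, i: usize) -> bool {
        self.values[i]
    }

    /// Optional getter built from the two accessors above.
    fn get(&self, i: usize) -> Option<bool> {
        self.is_valid(i).then(|| self.value(i))
    }
}

/// Replicates a single-row result across `num_rows` rows, as the
/// literal-vs-literal case does for the whole batch.
fn replicate(single: &NullableBools, num_rows: usize) -> Vec<Option<bool>> {
    vec![single.get(0); num_rows]
}
```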

scovich · 2025-02-04T15:49:30Z

kernel/src/engine/arrow_expression.rs

+            Ok(wrap_comparison_result(arr))
+        }
+        (Literal(lit), Literal(Scalar::Array(ad))) => {
+            let res = is_in_list(ad, Some(Some(lit.clone())));


scovich · 2025-02-04T15:52:30Z

kernel/src/engine/arrow_expression.rs

+// helper function to make arrow in_list* kernel results comliant with SQL NULL semantics.
+// Specifically, if an item is not found in the in-list, but the in-list contains NULLs, the
+// result should be NULL (UNKNOWN) as well.
+fn fix_in_list_result(


nit: Perhaps more self-documenting as fix_arrow_in_list_result?

scovich · 2025-02-04T15:53:17Z

kernel/src/engine/arrow_expression.rs

+        .collect()
+}
+
+// helper function to make arrow in_list* kernel results comliant with SQL NULL semantics.


Suggested change

// helper function to make arrow in_list* kernel results comliant with SQL NULL semantics.

// helper function to make arrow in_list* kernel results compliant with SQL NULL semantics.

or

Suggested change

// helper function to make arrow in_list* kernel results comliant with SQL NULL semantics.

// helper function to make arrow in_list* kernel results comply with SQL NULL semantics.

scovich · 2025-02-04T15:55:04Z

kernel/src/engine/arrow_expression.rs

+// Specifically, if an item is not found in the in-list, but the in-list contains NULLs, the
+// result should be NULL (UNKNOWN) as well.


Suggested change

// Specifically, if an item is not found in the in-list, but the in-list contains NULLs, the
// result should be NULL (UNKNOWN) as well.

// Specifically, if an item is not found in an in-list that contains NULLs, the
// result should be NULL (UNKNOWN) instead of FALSE.

scovich · 2025-02-04T16:11:05Z

kernel/src/engine/arrow_expression.rs

@@ -352,6 +305,219 @@ fn evaluate_expression(
    }
 }

+fn eval_in_list(


This method makes a very narrow waist now; should we consider moving it (along with the helpers it invokes) to an inlist sub-module of their own?

The idea came up because I noticed our unit tests are overly high-level -- they test the overall expression eval rather than these specific methods. Ideally, we would have three flavors of unit tests:

One for fix_in_list_result that verifies its documented behavior wrt true/false/null

Another for is_in_list that can work with slices of scalars instead of needing to build record batches (and which only needs to work with 1-2 types because it focuses primarily on null handling semantics)

A final test that covers eval_in_list. It does need to work with expressions and record batches, but just needs a quick case for each type to verify that the wiring is correct, since the actual in-list behavior is covered by the other two (sets of) unit tests.

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

scovich

It's really hard to review a PR that moves unrelated code between files while also making changes. Can we either keep arrow_expression.rs as the top-level module (legal), or do the file rename as a pre-factor PR that merges first?

scovich · 2025-02-05T00:23:17Z

kernel/src/engine/arrow_expression/in_list.rs

+            let arr = match (column.data_type(), data_type) {
+                    (ArrowDataType::Utf8, PrimitiveType::String) => is_in_list(


nit: weirdly deep indent?
(we really need that cargo fmt fix to merge...)

Aside from the mod in macro thing, it seems we were also hitting rust-lang/rustfmt#3863. Shortened the violating string a bit and it seems cargo fmt is working on in_list.rs now.

Wow, nasty bug there... a bit surprised it's gone unfixed for so many years now.

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap · 2025-02-05T12:06:14Z

It's really hard to review a PR that moves unrelated code between files while also making changes.

Sorry about that! Reverted to previous layout. Right now we do have a nice mix of modules using the "mod" pattern, and modules, that use an upper level file. Maybe we should reconcile that?

While writing more tests, I realized we were missing some type coverage, as well as handling of List vs. LargeList. While covering all permutations would be yet another bigger chunk of work, I tried to cover all cases with "reasonable" effort.

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

scovich · 2025-02-05T14:11:18Z

Right now we do have a nice mix of modules using the "mod" pattern, and modules, that use an upper level file. Maybe we should reconcile that?

Yes, I agree it would be good to harmonize our file naming. Should we go with mod.rs or modname.rs tho? Part of me doesn't love a zillion "mod.rs" files (which mod was it again?), but in practice my editor adds the directory it's in to disambiguate so it's not really a problem?

scovich

This is looking really good.

The improvements to the code structure made it easier to see some significant opportunities for additional improvements (deduplication of code, more functionality, etc). Some of those belong in this PR, while others should probably be handled as a follow-on PR.

scovich · 2025-02-05T14:11:52Z

kernel/src/engine/arrow_conversion.rs

@@ -208,7 +208,7 @@ impl TryFrom<&ArrowDataType> for DataType {
            ArrowDataType::Date64 => Ok(DataType::DATE),
            ArrowDataType::Timestamp(TimeUnit::Microsecond, None) => Ok(DataType::TIMESTAMP_NTZ),
            ArrowDataType::Timestamp(TimeUnit::Microsecond, Some(tz))
-                if tz.eq_ignore_ascii_case("utc") =>
+                if tz.eq_ignore_ascii_case("utc") || tz.eq_ignore_ascii_case("+00:00") =>


Did we ever get an answer for this?

scovich · 2025-02-05T14:17:44Z

kernel/src/engine/arrow_expression/in_list.rs

+
+pub(super) fn eval_in_list(


nit: worth a doc comment explaining the approach?

scovich · 2025-02-05T14:18:20Z

kernel/src/engine/arrow_expression/in_list.rs

+
+macro_rules! prim_array_cmp {


Worth a doc comment explaining what this macro is for and how it works?

Honestly, looking at this macro vs. the new code, I'm not sure we come out ahead with the macro? Maybe it's better (as a follow-up) to take the same approach with both lit-col and col-array cases?

scovich · 2025-02-05T14:19:27Z

kernel/src/engine/arrow_expression/in_list.rs

+    batch: &RecordBatch,
+    left: &Expression,
+    right: &Expression,
+) -> Result<Arc<dyn Array>, Error> {


nit

Suggested change

) -> Result<Arc<dyn Array>, Error> {

) -> Result<ArrayRef, Error> {

scovich · 2025-02-05T14:30:39Z

kernel/src/engine/arrow_expression/in_list.rs

+    match (left, right) {
+        (Literal(lit), Column(_)) => {
+            if lit.is_null() {
+                return Ok(Arc::new(BooleanArray::new_null(batch.num_rows())));
+            }
+


A few things --

tiny: IMO it's harder to read a mix of match arms and if/else vs. just adding a second match arm, if the latter is otherwise sensible.

small: Given how two of the match arms are really large, and the others are really small, would it make sense to move the (Literal(lit), Literal(Scalar::Array(ad))) case here as well, for easier finding+reading?

medium : We can widen the NULL-search optimization match arm to cover both columns and literal arrays.

medium: once the null-check is factored out as its own match arm (see above), the existing implementation of the literal-column case trivially covers the (currently missing) column-column case as well (because we use evaluate_expression to create the input array we operate on). Just need to adjust the pattern we match on.

bigger: Every match arm duplicates Ok(Arc::new(...)), and the wrap_comparison_result helper doesn't actually help because it's more typing and adds a layer of indirection. Recommend to strip all that away and just have the match return a BooleanArray that we can wrap once and return.

Suggested change

match (left, right) {
    (Literal(lit), Column(_)) => {
        if lit.is_null() {
            return Ok(Arc::new(BooleanArray::new_null(batch.num_rows())));
        }

let result = match (left, right) {
    (Literal(Scalar::Null(_)), Column(_) | Literal(Scalar::Array(_))) => {
        // Searching any in-list for NULL always returns NULL -- no need to actually search
        BooleanArray::new_null(batch.num_rows())
    }
    (Literal(lit), Literal(Scalar::Array(ad))) => {
        // Search the literal in-list once and then replicate the returned single-row result
        let exists = is_in_list(ad, Some(Some(lit.clone()))).iter().next();
        BooleanArray::from(vec![exists; batch.num_rows()])
    }
    (Literal(_) | Column(_), Column(_)) => {

and then the match ends with:

    (Column(name), Literal(Scalar::Array(ad))) => {
        ...
        // safety: as_* methods on arrow arrays panic if the wrong type is requested,
        // so we always verify the data type first.
        match (column.data_type(), data_type) {
            ...
        }
    }
    (l, r) => return Err(Error::invalid_expression(format!(
        "Invalid right value for (NOT) IN comparison, left is: {l} right is: {r}"
    ))),
};
Ok(Arc::new(result))

scovich · 2025-02-05T14:56:38Z

kernel/src/engine/arrow_expression/in_list.rs

+                (ArrowDataType::Interval(IntervalUnit::YearMonth), IntervalYearMonthType),
+                (ArrowDataType::Interval(IntervalUnit::MonthDayNano), IntervalMonthDayNanoType),
+                (ArrowDataType::Decimal128(_, _), Decimal128Type),
+                (ArrowDataType::Decimal256(_, _), Decimal256Type)


Delta doesn't support 256-bit decimals. The spec says

The precision and scale can be up to 38

... which corresponds to a 128-bit decimal.
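
A quick arithmetic check of that correspondence, as a sketch: the largest unscaled value at precision `p` is `10^p - 1`, and `p = 38` is the last precision that fits in a signed 128-bit integer. The helper name `max_unscaled_i128` is made up for illustration.

```rust
/// Largest unscaled value a decimal of the given precision can hold: 10^precision - 1.
/// Returns `None` if 10^precision does not fit in a signed 128-bit integer.
/// (Hypothetical helper, for illustration only.)
fn max_unscaled_i128(precision: u32) -> Option<i128> {
    10i128.checked_pow(precision).map(|p| p - 1)
}
```

With this helper, precision 38 yields a value while precision 39 overflows, which is why a 256-bit arm is not needed for Delta decimals.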

scovich · 2025-02-05T16:22:30Z

kernel/src/engine/arrow_expression/in_list.rs

+            let arr = match (column.data_type(), data_type) {
+                (ArrowDataType::Utf8, PrimitiveType::String) => {
+                    is_in_list(ad, str_op(column.as_string::<i32>()))
+                }
+                (ArrowDataType::LargeUtf8, PrimitiveType::String) => {
+                    is_in_list(ad, str_op(column.as_string::<i64>()))
+                }
+                (ArrowDataType::Utf8View, PrimitiveType::String) => {
+                    is_in_list(ad, str_op(column.as_string_view()))
+                }
+                (ArrowDataType::Int8, PrimitiveType::Byte) => {
+                    is_in_list(ad, op::<Int8Type>(&column, Scalar::from))
+                }


aside: This approach does not fully eliminate the risk of a panic, if a flaw in the code produced a mismatch between the data type we manually check for and the cast we later attempt. I'd love it if we could use as_xxx_opt methods instead, if it didn't completely bloat up the code.

Maybe a small non-hygienic column_as! macro could help?

macro magic

See this playground test

macro_rules! column_as {
    ($what: ident $(:: < $t: ty >)? ) => {
        paste! { column.[<as_ $what _opt>] $(::<$t>)? () }
            .ok_or(Error::invalid_expression(format!(
                "Cannot cast {} to {}",
                column.data_type(),
                data_type
            )))
    };
}

(we'd need to upgrade the paste crate from dev-dependency to full dependency)

That would allow:

Suggested change

-            let arr = match (column.data_type(), data_type) {
-                (ArrowDataType::Utf8, PrimitiveType::String) => {
-                    is_in_list(ad, str_op(column.as_string::<i32>()))
-                }
-                (ArrowDataType::LargeUtf8, PrimitiveType::String) => {
-                    is_in_list(ad, str_op(column.as_string::<i64>()))
-                }
-                (ArrowDataType::Utf8View, PrimitiveType::String) => {
-                    is_in_list(ad, str_op(column.as_string_view()))
-                }
-                (ArrowDataType::Int8, PrimitiveType::Byte) => {
-                    is_in_list(ad, op::<Int8Type>(&column, Scalar::from))
-                }
+            let arr = match (column.data_type(), data_type) {
+                (ArrowDataType::Utf8, PrimitiveType::String) => {
+                    is_in_list(ad, scalars_from(column_as!(string::<i32>)?))
+                }
+                (ArrowDataType::LargeUtf8, PrimitiveType::String) => {
+                    is_in_list(ad, scalars_from(column_as!(string::<i64>)?))
+                }
+                (ArrowDataType::Utf8View, PrimitiveType::String) => {
+                    is_in_list(ad, scalars_from(column_as!(string_view)?))
+                }
+                (_, PrimitiveType::Byte) => {
+                    is_in_list(ad, scalars_from(column_as!(primitive::<Int8Type>)?))
+                }

Match arms can ignore the arrow data type when there's a 1:1 relationship (ie for the numeric types), while still leveraging both types in ambiguous cases like string or binary or timestamp. Either way, the result of the column_as! invocation is the ultimate decider of whether the match succeeded.

The str_op and binary_op helpers would be replaced by a single generic scalars_from that takes an iterator of impl Into<Scalar>. Which would also capture e.g. the boolean case that currently is a bit awkward.

The special cases like Date and Timestamp[Ntz], which can't use Scalar::from, would still need today's op (tho maybe we should rename it as to_scalars for clarity):

(ArrowDataType::Date32, PrimitiveType::Date) => {
    is_in_list(ad, to_scalars(column_as!(primitive::<Date32Type>)?, Scalar::Date))
}

This is a big enough change (and probably relates strongly to the recommendation to harmonize or eliminate the prim_array_cmp! macro) that we should probably tackle both at the same time in a follow-up PR.
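
To make the split between the two helpers concrete, here is a rough standalone sketch. The `Scalar` enum below is a hypothetical stand-in with only a few variants, and the signatures are simplified relative to whatever the kernel ends up with; the point is just that `scalars_from` covers every element type with an `Into<Scalar>` impl, while `to_scalars` takes an explicit constructor for types like dates that share a physical type with another variant.

```rust
/// Hypothetical stand-in for the kernel's `Scalar`; only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Integer(i32),
    Boolean(bool),
    String(String),
    Date(i32),
}

impl From<i32> for Scalar {
    fn from(v: i32) -> Self {
        Scalar::Integer(v)
    }
}

impl From<bool> for Scalar {
    fn from(v: bool) -> Self {
        Scalar::Boolean(v)
    }
}

impl From<&str> for Scalar {
    fn from(v: &str) -> Self {
        Scalar::String(v.to_owned())
    }
}

/// Generic path: works for any element type that converts via `Into<Scalar>`.
fn scalars_from<T: Into<Scalar>>(
    values: impl IntoIterator<Item = Option<T>>,
) -> impl Iterator<Item = Option<Scalar>> {
    values.into_iter().map(|v| v.map(Into::into))
}

/// Special-case path: the caller picks the variant, e.g. `Scalar::Date` for an `i32`
/// that must not be mistaken for a plain integer.
fn to_scalars<T>(
    values: impl IntoIterator<Item = Option<T>>,
    wrap: impl Fn(T) -> Scalar,
) -> impl Iterator<Item = Option<Scalar>> {
    values.into_iter().map(move |v| v.map(&wrap))
}
```

Strings, integers and booleans all go through `scalars_from`, so the separate `str_op` and `binary_op` helpers and the boolean special case would no longer be needed in this shape.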

roeap · 2025-02-06T14:24:12Z

@scovich - made changes according to feedback. The first new commit should cover the changes that should land in this PR. In the second one I gave refactoring the macros a try; after that it's just small stuff. In case we feel the macro refactoring needs (considerably) more work, we can move that into a separate PR.

roeap · 2025-02-06T14:29:42Z

kernel/src/engine/arrow_expression/in_list.rs

+            // we should at least cast string / large string arrays to the same type, and may as well see if
+            // we can cast other types.
+            let left_arr = if left_arr.data_type() == list_field.data_type() {
+                left_arr
+            } else {
+                cast(left_arr.as_ref(), list_field.data_type()).map_err(Error::generic_err)?
+            };


While this is technically not really the right place to cast values, it may actually help us when we start to support type widening. Then again, there, values should really be cast when reading the data.

Nonetheless, it makes the code simpler if we always try to cast, and I thought it's OK if we do this here?

scovich

Looking really good. The only blocker at this point is the question about narrowing casts.

scovich · 2025-02-06T14:45:33Z

kernel/src/engine/arrow_expression/in_list.rs

+            // we should at least cast string / large string arrays to the same type, and may as well see if
+            // we can cast other types.
+            let left_arr = if left_arr.data_type() == list_field.data_type() {
+                left_arr
+            } else {
+                cast(left_arr.as_ref(), list_field.data_type()).map_err(Error::generic_err)?
+            };


This relates to the way arrow has multiple physical types for the same logical type? And things like type widening would just come "for free" as a side effect?

Does arrow cast allow narrowing casts tho? If it does, then the comparison could produce wrong results due to information loss. IIRC our type widening code had to actively verify it was a widening cast before attempting the actual cast? Either way, to really get this right we'd need to identify the "wider" of the two sides and cast the other side to match it.
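
To illustrate the concern with plain integers (a sketch only: how arrow's `cast` kernel treats out-of-range values is not shown here, and both function names are made up), consider an `i64` column value checked against an `i32` in-list. Narrowing the column side can manufacture a match that does not exist, while widening the list side preserves equality.

```rust
/// Narrowing the wider side: `as` keeps only the low 32 bits, so
/// 4_294_967_301 (2^32 + 5) becomes 5 and "matches" a list containing 5.
fn contains_narrowed(list: &[i32], value: i64) -> bool {
    list.contains(&(value as i32))
}

/// Widening the narrower side: i32 -> i64 is lossless, so equality is preserved.
fn contains_widened(list: &[i32], value: i64) -> bool {
    list.iter().any(|&item| i64::from(item) == value)
}
```

Whatever a cast kernel does on overflow (wrap, null out, or error), only the widening direction is guaranteed to keep the comparison exact, which is why the wider of the two sides has to be identified first.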

scovich · 2025-02-06T15:45:27Z

kernel/src/engine/arrow_expression/in_list.rs

+            fn scalars_from<'a>(
+                values: impl IntoIterator<Item = Option<impl Into<Scalar>>> + 'a,
+            ) -> impl IntoIterator<Item = Option<Scalar>> + 'a {
+                values.into_iter().map(|v| v.map(|v| v.into()))


nit: one fewer v to get confused with?

Suggested change

-                values.into_iter().map(|v| v.map(|v| v.into()))
+                values.into_iter().map(|v| v.map(Into::into))

scovich · 2025-02-06T16:56:28Z

kernel/src/engine/arrow_expression/in_list.rs

+        (Literal(_) | Column(_), Column(_)) => {
+            let right_arr = evaluate_expression(right, batch, None)?;


nit: Rule of 30 says we should split out these two really fat match arms (~60 and ~100 LoC, respectively) as their own helper functions. Besides improving the readability of the code structure, removing two levels of indentation might convince cargo fmt to be less aggressive with line breaks.

It would also allow cleaner control flow within the logic, because e.g. ? is scoped to just that match arm's body instead of the entire match (see e.g. all those eval_arrow(...)? calls in the match that starts at L71 below, and the return Err for the default case of that same match).
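
A toy sketch of that structure, not the kernel code: the `Expr` enum, the row type and the helper names are all invented for illustration. Each fat arm becomes a helper whose `?` and early returns exit only the helper, and the match itself shrinks to one line per arm with a single `?` at the end.

```rust
use std::collections::HashMap;

/// Invented miniature expression type, for illustration only.
enum Expr {
    Literal(i64),
    Column(String),
}

/// Helper for the literal arm.
fn literal_in_list(value: i64, list: &[i64]) -> Result<bool, String> {
    Ok(list.contains(&value))
}

/// Helper for the column arm: `?` here exits only this helper.
fn column_in_list(
    name: &str,
    list: &[i64],
    row: &HashMap<String, i64>,
) -> Result<bool, String> {
    let value = row
        .get(name)
        .ok_or_else(|| format!("unknown column: {name}"))?;
    Ok(list.contains(value))
}

/// The match is now a dispatcher; errors from either helper surface at one `?`.
fn eval_in_list(
    left: &Expr,
    list: &[i64],
    row: &HashMap<String, i64>,
) -> Result<bool, String> {
    let found = match left {
        Expr::Literal(v) => literal_in_list(*v, list),
        Expr::Column(name) => column_in_list(name, list, row),
    }?;
    Ok(found)
}
```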

scovich · 2025-02-06T17:02:24Z

kernel/src/engine/arrow_expression/in_list.rs

+                ) => is_in_list(
+                    ad,
+                    to_scalars(
+                        column_as!(primitive::<TimestampMicrosecondType>)?,
+                        Scalar::Timestamp,
+                    ),
+                ),


nit: not sure it's worth the trouble, but could do

Suggested change

-                ) => is_in_list(
-                    ad,
-                    to_scalars(
-                        column_as!(primitive::<TimestampMicrosecondType>)?,
-                        Scalar::Timestamp,
-                    ),
-                ),
+                ) => {
+                    let column = column_as!(primitive::<TimestampMicrosecondType>)?;
+                    is_in_list(ad, to_scalars(column, Scalar::Timestamp))
+                }

(again below)

scovich · 2025-02-06T17:05:22Z

kernel/src/engine/arrow_expression/in_list.rs

+                ($t: ty ) => {
+                    left_arr
+                        .as_primitive_opt::<$t>()
+                        .ok_or(Error::invalid_expression(format!(


Suggested change

-                        .ok_or(Error::invalid_expression(format!(
+                        .ok_or_else(|| Error::invalid_expression(format!(

(otherwise we're creating the error eagerly and just throwing it away most of the time)
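
A small standalone sketch of the difference, using a call counter in place of the real `format!`-based error. `make_error`, `eager` and `lazy` are hypothetical names; `Option::ok_or` and `Option::ok_or_else` are the standard library methods.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an expensive error constructor such as `format!`.
/// It bumps `calls` so the number of constructions can be observed.
fn make_error(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    String::from("cannot cast column")
}

/// Eager: the argument is evaluated before `ok_or` runs, even when `value` is `Some`.
fn eager(value: Option<i32>, calls: &Cell<u32>) -> Result<i32, String> {
    value.ok_or(make_error(calls))
}

/// Lazy: the closure only runs when `value` is `None`.
fn lazy(value: Option<i32>, calls: &Cell<u32>) -> Result<i32, String> {
    value.ok_or_else(|| make_error(calls))
}
```

On the success path `eager` still pays for one error construction while `lazy` pays for none, which matters here because the downcast succeeds for nearly every batch.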

(again below)

github-actions bot assigned roeap Jan 18, 2025

roeap force-pushed the feat/col-in-arr branch 2 times, most recently from 007b4e2 to 6977db9 Compare January 18, 2025 16:19

feat: support 'col IN (a, b, c)' type expressions

28a5648

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap force-pushed the feat/col-in-arr branch from 6977db9 to 28a5648 Compare January 18, 2025 16:22

roeap requested review from nicklan, scovich, zachschuermann and OussamaSaoudi January 18, 2025 16:32

scovich reviewed Jan 21, 2025

roeap added 4 commits January 24, 2025 21:32

Merge branch 'main' into feat/col-in-arr

d6e3730

fix: PR feedback

848ef11

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

chore: clippy

290d65d

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

chore: fmt

def21c1

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap requested a review from scovich January 24, 2025 23:55

scovich reviewed Jan 25, 2025

roeap added 3 commits January 30, 2025 09:56

fix: null handling for in-list

6b959eb

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

Merge branch 'main' into feat/col-in-arr

4e3e92c

fix: simplify partial_cmp impl

ba3b4e9

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

scovich reviewed Jan 30, 2025

roeap added 2 commits February 2, 2025 14:47

fix: partial_cmp for decimals via rust_decimal

233c0e8

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: in-list null handling

6c813e9

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap requested a review from scovich February 2, 2025 17:38

scovich reviewed Feb 4, 2025

roeap added 3 commits February 4, 2025 13:33

chore: revert partialeq releated changes

88fd54c

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

chore: cleanup in-list implementation

bff7143

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

chore: clippy

edeacfc

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap requested a review from scovich February 4, 2025 15:00

roeap added 2 commits February 4, 2025 16:26

Merge branch 'main' into feat/col-in-arr

fcdf565

test: add test for in-list false case

fd58d70

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

scovich reviewed Feb 4, 2025

roeap added 2 commits February 4, 2025 20:46

refactor: split up arrow_expression module

511b451

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

chore: cleanup

00b3002

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap requested a review from scovich February 4, 2025 20:36

chore: clippy

7d9abd4

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

scovich reviewed Feb 5, 2025

roeap added 3 commits February 5, 2025 08:58

refactor: move expression mod.rs back into top level file

1062b44

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: make cargo fmt work on in-list file

8ab65b2

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: extend arrow type coverage and tests

f678e7e

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap requested a review from scovich February 5, 2025 12:28

chore: simplify

1ec75e0

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

Merge branch 'main' into feat/col-in-arr

3f2e567

scovich reviewed Feb 5, 2025

roeap and others added 6 commits February 5, 2025 22:24

Merge branch 'main' into feat/col-in-arr

58349f2

fix: simplify based on PR feedback

7869610

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

refactor: remove prim_array_cmp macro

46c84f8

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

fix: error message test

eb29de8

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

Merge branch 'main' into feat/col-in-arr

2d04413

fix: clippy

41030fc

Signed-off-by: Robert Pack <robstar.pack@gmail.com>

roeap requested a review from scovich February 6, 2025 14:21

scovich reviewed Feb 6, 2025

scovich mentioned this pull request Feb 7, 2025

[WIP] Start prototyping support for opaque engine expressions #686

Draft

inlist.iter().map(|k| v.as_ref().map(|vv| vv == k)),
inlist.iter().map(Some(Scalar::partial_cmp(v?, k?)? == Ordering::Equal)),

return Ok(Arc::new(BooleanArray::from(vec![None; batch.num_rows()])));
return Ok(Arc::new(BooleanArray::new_null(batch.num_rows())));

ad,op::<Int8Type>( &column, Scalar::from)
ad, op::<Int8Type>(&column, Scalar::from)

// helper function to make arrow in_list* kernel results comliant with SQL NULL semantics.
// helper function to make arrow in_list* kernel results compliant with SQL NULL semantics.

// Specifically, if an item is not found in the in-list, but the in-list contains NULLs, the
// result should be NULL (UNKNOWN) as well.

let arr = match (column.data_type(), data_type) {
(ArrowDataType::Utf8, PrimitiveType::String) => is_in_list(

) -> Result<Arc<dyn Array>, Error> {
) -> Result<ArrayRef, Error> {

values.into_iter().map(|v| v.map(|v| v.into()))
values.into_iter().map(|v| v.map(Into::into))

(Literal(_) | Column(_), Column(_)) => {
let right_arr = evaluate_expression(right, batch, None)?;

.ok_or(Error::invalid_expression(format!(
.ok_or_else(|| Error::invalid_expression(format!(
