PhysicalExpr Orderings with Range Information #10504

berkaysynnada · 2024-05-14T12:55:43Z

Which issue does this PR close?

Closes #9832.

Rationale for this change

There is a bug in CastExpr orderings, as detailed in the issue. The direct fix is to add the DataType information of the children to the get_ordering method of PhysicalExpr. However, the problem can be solved in a more general way that also offers more functionality.
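The linked issue has the specifics of the bug. As a generic illustration of why an ordering cannot always be carried through a cast without knowing the data types involved, here is a minimal sketch in plain Rust. It does not use DataFusion; `cast_to_i64` is a hypothetical stand-in for casting a string column to an integer column.

```rust
/// Returns true if the slice is sorted in non-decreasing order.
fn is_sorted_asc<T: PartialOrd>(values: &[T]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}

/// Hypothetical stand-in for `CAST(col AS BIGINT)` on a string column.
/// Panics on non-numeric input, which is acceptable for a sketch.
fn cast_to_i64(values: &[&str]) -> Vec<i64> {
    values
        .iter()
        .map(|s| s.parse::<i64>().expect("numeric string"))
        .collect()
}

/// Stand-in for a widening numeric cast, which does preserve ordering.
fn widen(values: &[i32]) -> Vec<i64> {
    values.iter().map(|&v| i64::from(v)).collect()
}
```

The column `["10", "9"]` is sorted as strings (lexicographic comparison) but not after `cast_to_i64`, whereas `widen` keeps any sorted input sorted. Whether a cast preserves order therefore depends on the source and target types.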

I propose extending the get_ordering method into get_properties so that it also carries the range information of expressions, which can be important when investigating their ordering.

Having range information helps ScalarFunctionExpr order calculations, since many scalar functions are monotonic only on certain intervals. I have given an example of this for the ABS() function. The CastExpr bug is also solved, since the range carries the datatype information.
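To make the ABS() case concrete, here is a minimal sketch of deriving an output order from the input order plus its range. The `Order` and `Range` types are simplified, hypothetical stand-ins and not the types used in this PR.

```rust
/// Simplified stand-in for sort properties (hypothetical, not DataFusion's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Order {
    Increasing,
    Decreasing,
    Unordered,
}

/// Closed range [lower, upper] that an expression's values are known to lie in.
#[derive(Debug, Clone, Copy)]
struct Range {
    lower: i64,
    upper: i64,
}

/// Order of `abs(x)` given the order and the range of `x`.
fn abs_order(input: Order, range: Range) -> Order {
    if range.lower >= 0 {
        // abs is the identity on non-negative inputs: the order is preserved.
        input
    } else if range.upper <= 0 {
        // abs negates non-positive inputs: the order is reversed.
        match input {
            Order::Increasing => Order::Decreasing,
            Order::Decreasing => Order::Increasing,
            Order::Unordered => Order::Unordered,
        }
    } else {
        // The range straddles zero, so abs is not monotonic over it.
        Order::Unordered
    }
}
```

Without the range, the only safe answer for `abs` would be `Unordered` in every case.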

What changes are included in this PR?

- Function monotonicity in scalar functions should not be set via a boolean vector describing the effect of each parameter on function monotonicity. Instead, it should be hardcoded in the function bodies, as most functions have unique monotonicity patterns and need to decide the order based on their input properties. (I have updated the existing monotonicity calculations and added some more, but there are still TODOs.)
- To be able to use ScalarFunctionExpr in interval arithmetic, ScalarUDF will also have evaluate_bounds() and propagate_constraints() methods.
- Minor updates in interval_arithmetic.rs:

  - Storing unbounded unsigned integers as [0 - Null]
  - OR support for boolean intervals
  - Arithmetic negation of intervals
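The three interval items above can be sketched as follows. This is an illustration with hypothetical types, not the code in interval_arithmetic.rs; it assumes that a Null bound means "unbounded on that side" and models it as `None`.

```rust
/// Hypothetical interval with optional bounds; `None` marks an unbounded side.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval<T> {
    lower: Option<T>,
    upper: Option<T>,
}

/// An unbounded unsigned interval: unsigned values can never go below 0,
/// so the lower bound is 0 rather than "unknown".
fn unbounded_unsigned() -> Interval<u64> {
    Interval { lower: Some(0), upper: None }
}

/// Arithmetic negation: -[a, b] = [-b, -a].
/// A bound whose negation overflows is widened to unbounded.
fn negate(iv: Interval<i64>) -> Interval<i64> {
    Interval {
        lower: iv.upper.and_then(i64::checked_neg),
        upper: iv.lower.and_then(i64::checked_neg),
    }
}

/// Boolean interval with false < true:
/// [false, false] = certainly false, [true, true] = certainly true,
/// [false, true] = uncertain.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BoolInterval {
    lower: bool,
    upper: bool,
}

/// OR is monotonic in both arguments, so it can be applied bound by bound.
fn or_intervals(a: BoolInterval, b: BoolInterval) -> BoolInterval {
    BoolInterval {
        lower: a.lower || b.lower,
        upper: a.upper || b.upper,
    }
}
```

For example, negating `[0, unbounded]` yields `[unbounded, 0]`, and OR-ing a certainly-true interval with an uncertain one yields certainly true.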

Are these changes tested?

Yes, in order.slt

Are there any user-facing changes?

Update configs.md

alamb · 2024-05-14T19:56:55Z

I haven't had a chance to review this PR yet @berkaysynnada -- I wonder if you have seen the API in #10117 from @tinfoil-knight

datafusion/expr/src/udf.rs

berkaysynnada · 2024-05-15T07:01:02Z

> I haven't had a chance to review this PR yet @berkaysynnada -- I wonder if you have seen the API in #10117 from @tinfoil-knight

Yes, I have. It is becoming a nicer and more useful API for the current version of monotonicity. However, I believe we will eventually need to carry this range data along with the expressions to be able to perform deeper analyses and optimizations.

My initial motivation was just to solve the CastExpr bug, but then I realized that this bug is part of a larger need. I am, of course, open to iterating over this PR.

ozankabak · 2024-05-15T13:47:57Z

~~That PR and this are orthogonal. The work in this PR will simply inherit/benefit from the refactor in the other PR.~~

@berkaysynnada kindly reminded me that the type undergoing the refactor disappears in this extended formulation. In that case this PR may supersede the refactor one.

I will review this one in detail tomorrow and expand further.

berkaysynnada · 2024-05-15T13:58:20Z

datafusion/core/src/physical_optimizer/enforce_distribution.rs

+            ("a".to_string(), "a".to_string()),
+            ("b".to_string(), "b".to_string()),
+            ("c".to_string(), "c".to_string()),
+        ];


A test bug showed up here; this change fixes it.

alamb · 2024-05-15T17:45:36Z

> @berkaysynnada kindly reminded me that the type undergoing the refactor disappears in this extended formulation. In that case this PR may supersede the refactor one.

Indeed, though I still see

    /// Calculates the [`SortProperties`] of this function based on its children's properties.
    fn monotonicity(&self, _input: &[ExprProperties]) -> Result<SortProperties> {
        Ok(SortProperties::Unordered)
    }

in this PR.

alamb · 2024-05-15T17:46:57Z

> Having range information would help ScalarFunctionExpr's order calculations since many of them have monotonicity pattern on some defined intervals. I have given an example of it for ABS() function.

This is a pretty clever idea. I think as long as we ensure that it is easy to implement the common patterns of monotonically increasing and monotonically decreasing, adding a more general API is great.

ozankabak

I reviewed this carefully and sent two commits improving on it. This is a fantastic PR that kills two birds with one stone:

1. It fixes the CAST bug.
2. It enables us to perform more optimizations w.r.t. monotonicity by utilizing range information (and obviating the need for a type like FuncMonotonicity in the process).

I only have a few asks and then we can merge this:

1. The monotonicity definition example in advanced_udf.rs seems to be removed. If possible, let's add it back in to serve as an example to users.
2. Similar comment for function_factory.rs.
3. If all the inputs of a function are Singleton, should the output SortProperties also be Singleton? If so, maybe we can do this check once before calling individual monotonicity functions.

@alamb, it'd be good if you could take a quick look before we merge (but not necessary as I reviewed carefully). FYI, defining an "all-increasing" function becomes a one-liner in this approach, so your point is incorporated in this design as well.
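As a rough idea of what such a one-liner could look like, here is a sketch for unary functions. `Order` and `Props` are hypothetical simplifications (the PR's ExprProperties also carries a range, omitted here), so this shows the shape of the idea rather than the actual API.

```rust
/// Simplified stand-in for sort properties (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Order {
    Increasing,
    Decreasing,
    Unordered,
    /// A single constant value, trivially ordered either way.
    Singleton,
}

/// Hypothetical per-argument properties.
#[derive(Debug, Clone, Copy)]
struct Props {
    order: Order,
}

/// A unary function that is increasing everywhere (e.g. `exp`) simply
/// forwards the order of its argument: this is the one-liner.
fn increasing_unary(inputs: &[Props]) -> Order {
    inputs[0].order
}

/// A unary function that is decreasing everywhere flips the direction
/// and leaves `Unordered` and `Singleton` untouched.
fn decreasing_unary(inputs: &[Props]) -> Order {
    match inputs[0].order {
        Order::Increasing => Order::Decreasing,
        Order::Decreasing => Order::Increasing,
        other => other,
    }
}
```

Both helpers index `inputs[0]` and so assume the function is called with at least one argument.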

alamb · 2024-05-17T00:33:06Z

> @alamb, it'd be good if you could take a quick look before we merge (but not necessary as I reviewed carefully). FYI, defining an "all-increasing" function becomes a one-liner in this approach, so your point is incorporated in this design as well.

Sounds awesome -- I will try and do so tomorrow but if I don't get a chance feel free to merge it and I'll review afterwards. Thank you @ozankabak and @berkaysynnada

berkaysynnada · 2024-05-17T08:47:23Z

> 3. If all the inputs of a function are Singleton, should the output SortProperties also be Singleton? If so, maybe we can do this check once before calling individual monotonicity functions.

For most functions this is true, but there could be edge cases, such as time-dependent functions or functions that depend on randomly generated values. We discussed it and decided to keep it as it is. I have addressed the other minor issues, and this PR is ready to be merged.

ozankabak · 2024-05-17T11:07:57Z

I will go ahead and merge this and we will fix any issues with a quick follow-on PR in case @alamb discovers any when he has time to take a look.

alamb

Thanks @berkaysynnada -- I looked at the API a little bit more today, and I think we may want to consider renaming ScalarUDF::monotonicity to better reflect what it does

The overall idea of allowing scalar UDFs to be part of the interval/boundary analysis is really nice.

alamb · 2024-05-21T10:36:29Z

datafusion/expr/src/udf.rs

+
+    /// Calculates the [`SortProperties`] of this function based on its
+    /// children's properties.
+    fn monotonicity(&self, _inputs: &[ExprProperties]) -> Result<SortProperties> {


Before we release this PR, I wonder if we should call this function sort_properties or calculate_sort, rather than monotonicity.

Calling it monotonicity may cause upgrade pain, as the signature (types) changed from

    /// This function specifies monotonicity behaviors for User defined scalar functions.
    fn monotonicity(&self) -> Result<Option<FuncMonotonicity>> {
        Ok(None)
    }

It also seems like monotonicity doesn't accurately reflect what it does any more, though I may misunderstand

berkaysynnada · 2024-05-21

Thanks for the review. I can rename it as calculate_order if you think it is explanatory enough?

alamb · 2024-05-21

How about just order() or output_ordering() (to mirror ExecutionPlan::output_ordering)?

alamb · 2024-05-21T10:41:33Z

datafusion/functions/src/math/monotonicity.rs

+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::datatypes::DataType;


I didn't see many (any?) tests for this new code

Specifically, I am imagining tests like "if we broke / introduced a bug in one of these implementations would a test fail"

Did I miss them?

berkaysynnada · 2024-05-21

You are right. These functions were not tested with the old API, and they haven't been tested now either. Should we add unit tests for them in this file, or would it be better to cover them in the .slt tests?

alamb · 2024-05-21

Ideally I would recommend .slt tests (that show sorts/not sorts, for example), but I am not sure if you have sufficient bandwidth to do so. Maybe unit tests would be best.
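One possible shape for such unit tests is to compare the order an implementation claims with the real function's behavior on ascending sample inputs. The sketch below is an assumption about how this could be done and does not use the PR's actual helpers; `Order` and `respects_order` are hypothetical names.

```rust
/// Simplified stand-in for sort properties (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Order {
    Increasing,
    Decreasing,
    Unordered,
}

/// Returns true if `f` behaves as `claimed` on `samples`,
/// which must be given in ascending order.
fn respects_order(f: impl Fn(f64) -> f64, samples: &[f64], claimed: Order) -> bool {
    let outputs: Vec<f64> = samples.iter().map(|&x| f(x)).collect();
    match claimed {
        Order::Increasing => outputs.windows(2).all(|w| w[0] <= w[1]),
        Order::Decreasing => outputs.windows(2).all(|w| w[0] >= w[1]),
        // Claiming no order is always safe, though possibly pessimistic.
        Order::Unordered => true,
    }
}
```

A test would then assert, for instance, that `f64::abs` respects `Order::Decreasing` on `[-3.0, -2.0, -0.5]` and does not respect `Order::Increasing` there, so a flipped branch in the implementation makes the test fail.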

* Self review
* Fix null interval accumulation
* Refactor monotonicity
* Ignore failing tests
* Initial impl
* Ready for review
* Update properties.rs
* Update configs.md Update configs.md
* cargo doc
* Add abs test
* Update properties.rs
* Update udf.rs
* Review Part 1
* Review Part 2
* Minor

---------

Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>

berkaysynnada added 13 commits May 6, 2024 18:05

Self review

e2a791e

Fix null interval accumulation

6036e34

Refactor monotonicity

a960b13

Ignore failing tests

4c18de7

Initial impl

b1ec16d

Merge branch 'apache_main' into feature/expr-range

b866f73

Ready for review

befeb7f

Update properties.rs

9124a13

Merge branch 'apache_main' into feature/expr-range

bc1fefd

Update configs.md

1eb7a8e

Update configs.md

cargo doc

1071f60

Add abs test

7466620

Update properties.rs

d67b0af

github-actions bot added the logical-expr, physical-expr, core, and sqllogictest labels May 14, 2024

alamb reviewed May 14, 2024

Update udf.rs

903459a

berkaysynnada commented May 15, 2024

alamb mentioned this pull request May 15, 2024

DataFusion weekly project plan (Andrew Lamb) - May 13, 2024 #10482

Closed

8 tasks

ozankabak added 2 commits May 16, 2024 15:04

Review Part 1

7e2f91c

Review Part 2

cbdd33e

ozankabak approved these changes May 16, 2024

Minor

f0f3e9e

ozankabak merged commit d2fb05e into apache:main May 17, 2024
23 checks passed

alamb mentioned this pull request May 20, 2024

DataFusion weekly project plan (Andrew Lamb) - May 20, 2024 #10579

Closed

10 tasks

alamb reviewed May 21, 2024

This was referenced May 21, 2024

improve monotonicity api #10117

Closed

Request: Improve Monotoniciy API #9879

Closed

This was referenced May 21, 2024

Expand Test Coverage for ScalarUDF's #10595

Open

Rename monotonicity as output_ordering in ScalarUDF's #10596

Merged
