Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support FixedSizeList Type Coercion #9108

Merged
merged 21 commits into from
Feb 26, 2024

Conversation

Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Feb 2, 2024

Which issue does this PR close?

Follow #8902

Rationale for this change

For future maintenance, extend Array which specialized Signature for ArrayEmpty and similar functions

What changes are included in this PR?

  • The signatures of array functions with one parameter to use Signature::array

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Feb 2, 2024
@github-actions github-actions bot added the physical-expr Physical Expressions label Feb 2, 2024
@github-actions github-actions bot removed the physical-expr Physical Expressions label Feb 2, 2024
@github-actions github-actions bot added the physical-expr Physical Expressions label Feb 7, 2024
@Weijun-H Weijun-H force-pushed the type-coercion branch 2 times, most recently from 68f8d21 to 2f477b2 Compare February 8, 2024 02:54
Comment on lines 4340 to 4386
query error
select array_ndims(1)

query error
select array_ndims(null)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow PostgreSQL

@Weijun-H Weijun-H marked this pull request as ready for review February 8, 2024 03:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Weijun-H -- this is looking good. I have a question about the inner functions and the need for allow_null_coercion

@@ -117,7 +122,8 @@ pub enum TypeSignature {
/// is `OneOf(vec![Any(0), VariadicAny])`.
OneOf(Vec<TypeSignature>),
/// Specifies Signatures for array functions
ArraySignature(ArrayFunctionSignature),
/// Boolean value specifies whether null type coercion is allowed
ArraySignature(ArrayFunctionSignature, bool),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand this comment to explain what null type coercion is and when it would make sense notn to be allowed?

I don't understand what it is used for 🤔 it seems to me like null coercion should always be allowed

pub fn get_type_signature(
&self,
current_types: &[DataType],
allow_null_coercion: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are passing allow_null_coercion here, maybe it would make sense to have allow_null_coercion to be part of each array enum? Perhaps something like

impl ArrayFunctionSignature {
    ArrayAndElement(allow_null_coercion),
    ...
    Array(allow_null_coercion)
}

But again I am not sure about the need for allow_null_coercion 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow_null_coercion determines if the signature can allow a null type. For example, empty(NULL) accepts NULL while array_dims(Null) will return error

@@ -80,73 +78,6 @@ fn get_valid_types(
signature: &TypeSignature,
current_types: &[DataType],
) -> Result<Vec<Vec<DataType>>> {
fn array_append_or_prepend_valid_types(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- so one part of this PR is extracting this function into ArrayFunctionSignature ?

current_types: &[DataType],
allow_null_coercion: bool,
) -> Result<Vec<Vec<DataType>>> {
fn array_append_or_prepend_valid_types(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason to create these inner functions, rather than actual functions (or maybe just inline the body of the functions into the match statement below)?

I find the extra level of indenting somewhat confusing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only for the match statement below

datafusion/expr/src/signature.rs Show resolved Hide resolved
};

// We follow Postgres on `array_append(Null, T)`, which is not valid.
if array_type.eq(&DataType::Null) && !allow_null_coercion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need allow_null_coercion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is to determine if the signature could accept NULL

This comment was marked as outdated.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any array function's signature is array_and_element(true, )? If not, it is better not to introduce allow_null_coercion for array_and_element now.

Copy link
Member Author

@Weijun-H Weijun-H Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array_has , array_positions, array_remove and array_remove_allshould use it.

D SELECT list_contains(null, 1);
┌────────────────────────┐
│ list_contains(NULL, 1) │
│         int32          │
├────────────────────────┤
│                        │
└────────────────────────┘

D SELECT list_position(null, 1);
┌────────────────────────┐
│ list_position(NULL, 1) │
│         int32          │
├────────────────────────┤
│                        │
└────────────────────────┘

**Schema (PostgreSQL v15)**

---

**Query #1**

    SELECT array_remove(NULL, 2);

| array_remove |
| ------------ |
|              |

@jayzhan211

This comment was marked as outdated.

Ok(vec![vec![array_type]])
}
DataType::Null => Ok(vec![vec![array_type.clone()]]),
_ => Ok(vec![vec![DataType::List(Arc::new(Field::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return error if not List

@Weijun-H Weijun-H marked this pull request as draft February 17, 2024 11:59
@Weijun-H Weijun-H marked this pull request as ready for review February 21, 2024 11:15
@Weijun-H Weijun-H requested a review from jayzhan211 February 21, 2024 11:16
@@ -433,6 +435,7 @@ pub fn array_element(args: &[ArrayRef]) -> Result<ArrayRef> {
let indexes = as_int64_array(&args[1])?;
general_array_element::<i64>(array, indexes)
}
DataType::Null => Ok(args[0].clone()),
Copy link
Contributor

@jayzhan211 jayzhan211 Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to handle null here (any other array functions)? Is it like pre-checking? If it is, note that array_element(null, ) should be an error.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need functions like pre-checking, or validation for checking the input of the Array function when we move to UDF-based implementation 🤔 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think handling null here makes it confusing where it is nulls handled, is it handled incoercion step coerce signature or here (array_expressions). The test in array.slt only process once. We can only know it is handled correctly, but unsure which one does the job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think handling null here makes it confusing where it is nulls handled, is it handled incoercion step coerce signature or here (array_expressions). The test in array.slt only process once. We can only know it is handled correctly, but unsure which one does the job.

This is what I am considering now, in coercion step could filter if the function accepts the NULL type while in array_expressions it determines what the function should return @jayzhan211 .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in coercion step could filter if the function accepts the NULL type while in array_expressions it determines what the function should return

Does it mean we still need to do the null handling in array_expression? Take array_append for example, if we find null, we convert it to i64(None) and process it later on.

ideally, nulls should be handled in type coercion. And, array_expression can be simple that just focuses on processing the result and not handling type or signature-related issues.

If doing the checking again in array_expression is just because it is a public function, I think we should come out with a way that the checking can be somewhat reused in both places. Before that, maybe we just leave the checking in coercion step.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in coercion step could filter if the function accepts the NULL type while in array_expressions it determines what the function should return

Does it mean we still need to do the null handling in array_expression? Take array_append for example, if we find null, we convert it to i64(None) and process it later on.

ideally, nulls should be handled in type coercion. And, array_expression can be simple that just focuses on processing the result and not handling type or signature-related issues.

It would be ideal if we could handle null in one place. However, array functions have different results for NULL. For instance, array_append(null, 1) returns an error, whereas array_has(null, 1) returns null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that in postgres, array_append(null, 1) return {1}, array_append(null, ARRAY[1]) is error. 😕

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can ignore nulls as the first argument since we are not coercing them to any other type. Unlike, array_append(arr, null) as the second argument, in this case, we will convert to other primitives.

array() for fixed list to list type coercion only.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe our primary focus within TypeCoercionRewriter should be strictly on type coercion. If encountering a null value that isn't being cast to another type, it should be disregarded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe our primary focus within TypeCoercionRewriter should be strictly on type coercion. If encountering a null value that isn't being cast to another type, it should be disregarded.

I cannot agree more with that. I should focus on the type coercion, and leave NULL aside now. So I will remove allow_null_coercion and don't check null now.

| DataType::LargeList(_)
| DataType::FixedSizeList(_, _) => {
let array_type = coerced_fixed_size_list_to_list(array_type);
Ok(vec![vec![array_type, DataType::Int64]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it can be further simplied if there is a function that process coerced_fixed_size_list_to_list and return array_type. Then can be used both in array_and_index and array

/// `current_types` - The data types of the arguments
/// `allow_null_coercion` - Whether null type coercion is allowed
/// Returns the valid types for the function signature
pub fn get_type_signature(
Copy link
Contributor

@jayzhan211 jayzhan211 Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't think we need to place all signature functions into get_type_signature.
Instead, having separate signature functions. And, collect what we need in the matching function.

This way the function can be reused easily. For example,

fn array() return array_type.
match {
 signature::array => { vec![vec![array()]] }
  signature::array_and_index => { vec![vec![array(), i64]]  }
}

@Weijun-H Weijun-H marked this pull request as draft February 22, 2024 14:32
@github-actions github-actions bot removed the physical-expr Physical Expressions label Feb 25, 2024
@Weijun-H Weijun-H marked this pull request as ready for review February 25, 2024 14:40
@Weijun-H Weijun-H requested a review from jayzhan211 February 25, 2024 14:40
_ => Ok(vec![vec![]]),
}
}
fn array_and_index(current_types: &[DataType]) -> Result<Vec<Vec<DataType>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn array_and_index(current_types: &[DataType]) -> Result<Vec<Vec<DataType>>> {
fn array(array_type: &DataType) -> Option<DataType> {
match array_type {
DataType::List(_)
| DataType::LargeList(_)
| DataType::FixedSizeList(_, _) => {
let array_type = coerced_fixed_size_list_to_list(array_type);
Some(array_type)
}
_ => None,
}
}
match self {
ArrayFunctionSignature::ArrayAndElement => {
array_append_or_prepend_valid_types(current_types, true)
}
ArrayFunctionSignature::ElementAndArray => {
array_append_or_prepend_valid_types(current_types, false)
}
ArrayFunctionSignature::ArrayAndIndex => {
if current_types.len() != 2 {
return Ok(vec![vec![]]);
}
let array_type = array(&current_types[0]);
if let Some(array_type) = array_type {
Ok(vec![vec![array_type, DataType::Int64]])
} else {
Ok(vec![])
}
}
ArrayFunctionSignature::ArrayAndElementAndOptionalIndex => {
array_element_and_optional_index(current_types)
}
ArrayFunctionSignature::Array => {
if current_types.len() != 1 {
return Ok(vec![vec![]]);
}
let array_type = array(&current_types[0]);
if let Some(array_type) = array_type {
Ok(vec![vec![array_type]])
} else {
Ok(vec![])
}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to return the data type for array instead of long but useless Result<Vec<Vec<DataType>>>

}

impl std::fmt::Display for ArrayFunctionSignature {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ArrayFunctionSignature::ArrayAndElement => {
write!(f, "array, element")
write!(f, "ArrayAndElement")
}
ArrayFunctionSignature::ArrayAndElementAndOptionalIndex => {
write!(f, "array, element, [index]")

This comment was marked as outdated.

}

impl std::fmt::Display for ArrayFunctionSignature {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ArrayFunctionSignature::ArrayAndElement => {
write!(f, "array, element")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change?

@Weijun-H Weijun-H requested a review from jayzhan211 February 26, 2024 02:47
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Weijun-H

}
array(&current_types[0]).map_or_else(
|| Ok(vec![vec![]]),
|array_type| Ok(vec![vec![array_type, DataType::Int64]]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/expr/src/type_coercion/functions.rs Outdated Show resolved Hide resolved
@jayzhan211
Copy link
Contributor

@Weijun-H, thanks again!

@jayzhan211 jayzhan211 merged commit b728232 into apache:main Feb 26, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants