-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Binary
/LargeBinary
--> Utf8
/LargeUtf8
in ilike and string functions
#7840
Conversation
Binary
/LargeBinary
--> Utf8
/LargeUtf8
coercionBinary
/LargeBinary
--> Utf8
/LargeUtf8
in ilike
dd8f41e
to
b932e1e
Compare
Binary
/LargeBinary
--> Utf8
/LargeUtf8
in ilike Binary
/LargeBinary
--> Utf8
/LargeUtf8
in ilike and string functions
…ring functions fixup
b932e1e
to
56514be
Compare
DataType::LargeUtf8 => $largeUtf8Type, | ||
// LargeBinary inputs are automatically coerced to Utf8 | ||
DataType::LargeBinary => $largeUtf8Type, | ||
DataType::Utf8 => $utf8Type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the fix for string functions taking binary
arguments
The new cases for Binary/LargeBinary
are all that is new here -- the rest is just comments
/// Coercion rules for binary (Binary/LargeBinary) to string (Utf8/LargeUtf8): | ||
/// If one argument is binary and the other is a string then coerce to string | ||
/// (e.g. for `like`) | ||
fn binary_to_string_coercion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the new rule, only used for ilke
I originally tried modifying string_coercion
to also include Binary --> String
coercion, however, that has the (bad) side effect of making a binary = string
comparison into cast(binary)= string
which is bad because:
- The comparison is valid even if the binary column contains non UTF8 data (but the cast will fail in this case)
- The cast from binary --> utf8 is not free (it has to check each value for utf8 correctness)
@@ -681,6 +708,7 @@ fn binary_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> | |||
/// This is a union of string coercion rules and dictionary coercion rules | |||
pub fn like_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> { | |||
string_coercion(lhs_type, rhs_type) | |||
.or_else(|| binary_to_string_coercion(lhs_type, rhs_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the fix for like
let (lhs, rhs) = get_input_types(&$A_TYPE, &$OP, &$B_TYPE)?; | ||
assert_eq!(lhs, $C_TYPE); | ||
assert_eq!(rhs, $C_TYPE); | ||
($LHS_TYPE:expr, $RHS_TYPE:expr, $OP:expr, $RESULT_TYPE:expr) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same logic, but I renamed the arguments for clarity
|
||
query error DataFusion error: type_coercion | ||
SELECT largebinary FROM t where largebinary LIKE '%F'; | ||
query ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They work 🎉
@jonahgao I wonder if you have some time to review this PR cc @Weijun-H and @JayjeetAtGithub and @parkma99 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I review this PR carefully, a great job!
Thanks @alamb .
@alamb LGTM 👍. I also tested some other string functions on this branch, and they worked as expected.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @alamb 👍
Thanks everyone for the reviews! |
make_utf8_to_return_type!(utf8_to_str_type, DataType::LargeUtf8, DataType::Utf8); | ||
|
||
// `utf8_to_str_type`: returns either a Int32 or Int64 based on the input type size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utf8_to_str_type
-> utf8_to_int_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find -- filed #7861
@@ -648,8 +652,9 @@ fn string_concat_internal_coercion( | |||
} | |||
} | |||
|
|||
/// Coercion rules for Strings: the type that both lhs and rhs can be | |||
/// casted to for the purpose of a string computation | |||
/// Coercion rules for string types (Utf8/LargeUtf8): If at least on argument is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least on argument -> at least one argument
} | ||
} | ||
|
||
/// Coercion rules for binary types (Binary/LargeBinary): If at least on argument is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least on argument -> at least one argument
/// | ||
/// For example, if a function `func` accepts arguments of `(int64, int64)`, | ||
/// but was called with `(int32, int64)`, this function could match the | ||
/// valid_types by by coercing the first argument to `int64`, and would return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valid_types by by coercing -> valid_types by coercing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found a few typos.
Thank you @viirya -- I will fix the typos |
PR to fix: #7873 |
Which issue does this PR close?
Closes #7342
Closes #7345
Closes #7344
It also closes several draft / PRs:
Closes #7349
Closes #7350
Closes #7365
Rationale for this change
Some of the clickbench queries have this pattern
What changes are included in this PR?
Binary
andLargeBinary
to the appropriateUtf8
(string) variants forLIKE
andILIKE
Are these changes tested?
Yes, new tests
Are there any user-facing changes?
Yes now some queries will not error