-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify EXPR LIKE 'constant'
to expr = 'constant'
#13061
Conversation
datafusion/optimizer/Cargo.toml
Outdated
@@ -41,6 +41,7 @@ async-trait = { workspace = true } | |||
chrono = { workspace = true } | |||
datafusion-common = { workspace = true, default-features = true } | |||
datafusion-expr = { workspace = true } | |||
datafusion-functions = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is okay, or if there is a way to avoid it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to make some interface for rewriting the function through 🤔
Perhaps adding a method to SimplifyInfo
like
/// return a new Expr with a function call for exact string prefix max
fn make_starts_with(base: Expr, prefix: Expr) -> Option<Expr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure what you mean. Is this a pattern already in use somewhere I can learn from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was suggest adding a method to https://docs.rs/datafusion/latest/datafusion/logical_expr/simplify/trait.SimplifyInfo.html
However i can't quite figure out how to plumb it in without getting messy
THe trick is that this rewrite is based on the semantcs of starts_with
-- so I think logically it would need to live along side the implementation of starts_with
because it depends on those semantics.
And similarly, to do the rewrite in the pruning predicate we need something similar (somehow to provide a per function rewrite) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm okay I'm happy to work on this more but I fear I am not familiar enough with the relationship between all of these things to figure out how to wire everything up correctly. Maybe we pause this work until someone with more familiarity of these bits of code can suggest a way to structure this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try and help push this forward this week. I am sorry for the delay but I have been super busy the last few days
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem 😄
Wow, this is now rewriting |
I thought more about this PR the other day -- given the dependecy you have identified, I wonder if we might actually want to do the opposite Specifically rewrite starts_with(x, 'foo') into x LIKE 'foo%' With the rationale that I'll try and find a ticket to track the idea |
Hmm interesting. The dependency is a bit strange to me, I'm not sure why functions would depend on the optimizer, I would think the optimizer could depend on functions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adriangb what would you think about updating this PR to handle the no wild card case.
The more I think about the prefix case the less I am convinced that translating to some other function would be valuable. Instead I am thinking maybe we can ensure the LIKE
evaluation is fast (perhaps it can detect this case and invoke a better kernel) 🤔
To handle ONLY the no wildcard case? Happy to do that if that's all the progress we can make for now. |
if index == pattern.len() - 1 && !case_insensitive && escape_char.is_none() { | ||
let prefix = pattern[..index].to_string(); | ||
let new_expr = Expr::ScalarFunction(ScalarFunction { | ||
func: datafusion_functions::string::starts_with(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today a downstream project can choose which functions to use, this is configurable in the session, right? The DF repertoire of functions serves as a convenient package, but not mandatory.
Here we're inserting implicit function call to a specific function.
Is this following a pre-existing pattern? if so, i am OK with this.
But maybe it would be better to eg consult session state whether this function was supposed to be used. What do you think?
@@ -1446,18 +1446,54 @@ impl<'a, S: SimplifyInfo> TreeNodeRewriter for Simplifier<'a, S> { | |||
|
|||
// Rules for Like | |||
Expr::Like(Like { | |||
expr, | |||
pattern, | |||
expr: ref like_expr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep expr
as expr
use like_expr
when referring to the whole Expr::Like
thing
if pattern == "%" && !is_null(like_expr) { | ||
return Ok(Transformed::yes(lit(!negated))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cool idea, but is_null(like_expr)
checks makes it incorrect.
This function returns "is this always null". Applied to a column reference, it will return false, so this optimization will kick in, but the column values are still nullable.
The "more correct" way would be to
- optimize
WHERE col LIKE '%'
toWHERE col IS NOT NULL
- optimize
WHERE col NOT LIKE '%'
toWHERE false
.
However, this simplification cannot be done here. It can only be done for top-level filter conjuncts, or when expr being simplified is immediately used as a condition (eg IF(expr, ...)
). In other cases, this need to be NULL-preserving boolean logic:
trino> SELECT a, a LIKE '%', a NOT LIKE '%'
-> FROM (VALUES NULL, '', 'x') t(a);
a | _col1 | _col2
------+-------+-------
NULL | NULL | NULL
| true | false
x | true | false
(3 rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The easiest solution would be to just drop this special case.
I.e. expr LIKE '%'
would get simplified to starts_with(expr, '')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also #12637 (comment)
let pct_wildcard_index = pattern.find('%'); | ||
let underscore_index = pattern.find('_'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honor escape_char
|
||
// Handle single % at end case (prefix search) | ||
if let Some(index) = pct_wildcard_index { | ||
if index == pattern.len() - 1 && !case_insensitive && escape_char.is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&& underscore_index.is_none()
@@ -3589,6 +3635,18 @@ mod tests { | |||
let expr = not_ilike(col("c1"), "%"); | |||
assert_eq!(simplify(expr), lit(false)); | |||
|
|||
let expr = like(col("c1"), "a%"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add test cases
-
with
%
pattern -
with
_
pattern (can be simplified to a length check, but that can be follow up) -
with
%%
and%%%
patterns (can be simplified just like single%
) -
with
__
and___
patterns (can be simplified to a length check, but that can be follow up) -
with patterns combining % and _, eg
a_b%
,a%b_
(these cannot simplified) and_%
,%_
(these can be simplified to a length check, but that can be follow up)
do we have equivalent of #12978 for starts_with? |
EXPR LIKE 'constant'
to expr = 'constant'
if !is_null(&like_expr.expr) && pattern_str == "%" { | ||
Transformed::yes(lit(!like_expr.negated)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is correct. Did you consider #13061 (comment) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was already here (see diff above). Did I change the logic in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Can you please confirm or disconfirm the correctness concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is incorrect but this is the logic that was there before this PR. I'm happy to remove it. I guess no one has complained because the case that breaks is SELECT NULL LIKE '%'
which isn't something that happens under normal usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is incorrect but this is the logic that was there before this PR.
thanks for confirming and understood it's pre-existing.
Can it be fixed though?
i personally would prefer that git annotate
on the file wouldn't point at me as the author of something that's known to be incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My latest commit removes that special case. I'm not sure how to fix it, your comment suggests it can't be fixed right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess no one has complained because the case that breaks is
SELECT NULL LIKE '%'
which isn't something that happens under normal usage.
actually the opposite.
this was correct only for this case
and wrong for all other ... LIKE '%'
queries.
see #13259
} else if !pattern_str.contains(['%', '_'].as_ref()) { | ||
// If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add TODO to handle escaped search characters in a follow-up.
Escaped wildcards are not uncommon in certain scenarios - eg correct use of JDBC DatabaseMetaData will produce LIKE foo\_bar
when inspecting object foo_bar
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it not simplify if there is an escape character for now
let expr = like(col("c1"), "a_"); | ||
assert_eq!(simplify(expr), col("c1").like(lit("a_"))); | ||
let expr = not_like(col("c1"), "a_"); | ||
assert_eq!(simplify(expr), col("c1").not_like(lit("a_"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a test cases where pattern is constant NULL (it won't be simplified today, that's fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 6c847fa
Transformed::yes(lit(!like_expr.negated)) | ||
} else if !pattern_str.contains(['%', '_'].as_ref()) { | ||
// If the pattern does not contain any wildcards, we can simplify the like expression to an equality expression | ||
if like_expr.escape_char.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can this condition be checked on the same line with !pattern_str.contains(['%', '_']
it would reduce code branching.
Expr::Literal(ScalarValue::Utf8(Some(pattern_str))) if pattern_str == "%" | ||
) => | ||
{ | ||
Transformed::yes(lit(!negated)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this is correct, but requires updating some test cases.
i am doing this in #13259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, I'll wait for your work to be merged then continue here
Closing in favor of #13260 |
#12978 (comment)
Closes #13192