Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement predicate pruning for like expressions (prefix matching) #12978

Merged
merged 8 commits into from
Dec 30, 2024

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Oct 16, 2024

The idea is that we can push certain like expressions down into statistics pruning.
For example, a filter like url LIKE 'https://www.google.com%' can be (basically, some caveats and other tricks) used to filter url_min <= 'https://www.google.com' and 'https://www.google.com' <= url_max such that a row group that only has https://www.example.com would be excluded.

Closes #507
Closes #13253

@github-actions github-actions bot added the core Core DataFusion crate label Oct 16, 2024
@adriangb
Copy link
Contributor Author

cc @alamb

@alamb
Copy link
Contributor

alamb commented Oct 16, 2024

This is very clever -- I will review it tomorrow

Comment on lines 1499 to 1501
(false, true) => Operator::ILikeMatch,
(true, true) => Operator::NotILikeMatch,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is dead code as if like_expr.case_insensitive() { catches the case insensitive case

I think this code would be clearer if it just matched on like_expr.negated() (or alternately returned unhandled_hook.handle(expr); directly for these last 2 cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure will change it to do so. I think I was getting a bit ahead of myself to implement ILIKE support, which as per the comment should be possible, maybe you can show me how to construct the physical expression to call lower() and upper() on another expression.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb -- this is very cool.

I think there are a few more tests needed but otherwise the implementation looks very solid. Thank you

@@ -1610,6 +1625,93 @@ fn build_statistics_expr(
Ok(statistics_expr)
}

fn extract_string_literal(expr: &Arc<dyn PhysicalExpr>) -> Result<&String> {
if let Some(lit) = expr.as_any().downcast_ref::<phys_expr::Literal>() {
if let ScalarValue::Utf8(Some(s)) = lit.value() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should probably also handle the cases ScalarValue::LargeUtf8, ScalarValue::Utff8View as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Dictionary!

if prefix.is_empty() {
return plan_err!("Empty prefix in LIKE expression");
}
Ok(Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also have to test if there are other occurences of % in the string 🤔 (like foo%bar%)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is pretty simple (truncate at the first one) but I agree another test would be nice.

// column LIKE '%foo%' => min <= '' && '' <= max => true
// column LIKE 'foo' => min <= 'foo' && 'foo' <= max

// I *think* that ILIKE could be handled by making the min lowercase and max uppercase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I agree. Figuring out how to make those call would be the trick

fn build_like_match(
expr_builder: &mut PruningExpressionBuilder,
) -> Result<Arc<dyn PhysicalExpr>> {
// column LIKE literal => (min, max) LIKE literal split at % => min <= split literal && split literal <= max
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the LIKE literal split at % part

column LIKE literal is the same as column = literal if there are no wild cards, so you should be able to use the same rules as equality I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right that's the point, by splitting it at the first % we are able to apply the same rules as equality:

column LIKE literal -> (min, max) LIKE (literal split at %) -> min <= split literal && split literal <= max
vs
column = literal -> (min, max) = literal -> min <= literal && literal <= max

let expected_ret = &[true, true, false, false, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comments say s1 LIKE A% but the code builds a different expressions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I've tried to get all of these comments right, I think they can be removed tbh the expression is pretty self explanatory, but left them for now, let me know if you'd prefer that I remove them or if you want to keep them if there are any obviously wrong

let expected_ret = &[true, true, false, false, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please add tests for other combinations:

  • s1 LIKE 'A'
  • s1 LIKE '%A%'
    I think it is important the matching is doing the right thing

I also think it is important to to cover cases for NOT LIKE as well

let expected_ret = &[true, true, true, true, true, true];

prune_with_expr(
// s1 LIKE 'A%'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// s1 LIKE 'A%'
// s1 LIKE '%A'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is still wrong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I haven't pushed pending discussing the correctness of the general approach.

@adriangb
Copy link
Contributor Author

adriangb commented Oct 17, 2024

@alamb can these stats be truncated? I know stats in pages truncate large strings, e.g. if the min value is "B" could it be that the actual min value is "BA"? If so I think this approach may not work at all. Imagine we have a row group with data ["BA", "ZD"] which generates min/max stats ["B", "Z"]. Now we want to know if col LIKE '%A%' is possible. Clearly the answer should be yes but if we convert it to the predicate form we get 'B' <= '' AND '' <= 'Z' which gives false 😞. I think this could be resolved by truncating the stats column to be the same length as the prefix?

@alamb
Copy link
Contributor

alamb commented Oct 18, 2024

@adriangb in theory I think parquet statistics can be truncated.

Now we want to know if col LIKE '%A%'

I don't think we can use statistics for substring match -- we can only use statistics for equality and prefix matching

so like col LIKE 'A%'

The predicate would be transformed into 'B' <= 'A' AND 'A' <= 'Z' which I do think is correct

@adriangb
Copy link
Contributor Author

Consider the values ["ABC", "XYZ"] with stats ["AB", "XY"] and the filter col like 'A%'. This becomes 'AB' <= 'A' AND 'A' <= 'XY' which is false, but we need true. To fix this we'd need to truncate the stats to the length of the filter to get 'A' <= 'A' AND 'A' <= 'XY' which then gives the right result. %A% is just an obvious case because you get '' as the prefix which gives obvious issues.

@adriangb
Copy link
Contributor Author

adriangb commented Oct 19, 2024

Okay @alamb I pushed a pretty big rework. Lots of new test cases, lots of comments explaining what's going on. I removed the not like part; I'm thinking this is complex enough as is and most of the benefit (maybe even in ClickBench?) will come from like. We can tackle not like, ilike and not ilike in future PRs. Especially since it's going to be important to evaluate each test case carefully.

I will note that I am a bit concerned about the interaction of truncated stats and how we apply these filters. Take the (absurd) case of stats that were truncated so that all you have is "","". You basically know nothing about the data, there could be anything in there. Yet col = 'A' transforms into '' <= 'A' and 'A' <= '' which is false. Substitute in non-truncated stats and 'A' <= 'A' and 'A' <= 'Z' is true. I added a test case for this behavior on the existing code. It doesn't differentiate between "ABC" truncated to "" and "" actually being the min string but it shows the behavior which would be the same in both cases.

This is important because for the case of a max stat of "A" and a filter col like 'A_' if the stat might have been truncated from "AB" I need to let it pass, if I know for a fact that "A" is the max string in the column I can indeed reject the column.

@adriangb
Copy link
Contributor Author

Argh the only ClickBench query this maybe could improve is 23:

WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%'

Since the like starts with a wildcard this won't help. Maybe we can get not like to do something smart there, tbd...

@Dandandan
Copy link
Contributor

Dandandan commented Oct 19, 2024

I am wondering if simple like patterns are not already converted to startswith etc, such that pruning already is applied or we need to implement that case instead of for like?

@adriangb
Copy link
Contributor Author

I am wondering if simple like patterns are not already converted to startswith etc, such that pruning already is applied.

Good question. Based on performance of some queries I saw I'd say no, but it it's worth double checking. Any suggestions as to a definitive easy way to check? I guess I can run datafusion-cli against a very large parquet file in object storage (high latency) and a query that should filter (col like 'A%') and one that can't?

I don't see where startswith or any other transformations (lower, upper, etc.) are handled in the pruning transformation.

@Dandandan
Copy link
Contributor

I would take a look at the (physical plan) of queries involving like first to see if it still uses like or is transformed to another function.

@adriangb
Copy link
Contributor Author

adriangb commented Oct 19, 2024

I made a big parquet file as follows:

import random
import string
import polars as pl

df = pl.DataFrame({'col': ["A" + "".join(random.choices(string.ascii_letters, k=1_000)) for _ in range(1_000_000)]})
df.write_parquet('data.parquet', compression='uncompressed')

This came out to ~1GB. I then uploaded it to a GCS bucket.

I ran queries col = 'Z' and col like 'Z' against it and got 2s and 23s respectively. IMO that means it's not getting pushed down.

The explain plans reflect that as well:

ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <= col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s, time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs, pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s, time_elapsed_processing=1.545583231s, page_index_eval_time=20ns, pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]

The like query also has:

FilterExec: col@0 LIKE Z, metrics=[output_rows=0, elapsed_compute=1.878551ms]

So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.

I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.

@adriangb adriangb changed the title Implement predicate pruning for LIKE and NOT LIKE Implement predicate pruning for like expressions Oct 19, 2024
@alamb
Copy link
Contributor

alamb commented Oct 21, 2024

So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.

I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.

That certainly makes sense to me

Perhaps we could implement some sort of simplification in https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs for LIKE into starts_with https://docs.rs/datafusion/latest/datafusion/functions/expr_fn/fn.starts_with.html (though since it is a function figuring out how to make the rewrite work might be tricky)

Then we can implement the rules in pruning predicate for starts_with 🤔

@adriangb
Copy link
Contributor Author

adriangb commented Oct 21, 2024

At most we could simplify the col like 'A%' case but we can't simplify 'A%B' so I think it's still worth it to implement pruning rewrites for both.

Do you have any thoughts on my concerns for possibly truncated stats, in particular how = may even be wrong as of today if the stats are truncated enough?

@Dandandan
Copy link
Contributor

I think it’s fine to support like for now and leave the further combination / optimization for future work. I see that only simplifying to starts_with won't get all the benefits.

Of course, the pruning needs to be correct :). We could add (negative) test cases / add issues if the already implemented logic for = pruning is incorrect.

@adriangb
Copy link
Contributor Author

adriangb commented Oct 21, 2024

Well there's no tests for hypothetical cases with truncated stats. All of the tests are against the stats themselves with no indication of how those are meant to correspond with the original data. There were no unit tests of Utf8 filtering at all as far as I can tell.

The current implementation of = is certainly not wrong in the real world, but I'm not sure if that's because it's not used in situations where stats are truncated, if truncation only happens at extremes like a 10MB value where practically it's not a problem, etc.

@alamb
Copy link
Contributor

alamb commented Oct 21, 2024

I suggest we (I can help tomorrow):

  1. File a ticket to simplify like to = and starts_with when possible (to help follow on optimizations like this)
  2. File / find a ticket about testing with truncated statistics
  3. Determining what, if anything, is left to add directly to pruning prediate

@adriangb
Copy link
Contributor Author

Sounds good.

All of that said, I think this PR is currently as correct as = and had pretty good test coverage. Does it need to wait on those tasks or can it proceed in parallel?

let (min_lit, max_lit) = if let Some(wildcard_index) = first_wildcard_index {
let prefix = &s[..wildcard_index];
let prefix_min_lit = Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(
format!("{prefix}\u{10ffff}"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'highest character' should be appended to the max range, not the min.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When implementing similar thing for Trino (trinodb/trino@6aea881), i decided to stay within ASCII characters to avoid any potential issues due to misinterpretation of "difficult" code points.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm my intuition was that you want to add the highest character to the upper lower bound of the min value such that 'A%' can match 'AB'. Assuming a column with only 1 row "AB" and the query 'AB' like 'A%':

  • 'AB' <= 'A\u{10ffff}' and 'A' <= 'AB' -> t
  • 'AB' <= 'A' and 'A\u{10ffff}' <= 'AB' -> f
    Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to stay within ASCII? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if comment is related to the confusing naming in https://github.com/apache/datafusion/pull/12978/files#r1810513072?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to stay within ASCII? As far as I know everything here is Utf8 so I'm not sure how we can restrict it to ASCII?

We can if only we want to. The code will do whatever we ask it to do.
Then the question is whether we want to. If we apply the predicate locally in memory only, then no need to be cautious, no need for "stay within ASCII". If we later interop with other systems (eg send a plan somewhere or TableProvider calls remote system), then it might be beneficial to restrict ourselves.

Hmm my intuition was that you want to add the highest character to the upper lower bound

i agree with this

... of the min value

min value of what?
all column values need to be in the range [like_constant_prefix, like_constant_prefix[0..-1] + \u10ffff)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... of the min value

min value of what?

i get it now

So for a column we have stats: min value and max value. Let's call them col_min and col_max.
For like AB% we derive lower and upper bound (AB and AB\u10ffff which is actually incorrect, will comment about this elsewhere).

For pruning we need to check whether [col_min, col_max] ∩ [lower_bound, upper_bound) is non-empty (note the upper_bound will be non-inclusive)
It's empty when upper_bound <= col_min OR col_max < lower_bound
It's non-empty when upper_bound > col_min AND col_max >= lower_bound

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For correct upper bound and why exclusive see #12978 (comment)

// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' is should be true!
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach.
// ANSI SQL specifies two wildcards: % and _. % matches zero or more characters, _ matches exactly one character.
let first_wildcard_index = s.find(['%', '_']);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not support escape characters, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

> SELECT '' LIKE '' ESCAPE '%';
Execution error: LIKE does not support escape_char

So I think not.
But just for reference, how would you suggest escape characters be handled? I've never used them in practice (I think at that point I'd just go for a regex).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But just for reference, how would you suggest escape characters be handled?

simple state automata
(eg https://github.com/trinodb/trino/blob/8f279a84bfe7f6f8301a44d26b01c981fc2156f9/core/trino-main/src/main/java/io/trino/type/LikeFunctions.java#L87-L102)

Comment on lines 1661 to 1664
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' is should be true!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This important, but a bit difficult to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to clarifications on the wording. The point I'm trying to make is why we have to append characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this whole comment. If someone understand how LIKE works, they will get the code without comment.
If someone doesn't understand how LIKE works, they won't understand the code even with the comment.

let min_expr = Arc::new(phys_expr::BinaryExpr::new(
min_column_expr.clone(),
Operator::LtEq,
min_lit,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min_lit actually represents max value (upper bound)
would you consider swapping the naming of the variables?

Also, the upper bound has added "Z" ('highest codepoint') at the end, so can be compared with Lt without Eq part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min_lit actually represents max value (upper bound)
would you consider swapping the naming of the variables?

I'm open to any suggestions on naming but I do think it is confusing because min_lit is the upper bound on col_min and max_lit is the lower bound on col_max 🤯

Also, the upper bound has added "Z" ('highest codepoint') at the end, so can be compared with Lt without Eq part

👍🏻

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is confusing because min_lit is the upper bound on col_min and max_lit is the lower bound on col_max 🤯

Yes, it is.

i understand now that you're fitting this into terminology of existing code.
i am not sure what the right naming would be. maybe neutral: lower_bound and upper_bound?

Comment on lines 1678 to 1700
let prefix_lit =
Arc::new(phys_expr::Literal::new(ScalarValue::Utf8(Some(s.clone()))));
(prefix_lit.clone(), prefix_lit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such case we should produce single Eq check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by a single eq check. We are basically saying col like 'constant' -> col = 'constant' -> col_min <= 'constant' and 'constant' <= col_max

let s = extract_string_literal(scalar_expr)?;
// **IMPORTANT** we need to make sure that the min and max are in the range of the prefix
// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for A% pattern the lower bound is A (obvious)
what should be the upper bound?

A\u{10ffff} is not a correct upper bound since A\u{10ffff}\u{10ffff} is even bigger but still matches A% input.
The correct upper bound would be:

  • A\u{10ffff}\u{10ffff}\u{10ffff}...\u{10ffff} include -- up to max length of the column, so potentially very very long, so absolutely not practical
  • B (exclusive).

Thus to calculate upper bound you need (pseudo-code)

let s = extract_string_literal(scalar_expr)?;
let first_wildcard_index = ...;
let prefix = &s[..wildcard_index];
let last_incrementable_character = /* find last code point of `prefix` that can be incremented
   if we choose to stay within ascii, this will be a code point < 127
   otherwise it will be any code point != the max code point (0x10FFFF) */;
if last_incrementable_character not found {
  // For `%`, or `\u{10ffff}...\u{10ffff}%` patterns, we cannot calculate an upper bound
  return None
}
let upper_bound = 
   prefix[..last_incrementable_character-1] +  // take prefix of the prefix up to  and excluding the last character that can be incremented
   str(prefix[last_incrementable_character] + 1) // take last character and increment it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about this explanation. Maybe you can provide a failing test case that would help me understand? There is already a test case for 'A%' and it is as far as I can tell doing the correct thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try add a test case for 'A%' where stats are min=AB max=A\u{10ffff}\u{10ffff}\u{10ffff}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I added a test case in e29ed50. It worked as expected, let me know if I got the expected outcomes wrong or missed something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #12978 (comment) on how to make the test expose the problem

// If we truncate 'A%' to 'A', we need to make sure that 'A' is less than 'AB' so that
// when we make this a range query we get 'AB' <= 'A\u{10ffff}' AND 'A' <= 'AB'.
// Otherwise 'AB' <= 'A' AND 'A' <= 'AB' would be *wrong* because 'AB' LIKE 'A%' is should be true!
// Credit to https://stackoverflow.com/a/35881551 for inspiration on this approach.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link isn't useful. That page conveniently avoids any details that are important. Please remove the link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a better resource or name for this transformation? I'm happy to point at any documentation or prior art in Trino.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just not finding this link useful being being inspirational. But source of inspiration doesn't need to be reflected in the code comment 🙂
I am not asking for replacing if any other link.

I personally find trinodb/trino@6aea881 valuable because (1) I know this code as its author and (2) it might actually be correct. Once we get the code here correct, that link wouldn't be useful either.

@adriangb adriangb force-pushed the like-prune branch 2 times, most recently from a0a1c37 to ae3426d Compare October 26, 2024 21:46
@adriangb
Copy link
Contributor Author

@alamb I re-arranged some of the comments on assertions in ae3426d which I feel like helped a lot with readability of the tests. There's a couple other tests with a similar pattern that I think could benefit.

I was also thinking about doing some more black-box testing: I think given any min, max you can always convert that into a RecordBatch with an array of the form [min, max] and you should never have the pruning say the array should be excluded but the array has any matches. Does that make sense? Maybe this could even be fuzz tested?

Comment on lines 1719 to 1722
// Filter out non-characters (https://www.unicode.org/versions/corrigendum9.html)
if [0xFFFE, 0xFFFF].contains(&cp) || (0xFDD0..=0xFDEF).contains(&cp) {
return false;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the link, I'm not sure we need to filter out non-characters. They are allowed to appear in well formed strings, and will sort properly AFAICT. Then this whole function could boil down to

  c as u32 < 0x110000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the link above

Noncharacters in the Unicode Standard are intended for internal use and have no standard interpretation when exchanged outside the context of internal use.

whereas the filter condition we gonna produce here may be exchanged between systems

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. I assumed this predicate was consumed internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually is, but AFAICT it can also reach TableProvider.scan and then ... 🤷‍♂️

Comment on lines 3613 to 3615
// Test surrogate pair range (0xD800..=0xDFFF)
assert_eq!(increment_utf8("a\u{D7FF}").unwrap(), "b");
assert!(increment_utf8("\u{D7FF}").is_none());
Copy link

@etseidl etseidl Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting case. While trying to patch up parquet-rs, I found that a byte-by-byte approach will actually wind up incrementing U+D7FF to U+E000 (which is private use, but still a valid character).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why so? 0xD7FF + 1 would be 0xD800, not 0xE800

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW @etseidl has a PR upstream in arrow-rs related to incrementing utf8:

Maybe we can focus our efforts on making sure we have a solid utf8 increment routine and then use it in both places

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the link.

@alamb i am not against reusing, but note that business logic for Parquet stats and here is different.
For stats, it's enough to find some value bigger than input.
Here, it's mandatory to find next value bigger than input.

i have feeling that the routine in this PR is ready to be merged.
@adriangb can you please rebase the PR?
if a better version emerges, it's not a problem to improve later, but there is no need to tie this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi @alamb feels that this needs some fuzz testing before it can be merged and I heavily agree.

I plan on doing this but have not had time.
Since we have a DataFusion meetup next week in Chicago I've asked for 2 days of dedicated work to spend on this + prepping for the meetup next week. So hopefully I can tackle that Monday and Tuesday.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why so? 0xD7FF + 1 would be 0xD800, not 0xE800

Because 0xD800-0xDFFF are not valid code points. The next higher valid code point is 0xE000. It's a happy coincidence that this happens.

@adriangb
Copy link
Contributor Author

adriangb commented Dec 13, 2024

Hi @alamb I took a stab at fuzz tests in f6c5314.
They're heavy and slow so I had to restrict the search space a lot more than I would have liked. Maybe you or @tustvold can suggest ways to cut out the heavy parsing of Parquet metadata and such to speed these up? Ultimately I do think it's worth re-using whatever creates parquet stats from data so that we use the "real" thing but I don't think we need to test the serialization / deserialization repeatedly like this does.
Also happy to restrict the search space by being more deliberate about how we build the values, row groups, predicates, etc.

@alamb
Copy link
Contributor

alamb commented Dec 15, 2024

Thanks @adriangb -- I have put this PR on my review queue and will attempt to re-review it carefully tomorrow

@alamb
Copy link
Contributor

alamb commented Dec 20, 2024

Clearly I failed to review this -- I will do so hopefully later today but may be tomorrow

@alamb
Copy link
Contributor

alamb commented Dec 23, 2024

This is still on my list, hopefully other people can check it out too

@alamb
Copy link
Contributor

alamb commented Dec 23, 2024

This is my top priority after DF 44 is released:

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb -- I think this PR is ready to go

One thing I noticed is that the fuzz test takes over a minute on my machine:

        SLOW [> 60.000s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
        PASS [  65.772s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
------------
     Summary [  72.749s] 47 tests run: 47 passed (1 slow), 0 skipped
andrewlamb@Mac:~/Software/datafusion$

Is there some way to make it faster? Maybe with multiple threads or crank down the number of things to teset?

"~",
"ß",
"℣",
"%", // this one is useful for like/not like tests since it will result in randomly inserted wildcards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/core/tests/fuzz_cases/pruning.rs Outdated Show resolved Hide resolved
/// of "fo" that may have originally been "foz" or anything else with the prefix "fo".
/// E.g. `increment_utf8("foo") >= "foo"` and `increment_utf8("foo") >= "fooz"`
/// In this example `increment_utf8("foo") == "fop"
fn increment_utf8(data: &str) -> Option<String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be ok to potentially replace this with the implementation from @etseidl in apache/arrow-rs#6870 ?

If so, I can file a ticket to do so as a follow on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed that implementation but yes I think we should consider it!

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@adriangb
Copy link
Contributor Author

Thanks @adriangb -- I think this PR is ready to go

One thing I noticed is that the fuzz test takes over a minute on my machine:

        SLOW [> 60.000s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
        PASS [  65.772s] datafusion::fuzz fuzz_cases::pruning::test_fuzz_utf8
------------
     Summary [  72.749s] 47 tests run: 47 passed (1 slow), 0 skipped
andrewlamb@Mac:~/Software/datafusion$

Is there some way to make it faster? Maybe with multiple threads or crank down the number of things to teset?

Yeah this is what I was hinting at in #12978 (comment).

I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again.

@alamb
Copy link
Contributor

alamb commented Dec 28, 2024

Yeah this is what I was hinting at in #12978 (comment).

I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again.

Awesome -- I'll try and find time later today or tomorrow to give it a critical eye. Otherwise I'll plan to merge this PR later today or tomorrow as well.

@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

Yeah this is what I was hinting at in #12978 (comment).

I'm happy to throw threads at it for a start, and restricting the search space might be necessary but I think requires a more careful eye to minimize how much valuable testing is discarded. The other thing that I think we can do is speed up the tests themselves, in particular minimizing unnecessary round trips to Parquet, but I'm not sure where the right places to hook in would be that still give us a realistic test but remove the need to re-parse the same data over and over again.

@alamb alamb merged commit fb1d4bc into apache:main Dec 30, 2024
27 checks passed
@alamb alamb added the performance Make DataFusion faster label Dec 30, 2024
@alamb
Copy link
Contributor

alamb commented Dec 30, 2024

Thank you again @adriangb for bearing with us -- I know this took a long time

However, I am pretty stoked that we now have this optimization and it is an example of the very careful engineering required for this kind of optimization. The fact we are at this point in DataFusion is pretty sweet in my mind

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules performance Make DataFusion faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support pruning on string columns using starts_with
6 participants