Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: scalar regex match physical expr #12270

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

zhuliquan
Copy link
Contributor

Which issue does this PR close?

Closes #11146.

Rationale for this change

This PR is successor of PR #11455

BinaryExpr will compile literal regex pattern when it evaluating RecordBatch every time, Sometime, the time of compiling regex pattern is also expensive. In our approach, literal regex pattern will be compiled once and cached to be reused in execution. It's will save compile time of pre execution and speed up execution.

What changes are included in this PR?

  1. Introducing a new physical expr ScalarRegexMatchExpr to handle regexp match with literal regrex pattern.
  2. Introducing a message PhysicalScalarRegexMatchExprNode in proto to handle ScalarRegexMatchExpr and add arm in func parse_physical_expr and serialize_physical_expr.
  3. Changing BinaryExpr arm in create_physical_expr. Creating ScalarRegexMatchExpr instead of BinaryExpr when Rhs is string literal expr and op is RegexMatch | RegexIMatch | RegexNotMatch | RegexNotIMatch.

Are these changes tested?

Yes, test mod in scalar_regex_match.rs

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate proto Related to proto crate labels Aug 31, 2024
@alamb
Copy link
Contributor

alamb commented Sep 7, 2024

Thank you for this PR @zhuliquan . Have you run any benchmarks that show this approach is noticeably faster than the existing approach? It makes sense that it would be faster as it does not re-compile the regular expression for each batch, but I think it would help to quantify this difference

@zhuliquan
Copy link
Contributor Author

zhuliquan commented Sep 17, 2024

Thank you for this PR @zhuliquan . Have you run any benchmarks that show this approach is noticeably faster than the existing approach? It makes sense that it would be faster as it does not re-compile the regular expression for each batch, but I think it would help to quantify this difference

yeah add benchmarks scalar_regex_match.rs which includes scalar_regex_match / binary_regex_match two type cases, these cases are varied from record_batch size and record_batch iterator time and regexp pattern. According to below benchmarks. almost scalar_regex_match is better than binary_expr in preformance of running time, and are even orders of magnitude different.

regex_email_rows_2000_run_time_10/binary_expr_match
                        time:   [4.2756 ms 4.3114 ms 4.3760 ms]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
regex_email_rows_2000_run_time_10/scalar_regex_match
                        time:   [3.9034 ms 3.9070 ms 3.9114 ms]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

Benchmarking regex_url_rows_2000_run_time_10/binary_expr_match: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.3s, enable flat sampling, or reduce sample count to 60.
regex_url_rows_2000_run_time_10/binary_expr_match
                        time:   [1.0114 ms 1.0134 ms 1.0159 ms]
Found 19 outliers among 100 measurements (19.00%)
  5 (5.00%) high mild
  14 (14.00%) high severe
regex_url_rows_2000_run_time_10/scalar_regex_match
                        time:   [475.42 µs 475.88 µs 476.42 µs]
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

regex_ip_rows_2000_run_time_10/binary_expr_match
                        time:   [718.71 µs 719.81 µs 721.22 µs]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
regex_ip_rows_2000_run_time_10/scalar_regex_match
                        time:   [128.02 µs 132.10 µs 136.87 µs]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

regex_phone_rows_2000_run_time_10/binary_expr_match
                        time:   [5.7308 ms 5.8108 ms 5.9094 ms]
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe
regex_phone_rows_2000_run_time_10/scalar_regex_match
                        time:   [134.89 µs 139.91 µs 146.52 µs]
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe

regex_zip_code_rows_2000_run_time_10/binary_expr_match
                        time:   [4.1648 ms 4.1782 ms 4.1936 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
regex_zip_code_rows_2000_run_time_10/scalar_regex_match
                        time:   [126.49 µs 127.81 µs 129.32 µs]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) high mild
  11 (11.00%) high severe

@zhuliquan zhuliquan force-pushed the feature-scalar_regexp_match_expr branch from d49edca to e9fc6c7 Compare September 17, 2024 17:03
@zhuliquan zhuliquan force-pushed the feature-scalar_regexp_match_expr branch from 9f02ab6 to f1a81a7 Compare October 29, 2024 16:51
@zhuliquan zhuliquan force-pushed the feature-scalar_regexp_match_expr branch from f1a81a7 to 493a47a Compare November 9, 2024 06:52
@github-actions github-actions bot removed the core Core DataFusion crate label Nov 9, 2024
@zhuliquan zhuliquan force-pushed the feature-scalar_regexp_match_expr branch 2 times, most recently from ed8688d to 62f86a5 Compare November 9, 2024 17:35
@zhuliquan zhuliquan force-pushed the feature-scalar_regexp_match_expr branch from b794f95 to 22b5297 Compare December 3, 2024 15:42
@zhuliquan
Copy link
Contributor Author

zhuliquan commented Dec 3, 2024

Thank you for this PR @zhuliquan . Have you run any benchmarks that show this approach is noticeably faster than the existing approach? It makes sense that it would be faster as it does not re-compile the regular expression for each batch, but I think it would help to quantify this difference

Hello @alamb, I have compared my approach to original binary_expr in many regex cases. Do I need to make explanations for benchmark scalar_regex_match?

@Dandandan
Copy link
Contributor

I wonder if we can see improvements on queries in benchmarks with scalar regexes, e.g. clickbench?

@zhuliquan
Copy link
Contributor Author

I wonder if we can see improvements on queries in benchmarks with scalar regexes, e.g. clickbench?

Emm, It means that we should add some regex matching queries in benchmarks first.

@zhuliquan zhuliquan force-pushed the feature-scalar_regexp_match_expr branch from 5f166ec to 0de7a4f Compare December 4, 2024 16:07
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait common Related to common crate execution Related to the execution crate functions labels Dec 6, 2024
@zhuliquan zhuliquan closed this Dec 6, 2024
tlm365 and others added 11 commits December 15, 2024 23:13
* Optimize performance of  function

Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>

* Add pre-check array is null

* Fix clippy warnings

---------

Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version.
- [Release notes](https://github.com/tokio-rs/prost/releases)
- [Changelog](https://github.com/tokio-rs/prost/blob/master/CHANGELOG.md)
- [Commits](tokio-rs/prost@v0.13.3...v0.13.4)

---
updated-dependencies:
- dependency-name: prost-build
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Minor: Output elapsed time for sql logic test
…rojectionMapping` and `EquivalenceGroup` (apache#13675)

* refactor: replace Vec with IndexMap for expression mappings in ProjectionMapping and EquivalenceGroup

* chore

* chore: Fix CI

* chore: comment

* chore: simplify
* fix: Fix parse_sql_expr not handling alias

* cargo fmt

* fix parse_sql_expr example(remove alias)

* add testing

* add SUM udaf to TestContextProvider and modify test_sql_to_expr_with_alias for function

* revert change on example `parse_sql_expr`
apache#13730)

Debug trait is useful for understanding what something is and how it's
configured, especially if the implementation is behind dyn trait.
…13660)

* add `unnest_as_table_factor` and `UnnestRelationBuilder`

* unparse unnest as table factor

* fix typo

* add tests for the default configs

* add a static const for unnest_placeholder

* fix tests

* fix tests
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) catalog Related to the catalog crate functions labels Dec 15, 2024
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) catalog Related to the catalog crate functions labels Dec 15, 2024
@zhuliquan
Copy link
Contributor Author

I wonder if we can see improvements on queries in benchmarks with scalar regexes, e.g. clickbench?

hello, @Dandandan I have write a benchmark for testing scalar regex match in PR #13789. I got below diff (before: without pre-compiled pattern, after: with pre-compiled pattern)

test email address pattern
                        time:   [14.047 ms 14.155 ms 14.264 ms]
                        change: [+37.888% +41.826% +45.676%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

test ip pattern         time:   [3.5913 ms 3.6025 ms 3.6139 ms]
                        change: [-45.614% -45.332% -45.065%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

test phone number pattern
                        time:   [12.893 ms 13.067 ms 13.303 ms]
                        change: [-51.353% -50.433% -49.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild

test html tag pattern   time:   [13.158 ms 13.491 ms 13.865 ms]
                        change: [+26.127% +29.636% +33.599%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

test url pattern        time:   [12.467 ms 12.594 ms 12.726 ms]
                        change: [+38.072% +42.490% +45.651%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
test date pattern       time:   [12.429 ms 12.523 ms 12.629 ms]
                        change: [-37.932% -37.049% -36.163%] (p = 0.00 < 0.05)
                        Performance has improved.

I'am very confused that some cases have improved and others have regressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions proto Related to proto crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is pre-compile pattern string in regexp_match operation