Fall back to CPU for unsupported regular expression edge cases with end of line/string anchors and newlines #5610

Merged: 34 commits into NVIDIA:branch-22.08 on Jun 3, 2022

@andygrove (Contributor) commented May 24, 2022

Closes #5525

Through expanded fuzz testing, we recently discovered edge cases where the CPU and GPU produce different results for regular expression patterns in which an end-of-line anchor $ sits immediately next to a newline, a begin-of-line anchor ^, or a repetition that could produce an empty match.

This PR adds checks for these unsupported patterns so that we fall back to CPU in these cases.

  • Implement checks
  • Update tests
  • Update compatibility guide

Note: The checks for these edge cases are very broad, so they cause some regressions and false positives, which are outlined in #5659.
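
For readers unfamiliar with the shape of such a check, here is a minimal sketch in Scala. It is not the code from this PR: the `Node` types, `FallbackCheck`, and `needsCpuFallback` are hypothetical names invented for illustration, and the rule is deliberately broad, assuming only the three neighbour kinds named in the description.

```scala
// Hypothetical, simplified regex AST (NOT the plugin's real classes).
sealed trait Node
case object EndAnchor extends Node                              // $
case object StartAnchor extends Node                            // ^
final case class Literal(ch: Char) extends Node                 // a single character
final case class NegatedClass(excluded: Set[Char]) extends Node // [^...]
final case class Repeat(term: Node, min: Int) extends Node      // term repeated at least `min` times

object FallbackCheck {
  // Line terminators recognised by java.util.regex.
  private val lineTerminators: Set[Char] =
    Set('\n', '\r', 0x85.toChar, 0x2028.toChar, 0x2029.toChar)

  // True if the node might consume a line terminator.
  def canMatchNewline(n: Node): Boolean = n match {
    case Literal(ch)     => lineTerminators.contains(ch)
    case NegatedClass(_) => true // broad on purpose: assume any negated class can match a newline
    case Repeat(term, _) => canMatchNewline(term)
    case _               => false
  }

  // True if the node can match without consuming any input.
  def canBeEmpty(n: Node): Boolean = n match {
    case Repeat(_, min) => min == 0
    case _              => false
  }

  private def risky(n: Node): Boolean =
    n == StartAnchor || canMatchNewline(n) || canBeEmpty(n)

  // Fall back to CPU if `$` sits directly before or after a risky neighbour.
  def needsCpuFallback(pattern: Seq[Node]): Boolean =
    pattern.sliding(2).exists {
      case Seq(EndAnchor, next) => risky(next)
      case Seq(prev, EndAnchor) => risky(prev)
      case _                    => false
    }
}
```

With this sketch, `needsCpuFallback(Seq(EndAnchor, Literal('\n')))` is true, while `Seq(Literal('a'), EndAnchor)` is not flagged. It also flags `\n$`, which shows how a rule this broad can over-trigger.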

Signed-off-by: Andy Grove <andygrove@nvidia.com>
// maybe these are ok because the $ is at the end of the pattern?
// "\r$", "\f$", "\u0085$", "\u2028$", "\u2029$", "\n$", "\r\n$", "[\r\n]?$"
// not sure about these ...
// "$\r", "$\f", "\\00*[D$3]$"

@NVnavkumar handled the $\r case in #5289, which transpiles to \r(?:[\n\u0085\u2028\u2029])?$, and similarly for \u0085, \u2028, \u2029. The other two don't have line terminators, so I don't think there is a problem.
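
To make the CPU side of this discussion concrete, here is a small Scala sketch using `java.util.regex`, which Spark uses on the CPU. The `DollarAnchorDemo` object and `cpuFind` helper are names made up for this example; nothing here exercises the GPU or the transpiler.

```scala
import java.util.regex.Pattern

object DollarAnchorDemo {
  // Hypothetical helper: does `regex` match anywhere in `input` on the CPU?
  def cpuFind(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()

  def main(args: Array[String]): Unit = {
    // Without MULTILINE, `$` matches at the end of input
    // or just before a final line terminator.
    println(cpuFind("a$", "a\n"))    // true
    println(cpuFind("a$", "a\r\n"))  // true
    println(cpuFind("a$", "a\f"))    // false: form feed is not a line terminator
    // `$` can be followed by the terminator it stopped in front of.
    println(cpuFind("$\\r", "a\r"))  // true
  }
}
```

The last case is the behaviour that a rewrite of `$\r` has to reproduce on the GPU.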

…scala

Co-authored-by: Anthony Chang <54450499+anthony-chang@users.noreply.github.com>
@andygrove added this to the May 23 - Jun 3 milestone on May 25, 2022
@andygrove self-assigned this on May 25, 2022
@andygrove
Contributor Author

build

@andygrove
Contributor Author

@anthony-chang this is passing now but we'll need a follow-on issue to add the characters that I removed from the fuzzer to make this pass in CI

@anthony-chang previously approved these changes Jun 1, 2022

> @anthony-chang this is passing now but we'll need a follow-on issue to add the characters that I removed from the fuzzer to make this pass in CI

Created follow-up issue: #5711

// these would get transpiled to negated character classes
// that include newlines
true
case RegexCharacterClass(true, _) => true

nit: there is an edge case here where the class cannot match a newline: a negated character class that itself lists all newline characters. I see no need to handle this at the moment, but it should be noted in case it comes up in some testing scenarios.
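
The edge case can be illustrated with a short Scala sketch. `NegatedClassDemo` and `negatedClassCanMatchNewline` are hypothetical names, not part of the plugin, and the set of line terminators is the one `java.util.regex` documents.

```scala
import java.util.regex.Pattern

object NegatedClassDemo {
  // Line terminators as documented for java.util.regex.
  val lineTerminators: Set[Char] =
    Set('\n', '\r', 0x85.toChar, 0x2028.toChar, 0x2029.toChar)

  // A negated class [^...] can match a newline unless it
  // excludes every line terminator.
  def negatedClassCanMatchNewline(excluded: Set[Char]): Boolean =
    !lineTerminators.subsetOf(excluded)

  def main(args: Array[String]): Unit = {
    println(Pattern.matches("[^a]", "\n"))                // true: newline is "not a"
    println(Pattern.matches("[^\\n\\r]", "\n"))           // false: newline is excluded
    println(negatedClassCanMatchNewline(Set('a')))        // true
    println(negatedClassCanMatchNewline(lineTerminators)) // false
  }
}
```

A check along these lines would narrow the broad `RegexCharacterClass(true, _) => true` rule, at the cost of inspecting the contents of the class.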

@NVnavkumar previously approved these changes Jun 2, 2022

LGTM

@andygrove marked this pull request as draft on June 2, 2022 16:07
@andygrove
Contributor Author

@NVnavkumar @anthony-chang I will retarget this to 22.08

@andygrove changed the base branch from branch-22.06 to branch-22.08 on June 2, 2022 16:08
@andygrove dismissed stale reviews from NVnavkumar and anthony-chang on June 2, 2022 16:08

The base branch was changed.

@andygrove
Contributor Author

build

@andygrove marked this pull request as ready for review on June 2, 2022 16:36
@andygrove merged commit d81e501 into NVIDIA:branch-22.08 on Jun 3, 2022
@andygrove deleted the handle-regexp-edge-cases branch on June 3, 2022 16:13
HaoYang670 pushed a commit to HaoYang670/spark-rapids that referenced this pull request Jun 6, 2022
Labels: bug
Linked issue: [BUG] Investigate more edge cases in regexp support