-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fall back to CPU for unsupported regular expression edge cases with end of line/string anchors and newlines #5610
Fall back to CPU for unsupported regular expression edge cases with end of line/string anchors and newlines #5610
Conversation
Signed-off-by: Andy Grove <andygrove@nvidia.com>
4063aaf
to
79883d4
Compare
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Outdated
Show resolved
Hide resolved
// maybe these are ok because the $ is at the end of the pattern? | ||
// "\r$", "\f$", "\u0085$", "\u2028$", "\u2029$", "\n$", "\r\n$", "[\r\n]?$" | ||
// not sure about these ... | ||
// "$\r", "$\f", "\\00*[D$3]$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NVnavkumar handled the $\r
case in #5289, which transpiles to \r(?:[\n\u0085\u2028\u2029])?$
, and similarly for \u0085
, \u2028
, \u2029
. The other two don't have line terminators so I don't think there is a problem
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Outdated
Show resolved
Hide resolved
…scala Co-authored-by: Anthony Chang <54450499+anthony-chang@users.noreply.github.com>
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionTranspilerSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Anthony Chang <antchang@nvidia.com>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Outdated
Show resolved
Hide resolved
…-case' into handle-regexp-edge-cases
build |
build |
build |
build |
build |
@anthony-chang this is passing now but we'll need a follow-on issue to add the characters that I removed from the fuzzer to make this pass in CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anthony-chang this is passing now but we'll need a follow-on issue to add the characters that I removed from the fuzzer to make this pass in CI
Created follow-up issue: #5711
// these would get transpiled to negated character classes | ||
// that include newlines | ||
true | ||
case RegexCharacterClass(true, _) => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there is an edge case here that can prevent this from being a newline, and that being a negated character class that includes all new line characters. I see no need to handle this at the moment, but should be noted in case this potentially comes up in some testing scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@NVnavkumar @anthony-chang I will retarget this to 22.08 |
The base branch was changed.
build |
…nd of line/string anchors and newlines (NVIDIA#5610)
Closes #5525
Through expanded fuzz testing, we recently discovered some edge cases that produce different results between CPU and GPU for regular expressions patterns with an end of line anchor
$
immediately next to a newline, begin-of-line anchor^
or repetition that could produce empty results.New code is added in this PR that checks for these unsupported patterns so that we can fall back to CPU for these cases.
Note: The checks for these edge cases are very broad and result in some regression and false positives which are outlined here #5659