Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable anchors in regexp string split #6040

Merged

Conversation

anthony-chang
Copy link
Contributor

Closes #4719

This took some time to grok but it appears like Java's implementation for \z, $, and ^ in specifically string split is inconsistent with it's own implementation in find/replace. \z, $, and ^ do not respect new lines/line terminators in string split. Conveniently, \Z and \A in cuDF also do not match line terminators so we can transpile to that.

I also needed to disable repetitions of \A in split because of some fuzzer failures. I think it's possible to support it but I'm not sure how and it isn't a priority.

Signed-off-by: Anthony Chang antchang@nvidia.com

Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

@anthony-chang anthony-chang self-assigned this Jul 20, 2022
@sameerz sameerz added the feature request New feature or request label Jul 21, 2022
Copy link
Contributor

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but I'm not an expert on the regexp transpile corner cases.

@sameerz sameerz requested review from andygrove and NVnavkumar July 27, 2022 06:02
Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good to me. We could add more tests for \A, \z and \Z but I don't think it's a high priority at the moment.

Clever solution using cuDF \A and \Z, btw!

@anthony-chang anthony-chang merged commit 3aceabc into NVIDIA:branch-22.08 Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] GpuStringSplit: Add support for line and string anchors in regular expressions
4 participants