[BUG] regexp: \d, \w inconsistencies with non-latin unicode input #5530

Closed
andygrove opened this issue May 18, 2022 · 1 comment · Fixed by #5541
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

andygrove commented May 18, 2022

Describe the bug

Updating our existing fuzz tests to use the full range of Unicode characters for input data has exposed some issues with \d and \w. Everything seems to work fine so far for the uppercase versions, \D and \W.

javaPattern=\w, cudfPattern=\w, input='鈻瑜㶯眀', cpu=false, gpu=true
javaPattern=..\d, cudfPattern=[^\n\r\u0085\u2028\u2029][^\n\r\u0085\u2028\u2029]\d, input='䤫畍킱곂⬡❽ࢅ獰᳌蛫青', cpu=false, gpu=true
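For reference, the CPU side of these results can be reproduced with plain `java.util.regex`. By default, Java's `\w` and `\d` are ASCII-only, which is why `cpu=false` for the inputs above; they become Unicode-aware only when `Pattern.UNICODE_CHARACTER_CLASS` is set. The sketch below is illustrative and is not taken from the fuzz tests: the class name `UnicodeClassCheck` and the helper `find` are hypothetical, and the sample characters (three CJK ideographs and one Arabic-Indic digit) are stand-ins written as escapes rather than the exact fuzz inputs. It says nothing about how cuDF classifies characters internally.

```java
import java.util.regex.Pattern;

public class UnicodeClassCheck {
    // Hypothetical helper: true if the regex is found anywhere in the input.
    public static boolean find(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Three CJK ideographs (U+923B, U+745C, U+7700), standing in for the first fuzz input.
        String cjk = "\u923B\u745C\u7700";
        // ARABIC-INDIC DIGIT THREE (U+0663), a non-ASCII decimal digit.
        String arabicDigit = "\u0663";

        // Default flags: \w is [a-zA-Z_0-9] and \d is [0-9].
        System.out.println(find("\\w", cjk, 0));          // false
        System.out.println(find("\\d", arabicDigit, 0));  // false

        // With UNICODE_CHARACTER_CLASS, both classes follow Unicode properties.
        System.out.println(find("\\w", cjk, Pattern.UNICODE_CHARACTER_CLASS));          // true
        System.out.println(find("\\d", arabicDigit, Pattern.UNICODE_CHARACTER_CLASS));  // true
    }
}
```

The second pair of calls shows Java matching these inputs once Unicode classes are enabled, which is the direction of the mismatch reported above.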

Steps/Code to reproduce bug
See above.

Expected behavior
Behavior should be consistent between CPU and GPU or we should fall back to CPU.

Environment details (please complete the following information)
Failed in CI.

Additional context
None

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels May 18, 2022
@andygrove andygrove added this to the May 2 - May 20 milestone May 18, 2022
@andygrove andygrove self-assigned this May 18, 2022
@andygrove andygrove changed the title [BUG] \s, \d, \w inconsitencies with non-latin unicode input [BUG] \s, \d, \w inconsistencies with non-latin unicode input May 18, 2022
@andygrove andygrove changed the title [BUG] \s, \d, \w inconsistencies with non-latin unicode input [BUG] regexp: \s, \d, \w inconsistencies with non-latin unicode input May 18, 2022
@andygrove andygrove changed the title [BUG] regexp: \s, \d, \w inconsistencies with non-latin unicode input [BUG] regexp: \d, \w inconsistencies with non-latin unicode input May 18, 2022
@andygrove
Contributor Author

Related cuDF issue: rapidsai/cudf#10894

However, we can work around this particular bug by just transpiling \d to [0-9].
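A minimal sketch of what such a rewrite could look like, in the same spirit as the `.` to `[^\n\r\u0085\u2028\u2029]` rewrite visible in the `cudfPattern` above. This is not the plugin's transpiler: the class `DigitClassRewrite` and the method `rewriteDigits` are hypothetical names, and the sketch assumes a simple pattern with no `\Q...\E` quoting and no class intersections. Inside a character class it emits `0-9` without brackets so that `[\d_]` becomes `[0-9_]`.

```java
public class DigitClassRewrite {
    // Hypothetical helper: replaces \d with an explicit ASCII digit range.
    public static String rewriteDigits(String pattern) {
        StringBuilder out = new StringBuilder();
        int classDepth = 0; // > 0 while inside [...]
        int i = 0;
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                char next = pattern.charAt(i + 1);
                if (next == 'd') {
                    // Inside a class the surrounding brackets already exist.
                    out.append(classDepth > 0 ? "0-9" : "[0-9]");
                } else {
                    // Copy any other escape (including an escaped backslash) unchanged.
                    out.append(c).append(next);
                }
                i += 2;
                continue;
            }
            if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewriteDigits("..\\d"));   // ..[0-9]
        System.out.println(rewriteDigits("[\\d_]"));  // [0-9_]
    }
}
```

Because Java's default `\d` is already `[0-9]`, the rewritten pattern keeps the CPU semantics while removing any dependence on how the GPU engine defines a digit.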

@andygrove andygrove added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label May 19, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 28, 2022