Allow non ascii characters in regexes #62

Merged: vthib merged 6 commits into master from fix-regex-unicode-parsing on Aug 1, 2023

vthib
Owner

@vthib vthib commented Aug 1, 2023

This follows the revert of behavior in YARA.

See VirusTotal/yara#1770 (comment)

Since YARA reverted the change due to existing rules containing such characters, we also want
to do the same. This is however not entirely trivial, for two reasons:

  • The parsing library works on str, and changing it to work on &[u8] for the regex parsing is not easy, and I'm not even sure it is desirable.

  • I wanted to add a new warning for such characters, as they can lead to unexpected failures to match. This required some rework to be able to generate warnings in code that previously could not.
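
To illustrate the kind of surprise such a warning is about, here is a minimal sketch (not code from boreal; `utf8_bytes` and `contains_bytes` are hypothetical helpers). A non-ASCII character in a Rust `str` is stored as several UTF-8 bytes, so a pattern containing it looks for those bytes, and it will not find the single byte the same character occupies in, for instance, a Latin-1 encoded file.

```rust
/// Returns the UTF-8 bytes that a character occupies inside a `str`.
/// (Hypothetical helper, for illustration only.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Naive byte-level substring search, standing in for a literal regex match.
/// (Hypothetical helper, for illustration only.)
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows` panics on a zero size, so an empty needle is rejected first.
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}
```

With these helpers, U+00E9 encodes to the two bytes 0xC3 0xA9. Searching for them succeeds in UTF-8 encoded data but fails in data where the character is stored as the single Latin-1 byte 0xE9.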

This is now fixed properly:

  • The parser now allows non-ASCII chars (except in classes, which is also an error in YARA), and passes them on in a new AST type.

  • When we convert the regex AST to our internal HIR in boreal, we:

    • generate warnings when the new type is found
    • properly generate the right HIR (which means some ugly code when handling the "" case, which only applies the repetition to the last byte of the char).
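
The conversion described above can be sketched roughly as follows. This is a simplified illustration, not boreal's actual code: the `Ast` and `Hir` enums, the `lower` function and the warning text are all hypothetical, and `+` stands in for any repetition operator. A non-ASCII char is expanded into its UTF-8 bytes, a warning is recorded, and a repetition wrapping such a char is applied to the last byte only.

```rust
/// Hypothetical, simplified regex AST (not boreal's real types).
#[derive(Debug, Clone, PartialEq)]
enum Ast {
    /// An ASCII literal byte.
    Byte(u8),
    /// A non-ASCII character, kept as a `char` by the parser.
    Char(char),
    /// One-or-more repetition of the inner node.
    Plus(Box<Ast>),
    Concat(Vec<Ast>),
}

/// Hypothetical, simplified byte-oriented HIR.
#[derive(Debug, Clone, PartialEq)]
enum Hir {
    Byte(u8),
    Plus(Box<Hir>),
    Concat(Vec<Hir>),
}

/// UTF-8 bytes of a character.
fn char_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Lowers the AST to the HIR, collecting warnings for non-ASCII chars.
fn lower(ast: &Ast, warnings: &mut Vec<String>) -> Hir {
    match ast {
        Ast::Byte(b) => Hir::Byte(*b),
        Ast::Char(c) => {
            warnings.push(format!("non-ASCII character {c:?} in regex"));
            Hir::Concat(char_bytes(*c).into_iter().map(Hir::Byte).collect())
        }
        Ast::Plus(inner) => match &**inner {
            // Special case: the repetition only applies to the last byte.
            Ast::Char(c) => {
                warnings.push(format!("non-ASCII character {c:?} in regex"));
                let mut bytes = char_bytes(*c);
                let last = bytes.pop().expect("a char encodes to at least one byte");
                let mut nodes: Vec<Hir> = bytes.into_iter().map(Hir::Byte).collect();
                nodes.push(Hir::Plus(Box::new(Hir::Byte(last))));
                Hir::Concat(nodes)
            }
            other => Hir::Plus(Box::new(lower(other, warnings))),
        },
        Ast::Concat(nodes) => {
            Hir::Concat(nodes.iter().map(|n| lower(n, warnings)).collect())
        }
    }
}
```

Under these assumptions, lowering a repeated U+00E9 yields the byte 0xC3 followed by a repetition of 0xA9 alone, plus one warning. Special-casing the repetition in the lowering step is what keeps the byte-level result in line with the behavior described for YARA.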

However, at the cost of some slightly ugly code in the AST->HIR conversion, we get the same behavior as YARA, and user-friendly warnings in such cases. Overall, a nice change :)

@vthib vthib force-pushed the fix-regex-unicode-parsing branch from 94effae to 11fd279 Compare August 1, 2023 20:21
@vthib vthib force-pushed the fix-regex-unicode-parsing branch from 11fd279 to 47b466d Compare August 1, 2023 20:41
@vthib vthib merged commit adb1dc4 into master Aug 1, 2023
@vthib vthib deleted the fix-regex-unicode-parsing branch August 1, 2023 21:45