Allow non ascii characters in regexes #62
Merged
This follows the revert of this behavior in YARA.
See VirusTotal/yara#1770 (comment)
Since YARA reverted the change due to existing rules containing such characters, we also want
to do the same. This is however not entirely trivial, for two reasons:
- The parsing library works on `str`, and changing it to work on `&[u8]` for regex parsing is not easy; I'm not even sure it is desirable.
- I wanted to add a new warning for such characters, as they can lead to unexpected failures to match. This required some rework to be able to generate warnings in code that previously could not.
This is now fixed properly:
The parser now allows non-ASCII chars (except in classes, where YARA also reports an error), and passes them on in a new AST node type.
When we convert the regex AST to our internal HIR in boreal, we:
However, at the cost of some slightly ugly code in the AST->HIR conversion, we get the same behavior as YARA, and user-friendly warnings in such cases. Overall, a nice change :)
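As a rough sketch of what such an AST->HIR conversion could look like (this is not the actual boreal code; `Ast`, `Hir`, `Warning` and `ast_to_hir` are hypothetical names). It assumes a byte-oriented interpretation, in which a non-ASCII character is expanded to its UTF-8 bytes and a quantifier applies only to the last of those bytes:

```rust
/// Hypothetical AST, as a `str`-based parser might produce it.
#[derive(Debug, Clone, PartialEq)]
enum Ast {
    /// An ASCII literal byte.
    Literal(u8),
    /// A non-ASCII character, with its offset in the regex source.
    Char { c: char, offset: usize },
    /// `node+`
    Plus(Box<Ast>),
    Concat(Vec<Ast>),
}

/// Hypothetical byte-oriented HIR.
#[derive(Debug, Clone, PartialEq)]
enum Hir {
    Literal(u8),
    Plus(Box<Hir>),
    Concat(Vec<Hir>),
}

/// Hypothetical warning emitted for each non-ASCII character.
#[derive(Debug, Clone, PartialEq)]
struct Warning {
    c: char,
    offset: usize,
}

fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn ast_to_hir(node: &Ast, warnings: &mut Vec<Warning>) -> Hir {
    match node {
        Ast::Literal(b) => Hir::Literal(*b),
        // A bare non-ASCII char becomes the sequence of its UTF-8 bytes.
        Ast::Char { c, offset } => {
            warnings.push(Warning { c: *c, offset: *offset });
            Hir::Concat(utf8_bytes(*c).into_iter().map(Hir::Literal).collect())
        }
        Ast::Plus(inner) => match inner.as_ref() {
            // Assumption: the quantifier binds only to the last byte.
            Ast::Char { c, offset } => {
                warnings.push(Warning { c: *c, offset: *offset });
                let mut bytes = utf8_bytes(*c);
                let last = bytes.pop().expect("UTF-8 encoding is never empty");
                let mut parts: Vec<Hir> = bytes.into_iter().map(Hir::Literal).collect();
                parts.push(Hir::Plus(Box::new(Hir::Literal(last))));
                Hir::Concat(parts)
            }
            other => Hir::Plus(Box::new(ast_to_hir(other, warnings))),
        },
        Ast::Concat(nodes) => {
            Hir::Concat(nodes.iter().map(|n| ast_to_hir(n, warnings)).collect())
        }
    }
}
```

Under these assumptions, `é+` lowers to the byte `0xC3` followed by one or more `0xA9`, rather than one or more full `é`. This is the kind of surprising matching behavior that the warning is meant to flag.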