Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Unicode Character Classes in Brackets #71

Merged
merged 1 commit into from
Nov 10, 2023

Conversation

raskad
Copy link
Collaborator

@raskad raskad commented Nov 4, 2023

This PR implements Unicode Character Classes in Brackets.

Copy link
Owner

@ridiculousfish ridiculousfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good, thank you! However rather than tracking the unicode properties as an extra field(s) in BracketContents, we should instead replace the character class with its associated ranges. That goes for both CharacterClass and UnicodePropertyEscape. That will both speed up and simplify matching.

Probably the place to do it is immediately, during parse(), so that we don't have to carry around the character classes in the IR.

You can make that change as part of this, or we can defer it and do it later, and you can land what you have.

@raskad
Copy link
Collaborator Author

raskad commented Nov 10, 2023

This change looks good, thank you! However rather than tracking the unicode properties as an extra field(s) in BracketContents, we should instead replace the character class with its associated ranges. That goes for both CharacterClass and UnicodePropertyEscape. That will both speed up and simplify matching.

Probably the place to do it is immediately, during parse(), so that we don't have to carry around the character classes in the IR.

Very nice idea. That would probably also enable some range optimizations in some cases. My only concern would be that that would copy the static range tables and in some cases potentially increase the size of the compiled regex a lot.

I will test that out on a different branch :)

@raskad raskad merged commit df9730f into master Nov 10, 2023
7 checks passed
@raskad raskad deleted the character_class_brackets branch November 10, 2023 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants