Unicode characters support in regex #233

Open: wants to merge 1 commit into base: develop
kunchtler

Linked to #232.

Changed the order in which tokens are registered in the regex lexer so that the rule recognizing letters is processed last, and changed that rule to match any non-whitespace character (`\S` in Python's re library).
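To illustrate why the registration order matters, here is a minimal sketch of an ordered tokenizer. It is not the project's lexer: the rule names, `TOKEN_RULES`, and `tokenize` are hypothetical, and only Python's standard `re` module is assumed. Rules are tried top to bottom, so a catch-all `\S` rule has to come last or it would swallow the operator characters too.

```python
import re
from typing import List, Tuple

# Hypothetical token rules, tried in order (names are illustrative only).
TOKEN_RULES = [
    ("LPAREN", re.compile(r"\(")),
    ("RPAREN", re.compile(r"\)")),
    ("UNION", re.compile(r"\|")),
    ("STAR", re.compile(r"\*")),
    # Catch-all goes last so it cannot shadow the operator rules above.
    # For str patterns, \S matches any non-whitespace Unicode character.
    ("LITERAL", re.compile(r"\S")),
]


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Split text into (rule name, matched text) pairs using the first matching rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No rule matched; with these rules that only happens for whitespace.
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens
```

If the `LITERAL` rule were moved to the top of the list, every non-whitespace character, including `(`, `|` and `*`, would be tokenized as a literal.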

Added a test to check for the support of non-ascii characters.

This is my very first pull request ever so feel free to guide me.

@coveralls

Coverage Status

coverage: 99.613%. remained the same
when pulling c7b94b1 on kunchtler:unicode-regex
into 9ab1a1c on caleb531:develop.

@eliotwrobson eliotwrobson linked an issue Jul 27, 2024 that may be closed by this pull request
Collaborator

@eliotwrobson eliotwrobson left a comment


@kunchtler thanks for this! One request to make this test a little more robust, but overall I think the change looks good.

def test_validate_unicode_characters(self) -> None:
    """Should pass validation for regular expressions with unicode characters."""
    re.validate("(µ|🤖ù)*")

Collaborator


We should also add a test that an NFA converted from this regex has the expected set of input symbols.
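One way to pin down "the expected set of input symbols" for such a test is sketched below. This is only an illustration under stated assumptions: `RESERVED` and `expected_input_symbols` are hypothetical names, and the reserved set covers only the operators that appear in the regex above, not the library's full syntax.

```python
# Assumption: characters treated as regex operators rather than input symbols.
# Only the operators used in the test regex are listed here.
RESERVED = set("()|*")


def expected_input_symbols(regex: str) -> set:
    """Collect every non-whitespace, non-operator character of the regex."""
    return {ch for ch in regex if not ch.isspace() and ch not in RESERVED}


# In the real test, this set would be compared with the input symbols of the
# NFA built from the same regex (for example, an `input_symbols` attribute on
# the converted NFA, assuming the library exposes one). Not executed here.
symbols = expected_input_symbols("(µ|🤖ù)*")
```

Deriving the expectation from the literal characters keeps the test independent of how the lexer tokenizes them.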

Successfully merging this pull request may close these issues.

Unicode characters with regexp ?
3 participants