Missing boundaries to separate keywords from other strings #64

Open
farhan443 opened this issue Nov 8, 2021 · 0 comments
Labels
bug Something isn't working difficulty: easy

Comments

@farhan443
Contributor

farhan443 commented Nov 8, 2021

Many regex patterns across many languages are missing boundaries that separate keywords from the surrounding text, which means a keyword can be matched even when it sits inside another word.

Example:

Python's regex that matches the class keyword:

/class\s*\w+(\(\s*\w+\s*\))?\s*:/

It can match:

  • def upper-class name(param):
  • subclass name(param):
  • classroom1(3):
  • classmate__(_):
  • classic(a):

None of these are class declarations, but they still match because the regex only checks whether the line contains "class" and does not check whether the keyword is surrounded by other letters.
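The false positives can be reproduced with a short script. This is a minimal sketch using Python's `re` module; the pattern is the one quoted above, while the name `CLASS_RE` and the sample list are only illustrative.

```python
import re

# Pattern quoted in this issue, with no boundary around the keyword.
CLASS_RE = re.compile(r"class\s*\w+(\(\s*\w+\s*\))?\s*:")

# None of these lines is a class declaration.
samples = [
    "subclass name(param):",
    "classroom1(3):",
    "classmate__(_):",
    "classic(a):",
]

# search() looks for the pattern anywhere in the line, so "class"
# embedded in a longer word is enough to trigger a match.
false_positives = [line for line in samples if CLASS_RE.search(line)]

for line in false_positives:
    print("matched:", line)
```

Every sample ends up in `false_positives`, because each one contains "class" followed by word characters and a colon.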

A simple solution would be to surround the keywords with \b. This prevents them from being matched when they are adjacent to other word characters ( [A-Za-z0-9_] ). However, they will still be matched when they are next to punctuation.
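A rough sketch of the \b variant, assuming the same Python pattern as above with the keyword wrapped in word boundaries. The name `BOUNDED_RE` and the sample lines are hypothetical and only serve to show which inputs are rejected and which still slip through.

```python
import re

# Same pattern as before, with \b on both sides of the keyword.
BOUNDED_RE = re.compile(r"\bclass\b\s*\w+(\(\s*\w+\s*\))?\s*:")

# "class" is glued to other word characters: \b rejects these.
rejected = ["subclass name(param):", "classroom1(3):", "classic(a):"]

# "class" is preceded by punctuation: \b still sees a boundary there.
still_matched = ["upper-class name(param):", "first.class name(param):"]

results = {
    line: bool(BOUNDED_RE.search(line))
    for line in rejected + still_matched
}

for line, matched in results.items():
    print(f"{line!r}: {'match' if matched else 'no match'}")
```

The lines in `rejected` no longer match, while the lines in `still_matched` do, since a hyphen or a dot is not a word character and therefore counts as a boundary.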

This may or may not be a problem, depending on the language and the punctuation. In JavaScript, any statement can be preceded by a semicolon, because semicolons are used to terminate statements. That might not be the case in other languages.

Another fairly common solution is to surround the keywords with \s. This ensures that they can only be surrounded by whitespace. It introduces another problem, though: they can no longer be matched at the start or the end of a line.
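The start-of-line and end-of-line failure can be shown with two small patterns. This is an illustrative sketch only: `CLASS_RE` and `RETURN_RE` are hypothetical keyword patterns wrapped in \s, not patterns taken from the project, and each string is treated as a single line.

```python
import re

# Hypothetical patterns: each keyword wrapped in \s on both sides.
CLASS_RE = re.compile(r"\sclass\s")
RETURN_RE = re.compile(r"\sreturn\s")

checks = {
    "    class Foo(Bar):": CLASS_RE,  # indented: whitespace on both sides
    "class Foo(Bar):": CLASS_RE,      # keyword at the start of the line
    "    return": RETURN_RE,          # keyword at the end of the line
}

results = {line: bool(rx.search(line)) for line, rx in checks.items()}

for line, matched in results.items():
    print(f"{line!r}: {'match' if matched else 'no match'}")
```

Only the indented declaration matches. The other two are valid uses of the keyword, but \s needs an actual whitespace character to consume, and there is none before the first character or after the last character of the line.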

An optimal solution would be to use an alternation and a custom character set to define the possible separators manually, e.g. (^|[\s;,]). While this would be effective, it could be harder to implement, because you have to know precisely which positions and characters can validly surround the keywords.
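A sketch of the alternation approach applied to the Python class pattern. The separator set `[\s;,]` is the one suggested above and is an assumption about what may precede the keyword; a real fix would need a set chosen per language. The non-capturing group `(?:...)` and the `\s+` after the keyword are also choices made for this example, and `SEPARATED_RE` is a hypothetical name.

```python
import re

# Leading separator: start of line, whitespace, ';' or ','.
# \s+ after the keyword requires at least one whitespace character
# before the class name (an assumption made for this sketch).
SEPARATED_RE = re.compile(r"(?:^|[\s;,])class\s+\w+(\(\s*\w+\s*\))?\s*:")

accepted = [
    "class Foo(Bar):",     # start of line
    "    class Foo:",      # indented
    "x = 1;class Foo:",    # directly after a semicolon
]

rejected = [
    "upper-class name(param):",
    "subclass name(param):",
    "classroom1(3):",
    "classic(a):",
]

results = {
    line: bool(SEPARATED_RE.search(line))
    for line in accepted + rejected
}

for line, matched in results.items():
    print(f"{line!r}: {'match' if matched else 'no match'}")
```

With these samples, the lines in `accepted` match and the lines in `rejected` do not. The trade-off is the one described above: the pattern is only as good as the list of separators, so a language that allows other characters before a keyword would need them added to the set.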

Projects
Status: Todo
Development

No branches or pull requests

2 participants