Matching being too greedy #185
Comments
It looks like you should make use of lookaheads to get the behaviour you need; something like
I see. Make the two-token search for "' " into a single-token search by making it its own match. Ta a lot. I'll try it out.
Note that this is regular, so you could do this with just a regular expression:

```python
import re

pattern = re.compile(
    r"""
    (                      # start capture group
      (?<!\w)'             # Starting ', which should not be preceded by a word character.
                           # Uses negative lookbehind.
      (?:[^']|'(?=\w))*    # A non-' character, or a ' which is followed by a word character.
                           # Uses positive lookahead.
      '(?!\w)              # Closing ', not followed by a word character.
                           # Uses negative lookahead.
    )                      # end capture group
    """,
    flags=re.VERBOSE
)

assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []
```

(Note also that this general approach may break on some slang words which end in ', like goin'.)
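The caveat about words ending in an apostrophe can be checked directly. The sketch below rebuilds the same pattern in compact form (the name `quoted` is only illustrative) and shows a quoted phrase being cut short at goin'.

```python
import re

# Compact form of the pattern above; `quoted` is an illustrative name.
quoted = re.compile(r"((?<!\w)'(?:[^']|'(?=\w))*'(?!\w))")

# The apostrophe ending "goin'" is followed by a space, so it is taken
# as the closing quote and the rest of the phrase is left out.
truncated = quoted.findall("'goin' home'")
assert truncated == ["'goin'"]
```

Handling such words would need extra information, for example a list of known contractions, which a purely local lookahead cannot supply.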
I've just found this library, and I'm trying to use parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.
One example of this nature is below.
Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.
This fails with
Implementing the first rule as a straight regex does work
Where can I start looking to fix this behaviour (if indeed it is a bug)?
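For reference, the termination rule described above (a quote only closes the string when whitespace follows it) can be written as a single pattern with lookaheads. This is a minimal sketch using Python's standard `re` module rather than parsimonious; `QUOTED` and `read_quoted` are hypothetical names, and treating end of input as a valid terminator is an added assumption.

```python
import re

# Opening quote, then any run of non-quote characters or quotes that are
# NOT followed by whitespace/end of input, then a quote that IS.
QUOTED = re.compile(r"'(?:[^']|'(?!\s|$))*'(?=\s|$)")

def read_quoted(text, pos=0):
    """Return the quoted string starting at `pos`, or None if there is none."""
    match = QUOTED.match(text, pos)
    return match.group(0) if match else None

assert read_quoted("'don't' rest") == "'don't'"
assert read_quoted("'it's here'") == "'it's here'"
assert read_quoted("'unterminated") is None
```

Anchoring with `match` at a given position, instead of searching, keeps an apostrophe in the middle of an unquoted word from being mistaken for an opening quote.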