Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching being too greedy #185

Open
rowlesmr opened this issue Feb 26, 2022 · 3 comments
Open

Matching being too greedy #185

rowlesmr opened this issue Feb 26, 2022 · 3 comments

Comments

@rowlesmr
Copy link

I've just found this library, and I'm trying to implement parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.

One example of this nature is below.

Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.

from parsimonious.grammar import Grammar
quote_grammar = Grammar(
    r"""
    SingleQuotedString = single_quote Char* single_quote WhiteSpace
    WhiteSpace = ( SP / HT / EOL )+
    single_quote = "'"  
    
    Char = ~"[a-zA-Z0-9' \t]"
    
    EOL = ~'[\n\r\f]+'
    SP = ' '
    HT = '\t'
    """
)

print(quote_grammar.parse("'don't' "))

This fails with

parsimonious.exceptions.ParseError: Rule 'single_quote' didn't match at '' (line 1, column 9).

Implementing the first rule as a straight regex does work

quote_grammar = Grammar(
    r"""
    SingleQuotedString = ~"['][a-zA-Z0-9' \t]*['][ \t\n\r\f]+"
    """

 -->
<RegexNode called "SingleQuotedString" matching "'don't' ">

Where can I start looking on how to fix this behaviour (if indeed it is a bug...)

@comalice
Copy link

comalice commented Mar 1, 2022

Char* in SingleQuotedString = single_quote Char* single_quote WhiteSpace appears to be the thing that is "too greedy" because your second single_quote has nothing to match on at the end of your string ("column 9", the last index of the string). According to some resources I read

As a regular expression \[.*\] matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic: .* consumes the rest of the input, so \] never matches. As a PEG this needs to be written as: \[ ( !\] . )* \] (or \[ @ \]). [0]

Note that the regular expression does not behave as intended either: in the example * should not be greedy, so \[.*?\] should be used instead.

It looks like you should make use of lookaheads to get the behaviour you need; something like

SingleQuotedString = single_quote (!end_single_quote Char)* end_single_quote
end_single_quote = single_quote Whitespace

[0] https://nim-lang.org/docs/pegs.html

@rowlesmr
Copy link
Author

rowlesmr commented Mar 1, 2022

I see. Make the two token search for "' " into a single token search by making it it's own match.

Ta a lot.

I'll try it out.

@lucaswiman
Copy link
Collaborator

Note that this is regular, so you could do this with just a re.

import re
pattern = re.compile(
    r"""
        (  # start capture group
            (?<!\w)'  # Starting ', which should not be preceded by a word character.
                      # Uses negative lookbehind.
            (?:[^']|'(?=\w))*  # A non-' character, or a ' which is followed by a word character.
                               # Uses positive lookahead.
            '(?!\w)  # Closing ', not followed by a word character.
                     # Uses negative lookahead.
        )  # end capture group
    """,
    flags=re.VERBOSE
)
assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []

(Note also that this general approach may break on some slang words which end in ', like goin'.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants