Matching being too greedy #185

rowlesmr · 2022-02-26T14:21:53Z

I've just found this library, and I'm trying to implement parsimonious to at least tokenise my input files. There is a grammar I'd like to implement, but the matching seems to be a little too greedy.

One example of this nature is below.

Strings contained in quotes are able to contain the quote character itself. To be a valid string-termination quote mark, the quote mark must be followed by whitespace.

from parsimonious.grammar import Grammar
quote_grammar = Grammar(
    r"""
    SingleQuotedString = single_quote Char* single_quote WhiteSpace
    WhiteSpace = ( SP / HT / EOL )+
    single_quote = "'"  
    
    Char = ~"[a-zA-Z0-9' \t]"
    
    EOL = ~'[\n\r\f]+'
    SP = ' '
    HT = '\t'
    """
)

print(quote_grammar.parse("'don't' "))

This fails with

parsimonious.exceptions.ParseError: Rule 'single_quote' didn't match at '' (line 1, column 9).

Implementing the first rule as a straight regex does work

quote_grammar = Grammar(
    r"""
    SingleQuotedString = ~"['][a-zA-Z0-9' \t]*['][ \t\n\r\f]+"
    """

 -->
<RegexNode called "SingleQuotedString" matching "'don't' ">

Where can I start looking on how to fix this behaviour (if indeed it is a bug...)

The text was updated successfully, but these errors were encountered:

comalice · 2022-03-01T15:33:02Z

Char* in SingleQuotedString = single_quote Char* single_quote WhiteSpace appears to be the thing that is "too greedy" because your second single_quote has nothing to match on at the end of your string ("column 9", the last index of the string). According to some resources I read

As a regular expression \[.*\] matches the longest possible text between '[' and ']'. As a PEG it never matches anything, because a PEG is deterministic: .* consumes the rest of the input, so \] never matches. As a PEG this needs to be written as: \[ ( !\] . )* \] (or \[ @ \]). [0]

Note that the regular expression does not behave as intended either: in the example * should not be greedy, so \[.*?\] should be used instead.

It looks like you should make use of lookaheads to get the behaviour you need; something like

SingleQuotedString = single_quote (!end_single_quote Char)* end_single_quote
end_single_quote = single_quote Whitespace

[0] https://nim-lang.org/docs/pegs.html

rowlesmr · 2022-03-01T23:27:18Z

I see. Make the two token search for "' " into a single token search by making it it's own match.

Ta a lot.

I'll try it out.

lucaswiman · 2022-03-01T23:42:39Z

Note that this is regular, so you could do this with just a re.

import re
pattern = re.compile(
    r"""
        (  # start capture group
            (?<!\w)'  # Starting ', which should not be preceded by a word character.
                      # Uses negative lookbehind.
            (?:[^']|'(?=\w))*  # A non-' character, or a ' which is followed by a word character.
                               # Uses positive lookahead.
            '(?!\w)  # Closing ', not followed by a word character.
                     # Uses negative lookahead.
        )  # end capture group
    """,
    flags=re.VERBOSE
)
assert pattern.findall("'do' 'don't'") == ["'do'", "'don't'"]
assert pattern.findall("'don't'") == ["'don't'"]
assert pattern.findall("'don't") == []

(Note also that this general approach may break on some slang words which end in ', like goin'.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matching being too greedy #185

Matching being too greedy #185

rowlesmr commented Feb 26, 2022

comalice commented Mar 1, 2022

rowlesmr commented Mar 1, 2022

lucaswiman commented Mar 1, 2022

Matching being too greedy #185

Matching being too greedy #185

Comments

rowlesmr commented Feb 26, 2022

comalice commented Mar 1, 2022

rowlesmr commented Mar 1, 2022

lucaswiman commented Mar 1, 2022