performance issue when parsing => grammar not good? #106

cons0l3 · 2021-04-14T08:17:26Z

cons0l3
Apr 14, 2021

Dear all,

first of all: nice work! keep it up.

I am currently working on an expression parser (arpeggio 1.10.1) that mimics python expressions:

from arpeggio import *
from arpeggio import RegExMatch as _

class SuppressStrMatch(StrMatch):
    suppress = True

def comment():          return [_("//.*"), _("/\*.*\*/")] 

def expression():       return Optional(disjunction), EOF

def disjunction():      return [(conjunction, OneOrMore( Kwd("OR"), conjunction ) ), conjunction]
def conjunction():      return [(inversion, OneOrMore( Kwd("AND"), inversion) ), inversion]
def inversion():        return [ (Kwd('NOT'), inversion), comparison]
def comparison():       return [ compare_op, plus]
def compare_op():       return [ binaryOp, notin, element_in, isnot, element_is ]
def binaryOp():         return plus, ['==','!=','<=','<','>=','>'], plus
def notin():            return plus, 'not', 'in', SuppressStrMatch("["), disjunction, ZeroOrMore(SuppressStrMatch(","), disjunction), SuppressStrMatch("]")
def element_in():       return plus, 'in', SuppressStrMatch("["), disjunction, ZeroOrMore(SuppressStrMatch(","), disjunction), SuppressStrMatch("]")
def isnot():            return plus, 'is', 'not', plus
def element_is():       return plus, 'is', plus
def plus():             return term, ZeroOrMore(['+', '-'], term)
def term():             return factor, ZeroOrMore(['*', '/'], factor)
def factor():           return Optional(['+','-']), [value, parenthesis]
def value():            return [literal, symbol]

def literal():          return [numericLiteral, stringLiteral, constantLiteral ]
def numericLiteral():   return _(r'\d*\.\d*|\d+') #(E(+|-)?\d+)?
def stringLiteral():    return [_(r'".*?"'), _(r"'.*?'")]
def constantLiteral():  return [_(r"True"), _(r"False"), _(r"NULL")]
def symbol():           return _(r"\w+")
def parenthesis():      return SuppressStrMatch("("), disjunction, SuppressStrMatch(")")

ExpressionParser:ParserPython = ParserPython(expression, comment, debug=False, reduce_tree=True, ignore_case=True)

when I execute a parse:
expression = ExpressionParser.parse("""(p in [37,47] and (c in ["A", "P"] or l==true)) or (p in [75,76,361,390,391] and c in ["A", "P"])""")

it takes about 8 seconds. that is way to long. I would like it to be way below 1 second for parsing such a simple expression.

could anyone assist in explaing or advising how to fix the grammar? (which is directly derived from the peg parser of python itself and translated to arpeggio style)

thanks
carsten

Thanks for pushing me in the right direction

cons0l3 · 2021-04-14T08:25:35Z

cons0l3
Apr 14, 2021
Author

oh, and I have really aweful exceptipons when the parsed string has syntax issues. it is always stack exceptions and not the nice exception "hey at position 3 after the "or" something is not right..." kind of deal.

0 replies

igordejanovic · 2021-04-14T09:09:13Z

igordejanovic
Apr 14, 2021
Maintainer

For these kinds of grammars where there is a lot of backtracking (expressions grammars are a good example) you should turn on memoization which can yield a massive speed improvements.

Here is a test on my machine:

import time
# Without memoization
ExpressionParser = ParserPython(expression, comment, debug=False, reduce_tree=True,
                                ignore_case=True)
exp_str = """(p in [37,47] and (c in ["A", "P"] or l==true))
             or (p in [75,76,361,390,391] and c in ["A", "P"])"""
start = time.process_time()
exp_nomem = ExpressionParser.parse(exp_str)
print(exp_nomem.tree_str())
print("TIME:", time.process_time() - start, "\n")

# With memoization
ExpressionParser = ParserPython(expression, comment, debug=False, reduce_tree=True,
                                ignore_case=True, memoization=True)
start = time.process_time()
exp_mem = ExpressionParser.parse(exp_str)
print(exp_mem.tree_str())
print("TIME:", time.process_time() - start)

Output:

expression=Sequence [1-110]
  disjunction=OrderedChoice [1-108]
    conjunction=OrderedChoice [1-45]
      element_in=Sequence [1-12]
        symbol=RegExMatch(\w+) [1-2]: p
...
  EOF [110-110]: 
TIME: 9.193519810000002 

expression=Sequence [1-110]
  disjunction=OrderedChoice [1-108]
    conjunction=OrderedChoice [1-45]
      element_in=Sequence [1-12]
...
  EOF [110-110]: 
TIME: 0.004547567999999558

As you can see, without memoization it takes over 9 secs while with memoization it takes only 4 ms.

For handling exception see this part of the docs

0 replies

igordejanovic · 2021-04-14T09:14:04Z

igordejanovic
Apr 14, 2021
Maintainer

TLDR; for exception the easy way to get just the error message without stacktrace is to catch NoMatch exception and just print it.

0 replies

cons0l3 · 2021-04-14T09:15:42Z

cons0l3
Apr 14, 2021
Author

thanks igor! it seems i can also improve performance by rewriting a few rules. but memotization already improved significantly.

0 replies

cons0l3 · 2021-04-14T09:23:34Z

cons0l3
Apr 14, 2021
Author

you are my hero of the day!

0 replies

cons0l3 · 2021-04-14T11:31:55Z

cons0l3
Apr 14, 2021
Author

I rewrote the first few rules to be:

def disjunction():      return conjunction, ZeroOrMore( Kwd("OR"), conjunction )
def conjunction():      return inversion, ZeroOrMore( Kwd("AND"), inversion)
def inversion():        return Optional(Kwd('NOT')), comparison
def comparison():       return [ binaryCompareOp, notin, element_in, isnot, element_is, plus ]
def binaryCompareOp():  return plus, ['==','!=','<=','<','>=','>'], plus

and that alone helped considerably plus Igors tipp.

Problem solved.

0 replies

igordejanovic · 2021-04-14T12:05:05Z

igordejanovic
Apr 14, 2021
Maintainer

I'm glad it solved your problem. I'll leave the issue open labeled with question to be easier to find.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance issue when parsing => grammar not good? #106

{{title}}

Replies: 7 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

performance issue when parsing => grammar not good? #106

cons0l3 Apr 14, 2021

Replies: 7 comments

cons0l3 Apr 14, 2021 Author

igordejanovic Apr 14, 2021 Maintainer

igordejanovic Apr 14, 2021 Maintainer

cons0l3 Apr 14, 2021 Author

cons0l3 Apr 14, 2021 Author

cons0l3 Apr 14, 2021 Author

igordejanovic Apr 14, 2021 Maintainer

cons0l3
Apr 14, 2021

cons0l3
Apr 14, 2021
Author

igordejanovic
Apr 14, 2021
Maintainer

igordejanovic
Apr 14, 2021
Maintainer

cons0l3
Apr 14, 2021
Author

cons0l3
Apr 14, 2021
Author

cons0l3
Apr 14, 2021
Author

igordejanovic
Apr 14, 2021
Maintainer