Finalize grammar composition story #30

Open
erikrose opened this issue May 8, 2013 · 25 comments

@erikrose
Owner

erikrose commented May 8, 2013

Reconsider dict-like grammars:

However, pros:

  • You can .update() them to compose. OTOH, won't we have to redo a bunch of compilation and optimization on update? That seems like it should be named something more explicit, like compose(). [Ed: update() is not a very good parallel for the type of thing extends will do; extends maintains Grammar boundaries (for delegation), while update() just paves over keys, one at a time.]
  • You can enumerate the rules without getting repr and such back.
  • You can use Python keywords as rule names.
  • Rule names don't collide with other Grammar method names.

If we were to store expressions on methods of a grammar, we could change or alias Expression.parse to Expression.__call__.
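
For concreteness, here is a toy sketch of the dict-like shape being weighed (not Parsimonious code), with composition spelled compose() so recompilation has a single home, and with parse aliased to __call__:

```python
# Toy sketch only; none of these names are Parsimonious's API.
class Expression:
    def parse(self, text, pos=0):
        raise NotImplementedError('subclasses match text starting at pos')

    # The aliasing idea: grammar['rule'](text) instead of grammar['rule'].parse(text).
    __call__ = parse


class DictGrammar(dict):
    """Rules live in a plain dict: they can be enumerated without repr noise,
    they can be Python keywords like 'class', and they can't collide with
    Grammar method names."""

    def compose(self, other):
        # Spelled compose() rather than leaning on dict.update(), because the
        # merge is also where any recompilation/optimization would have to rerun.
        merged = DictGrammar(self)
        merged.update(other)
        return merged
```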

@erikrose
Owner Author

Moved from a code comment:

We can either have extending callers pull the rule text out of repr, or we could get fancy and define __add__ on Grammars and strings. Or maybe, if you want to extend a grammar, just prepend (or append?) your string to its, and yours will take precedence. Or use the OMeta delegation syntax.

@erikrose
Owner Author

I guess we'll have to go with something heavier, like __add__, since the introduction of custom rules means Grammars are no longer completely round-trippable as strings.

@erikrose
Owner Author

One nice thing about spitting out a new class for each Grammar is that we could use super() to implement parent-rule delegation. But OTOH, I don't want to get married to simply calling other rules; it stomps on the possibility for optimization.

Strawman:

foo = goof ball
foo = my_addition / foo^  # delegates to previously described foo rule

@erikrose
Owner Author

We'll use the rule_name^ syntax to delegate to a supergrammar's rule:

super = Grammar("""
    hi_mom = greeting " mom"
    greeting = "yo" / "hi" / "sup"
    """)

sub = Grammar("""
    greeting = "hello" / greeting^
    """,
    extends=super)

sub.parse('hello mom')

We don't override __add__ and do grammar1 + grammar2, because the second addend wouldn't be a valid Grammar, having unresolvable delegations and possibly other references to nonexistent rules.

extends can take either a single grammar or an iterable of them. Multiple supers at once are necessary: for a simple case, what if you wanted to just make a grammar that's an OR of two others? You'd need to have both around, or you'd get unresolved reference errors.

We should move the custom rules into a custom kwarg [done] so we don't continually have to be backwards-incompatible (or use weird names) every time we add a kwarg. We had to add one just 2 days after introducing custom rules. Custom rules are a pretty niche case and not worth wrecking our flexibility over.

The question arose whether to support delegation to non-uniquely named rules within a single grammar:

greeting = "hi"
greeting = greeting^ / "hello"

I believe this is a bad idea. It causes the order of rules within a grammar to matter in one special case; this isn't true more generally and could lead to confusion. For instance, if the user views grammar addition as concatenation of the string-based rules, he will be surprised when foo references the second instance of bar rather than the first, when these two are added together:

bar = "heigh"
foo = bar
bar = "ho"

Thus, order of rules will not matter except that later rules of a given name within a single grammar will shadow earlier ones. (Subgrammar rules of a given name will also shadow rules of that name from the supergrammar, except for when ^ is used.)

Delegated-to but otherwise shadowed supergrammar rules remain only as anonymous sub-Expressions of the delegating rule. We strip the name off so the subgrammar's visitor method receives the output of only the new rule. If it likes, the method can delegate to the supergrammar's visitation code as appropriate. (As a bonus, this keeps multiple same-named rules from existing in a grammar, which would be un-dict-like and make reprs inaccurate.)

To implement, we will pass zero or more supergrammars into RuleVisitor at initialization. A visit_delegation() grabs any reference to uncompiled (that is, having references unresolved) supergrammar rules off the supergrammar at visit time and sets their names to empty. (We will need to start keeping uncompiled versions of all unshadowed expressions around in grammars.) Existing reference resolution code then stitches everything together normally.

Custom rules should not be a problem. They either reference other rules lazily, by dereferencing the grammar at parse time (and so are not in danger of accidentally referencing an overridden supergrammar rule), or they contain LazyReferences, which should resolve just as in any other Expression. Like with other Expressions, we will keep uncompiled versions of them around to allow for grammar extension. We might add some facility for custom rules to do delegation: perhaps a DelegatedReference akin to a LazyReference.
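
A rough sketch of how that visitor hook could hang together. Everything named here is a design assumption rather than the library's code: DelegatingRuleVisitor, the supergrammars argument, the uncompiled_rules mapping on grammars, and the shape of the delegation node.

```python
from copy import copy

from parsimonious.nodes import NodeVisitor


class DelegatingRuleVisitor(NodeVisitor):
    """Hypothetical wiring for the rule_name^ delegation described above."""

    def __init__(self, supergrammars=()):
        super().__init__()
        self.supergrammars = list(supergrammars)  # zero or more supergrammars

    def visit_delegation(self, node, visited_children):
        # For "greeting^", take the rule name ahead of the caret (assumed node shape).
        label = node.text.rstrip('^').strip()
        for grammar in self.supergrammars:
            # uncompiled_rules is the assumed store of not-yet-resolved expressions.
            expr = getattr(grammar, 'uncompiled_rules', {}).get(label)
            if expr is not None:
                expr = copy(expr)
                expr.name = ''  # anonymize, so only the subgrammar's rule keeps the name
                return expr     # existing reference resolution stitches it in later
        raise KeyError('No rule named %r in any supergrammar' % label)
```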

@erikrose
Owner Author

@halst, I wouldn't mind your feedback on the above.

@erikrose
Owner Author

Consider that OMeta's delegation syntax is just "super"; that is, it doesn't allow delegation to supergrammar rules other than the one you're currently overriding. Is it useful to be able to do so, or should we simplify and make our syntax shorter?

@keleshev
Contributor

extends=... is super. I was about to implement + for my peg.rb and you made a compelling case not to do that.

Another thing I was thinking: inheritance is good, but maybe there is a place for composition. For example:

version = Grammar('''
    version <- digit+ "." digit+ "." digit+
    digit <- [0-9]''')

semantic_version = Grammar('''
    version <- basic_version ("-" meta_data)? ("+" meta_data)?
    meta_data <- [0-9A-Za-z-]+''',
    basic_version = version)

@erikrose
Owner Author

In that case, you might also like the critique of __add__ that I didn't post:

Syntactically, we'll override __add__. This is mostly because a = b + c is more memorable than something like a = b.extended_with(c), and I like the += shortcut. I could be persuaded otherwise, however. Many properties of addition on other domains are not true over Grammars: grammar addition is not commutative, not divisible (summing all the rules together is different than summing the grammars, which are sort of sums of rules), and perhaps not associative. Hmm. So it turns out I've argued myself to neutrality on that point.
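
For example, non-commutativity falls straight out of the later-rules-shadow-earlier proposal above; the + below is still hypothetical:

```python
from parsimonious.grammar import Grammar

a = Grammar('greeting = "hi"')
b = Grammar('greeting = "hello"')

# Under the proposed shadowing semantics, (a + b) would accept "hello" but not
# "hi" for greeting, while (b + a) would accept "hi" but not "hello". Hence
# grammar addition is not commutative. (Grammar.__add__ doesn't exist; this is
# only the proposal being weighed.)
```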

@keleshev
Contributor

So it turns out I've argued myself to neutrality on that point.

😄

@erikrose
Owner Author

You raise a good point about composition. Sometimes it's nice to overlay the namespaces and override rules; other times, it's nice to compose and let the grammar we're delegating to encapsulate its rules and keep the relationships between them. If nothing else, it would be nice to know that rules I add to my subgrammar won't collide with ones added in the future by the supergrammar. (Heh, most programming languages don't even do anything to help with that in subclassing.) At any rate, this requires more design thought.

@erikrose
Owner Author

You're not going to believe this. It just so happens that, because of the way custom rules are built, we get composition for free. For example, let's compose 2 wiki grammars to make an uber-wiki parser that tries multiple dialects, one after the other, until one works:

Grammar("""
    any_wiki = mediawiki / freakywiki
    """,
    # Now we do, in essence, "from mediawiki_grammar import markup"--and the
    # same for freakywiki:
    custom={'mediawiki': mediawiki_grammar['markup'],
            'freakywiki': freakywiki_grammar['markup']})

The external grammars already have all their references resolved, so there's no danger of _resolve_refs() plowing through them and stitching rules defined in the new grammar into them. We don't have to do anything to the DSL. It's beautiful. Do you see any holes?

(We could even teach custom rules to take Grammars directly and go grab their default rules, meaning we wouldn't need to say ['markup'].)

@erikrose
Owner Author

Also, we should rename custom to extra to take into account this new use case.

@erikrose
Owner Author

Here's something fun. I don't necessarily think it's a good idea.

Grammar(
    any_wiki = "mediawiki / freakywiki",
    a_custom_rule = lambda text, pos: ...,
    an_import = another_grammar,
    something_else = "'hi' / an_import",
    options = {'all': 'the things we might otherwise need specific kwargs for'}
)

It sure makes everything consistent, though I don't like all the extra punctuation (commas and quotes) you have to type and, more importantly, read. I could see making it so you could use strings as custom rules so people could have the option of doing something like this within an extra= kwarg. We'd just have to break the reference resolving off the RuleVisitor so the visitor can emit LazyReference-containing exprs.

@keleshev
Contributor

I don't like an exceptional custom or extra kwarg. Don't we get composition for free if grammars implemented Expression's interface?

@erikrose
Owner Author

Even if grammars implemented the Expression interface, how would we make their entrypoints available in another grammar without a custom= or extra=?

@erikrose
Owner Author

Also, you might want to call out to an expression of a grammar other than the default one.

@keleshev
Contributor

Even if grammars implemented the Expression interface, how would we make their entrypoints available in another grammar without a custom= or extra=?

Not totally sure what you mean. Like this?

Grammar("""
    any_wiki = mediawiki / freakywiki
    """,
    mediawiki = mediawiki_grammar['markup'],
    freakywiki = freakywiki_grammar['markup'])

Also, you might want to call out to an expression of a grammar other than the default one.

I think that grammar['rule'] should return an object which follows the Expression interface.

@erikrose
Owner Author

I think we have a failure to communicate. :-) Some clarifications:

  • grammar['rule'] returns an Expression subclass and always has.
  • Originally, we accepted custom rules as **kwargs. That was cool because it was brief and because it reflected dict's kwarg conventions. In 40d8bbe, I changed that to use a single custom kwarg that takes a dict. (I think we should rename it extra or, better, rules, so it reads better when there is no string blob passed in.) This is meant to spare us a backward incompatibility with user-created grammars each time we add a kwarg to the Grammar constructor; we already have two: default_rule and the extends proposed here. If we didn't do this, we'd have 3 choices: reserve a single options kwarg and nest all future options inside it, reserve some other section of the kwarg namespace for options, or accept the risk that someone's grammar uses, say, cache= as a custom rule and later breaks because we adopt cache as a Grammar constructor option. (See the toy sketch below for that last case.)
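
To make the collision risk concrete, here is a toy, Parsimonious-free sketch; make_grammar and its successor are invented purely for illustration:

```python
# Toy illustration only; make_grammar is a stand-in, not Parsimonious's constructor.
def make_grammar(rules='', **custom_rules):
    """v1: every keyword argument is a custom rule."""
    return {'rules': rules, 'options': {}, 'custom': custom_rules}

g1 = make_grammar("greeting = cache", cache=lambda text, pos: pos)  # "cache" is a rule name here
assert 'cache' in g1['custom']

def make_grammar_v2(rules='', cache=True, **custom_rules):
    """v2: cache= has become a constructor option, silently stealing the name."""
    return {'rules': rules, 'options': {'cache': cache}, 'custom': custom_rules}

g2 = make_grammar_v2("greeting = cache", cache=lambda text, pos: pos)
assert 'cache' not in g2['custom']  # the user's rule quietly turned into an option
```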

Does that clear things up? Do you agree?

@keleshev
Contributor

I wish we could just have kwarg rules; they are very readable, but I understand the problem.

Ok, just brainstorming here...

anywiki_partial_grammar = Grammar.from_string("""
    any_wiki = mediawiki / freakywiki
""")

wikis_mixin_grammar = Grammar.from_expressions(
    mediawiki=mediawiki_grammar['markup'],
    freakywiki=freakywiki_grammar['markup'],
)

Grammar.which_extends(anywiki_partial_grammar,
                      wikis_mixin_grammar)

@erikrose
Owner Author

Do you expect that you'll use custom rules much of the time? I am assuming they'd be fairly rare.

@erikrose
Owner Author

Your multiple constructors idea brought to mind one alternative we hadn't considered: chaining methods.

g = Grammar("""
    a = b / c
    b = c d
    """
    ).more(d=lambda text, pos: pos+5,
           c=lambda text, pos: pos+1
    ).extend(other_grammar)

I think it looks worse than a dedicated kwarg. It's a little shorter but actually no fewer tokens than the kwarg:

g = Grammar("""
    a = b / c
    b = c d
    """,
    rules=dict(d=lambda text, pos: pos+5,
               c=lambda text, pos: pos+1),
    extend=other_grammar)

Plus we'd have to introduce partial grammars (likewise with your from_string example), which adds complexity. We'd have a new thing with different invariants and a multiplicative number of interactions with normal Grammars to decide about (and which users would have to learn).

For those reasons, I still favor the dedicated kwarg—or a spelling convention: we could reserve a certain segment of kwarg namespace for our non-rule kwargs. But no good conventions spring to mind.

I'm still curious whether you expect to need custom rules often; it seems a pretty niche case to optimize for. Let me know; I'm psyched to get something implemented.

@keleshev
Contributor

I don't think custom rules will be used very often.

@erikrose
Owner Author

Realized that abstract supergrammars are inherently incomplete grammars and will need to be representable somehow. Rethinking based on that. + may return.

@erikrose
Owner Author

Okay, let's think about going back to kwarg-style custom rules. They're nice because they (a) have less syntax and (b) are more dict-like. The problem with them is that they smash into future constructor args.

The reasons I can think of to have a constructor arg are...

  1. To input stuff you need to satisfy invariants
  2. To influence how initialization computations are done (e.g., compiler hints)
  3. To change how the whole instance behaves by altering its invariant rules or default behaviors. For example, you could mark a grammar as abstract or turn off by-default caching.

Barring throwing away our basest assumptions, we're clearly not going to add more requirements to instantiate a Grammar, so we're safe from (1). (3) can be addressed by making different flavors of Grammars (like AbstractGrammars) or adding alternate constructors. I don't foresee any of (2) at the moment; we're going to try to keep Parsimonious's optimizer pretty light-touch.

So let's try AbstractGrammar and .default() and Grammar(rules='', **more_rules) and +, and we can at least enjoy that beauty for a while until we are unable to escape a reserved kwarg. Hopefully one of those will show its face before 1.0 if it's going to at all.

Problem: we make a big deal out of having to declare you want an AbstractGrammar, but addition could yield abstract or concrete ones, with no lexically evident clue as to which. This is such a corner case, though; we'll just let it slide. It is pretty Pythonic to defer the error anyway, as raise NotImplementedError does for abstract classes' methods. We'll raise a nice error ("Hey, this is missing 'foo' and 'smoo'") when they try to parse.

The motivation for making users say AbstractGrammar up front is that they get prompt error messages when designing grammars in the usual, non-abstract case. It makes intent clearer, too, when reading the code. AbstractGrammars will have parse() and match() methods that raise errors, and __getitem__ will raise KeyError with a descriptive message as well.
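
A small sketch of how the Grammar(rules='', **more_rules) spelling reads, pulling an Expression out of one grammar and handing it to another as a keyword rule. This kwarg interface is close to what Parsimonious eventually shipped; AbstractGrammar and + remain proposals at this point.

```python
from parsimonious.grammar import Grammar

greetings = Grammar('greeting = "hi" / "hello"')

letter = Grammar("""
    greeting_line = greeting " mom"
    """,
    greeting=greetings['greeting'])  # the keyword arg supplies the otherwise-missing rule

print(letter.parse('hello mom'))
```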

@lucaswiman
Collaborator

Just writing down some misc. thoughts about this issue. I might make a PR implementing something like this. I think it's not really that much code.

My main use case for this at the moment is extending Parsimonious's syntax to allow richer expressions in token grammars. However, I don't think the syntax would make sense in general, just for my particular case. So I'd want to say something like:

"""
...somehow import parsimonious.grammar.rules_syntax...
quantifier = quantifier^ / attrs_expr
attrs_expr = "[" _ "@" identifier "=" atom "]"
"""

Then I'd subclass `RuleVisitor` with an updated `visit_quantifier` and `visit_attrs_expr`, making the new syntax usable for my use case without either complicating the Grammar DSL or forking the library.

The issue is that because of the way references are resolved, there's not a good way to replace `quantifier` in the rules in the "super" grammar that use it. There, I specifically want the "new" version to override the old in all expressions in the super grammar. A few options:

  1. Deepcopy the original grammar and traverse the exprs, replacing every node bearing the given name with the new rule.

  2. Use custom rules:

```python
from parsimonious.grammar import rule_syntax, rule_grammar, Grammar

new_grammar = Grammar(rule_syntax, custom_rules={"quantifier": custom_quantifier_rule, "attrs_expr": attrs_expr_rule})
# ...extend RuleVisitor, etc.
```

This works, but it's a bit clunky. It seems to me the key insight behind Parsimonious's good usability, compared to past parser-generator libraries in Python, is that a flexible string-parsing library can take its own input as strings and doesn't have to rely on the syntax of the host language. So the most natural kind of composition for Parsimonious grammars feels like just composing strings in the ordinary way, possibly with a separator between sections. If grammars just retained their input string (and maybe their custom rules), then this becomes easy. In a system like this, my example above would look like:

from parsimonious.grammar import Grammar, RuleVisitor

new_rule_grammar = Grammar(fr"""
{Grammar.rule_grammar.definition}
------------
quantifier = quantifier^ / attrs_expr
attrs_expr = "[" _ "@" identifier "=" atom "]"
""")
class MyCustomRuleVisitor(RuleVisitor):
    def visit_quantifier(...):
        ...
    def visit_attrs_expr(...):
        ...

class MyCustomGrammar(Grammar):
    visitor_cls = MyCustomRuleVisitor  # requires a small refactor of `_expressions_from_rules`

For the separator, I'm thinking at least two `-` or `=` characters.

This skirts past a lot of the problems with validating "abstract" grammars: they're just strings. To extend an abstract grammar, you do something like:

Grammar(fr"""
{abstract_grammar}  # references but does not define ruleX
ruleX = ...
""")

@lucaswiman lucaswiman mentioned this issue Apr 20, 2022