Finalize grammar composition story #30

Open
erikrose opened this issue May 8, 2013 · 25 comments

@erikrose
Owner

erikrose commented May 8, 2013

Reconsider dict-like grammars:

However, pros:

  • You can .update() them to compose. OTOH, won't we have to redo a bunch of compilation and optimization on update? That seems like it should be named something more explicit, like compose(). [Ed: update() is not a very good parallel for the type of thing extends will do; extends maintains Grammar boundaries (for delegation), while update() just paves over keys, one at a time.]
  • You can enumerate the rules without getting repr and such back.
  • You can use Python keywords as rule names.
  • Rule names don't collide with other Grammar method names.

If we were to store expressions on methods of a grammar, we could change or alias Expression.parse to Expression.__call__.
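
For concreteness, here is a toy sketch of the dict-like shape being weighed (not Parsimonious code), with composition spelled compose() so recompilation has a single home, and with parse aliased to __call__:

```python
# Toy sketch only; none of these names are Parsimonious's API.
class Expression:
    def parse(self, text, pos=0):
        raise NotImplementedError('subclasses match text starting at pos')

    # The aliasing idea: grammar['rule'](text) instead of grammar['rule'].parse(text).
    __call__ = parse


class DictGrammar(dict):
    """Rules live in a plain dict: they can be enumerated without repr noise,
    they can be Python keywords like 'class', and they can't collide with
    Grammar method names."""

    def compose(self, other):
        # Spelled compose() rather than leaning on dict.update(), because the
        # merge is also where any recompilation/optimization would have to rerun.
        merged = DictGrammar(self)
        merged.update(other)
        return merged
```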

@erikrose
Owner Author

Moved from a code comment:

We can either have extending callers pull the rule text out of repr, or we could get fancy and define __add__ on Grammars and strings. Or maybe, if you want to extend a grammar, just prepend (or append?) your string to its, and yours will take precedence. Or use the OMeta delegation syntax.

@erikrose
Owner Author

I guess we'll have to go with something heavier, like __add__, since the introduction of custom rules means Grammars are no longer completely round-trippable as strings.

@erikrose
Owner Author

One nice thing about spitting out a new class for each Grammar is that we could use super() to implement parent-rule delegation. But OTOH, I don't want to get married to simply calling other rules; it stomps on the possibility for optimization.

Strawman:

foo = goof ball
foo = my_addition / foo^  # delegates to previously described foo rule

@erikrose
Owner Author

We'll use the rule_name^ syntax to delegate to a supergrammar's rule:

super = Grammar("""
    hi_mom = greeting " mom"
    greeting = "yo" / "hi" / "sup"
    """)

sub = Grammar("""
    greeting = "hello" / greeting^
    """,
    extends=super)

sub.parse('hello mom')

We don't override __add__ and do grammar1 + grammar2, because the second addend wouldn't be a valid Grammar, having unresolvable delegations and possibly other references to nonexistent rules.

extends can take either a single grammar or an iterable of them. Multiple supers at once are necessary: for a simple case, what if you wanted to just make a grammar that's an OR of two others? You'd need to have both around, or you'd get unresolved reference errors.

We should move the custom rules into a custom kwarg [done] so we don't continually have to be backwards-incompatible (or use weird names) every time we add a kwarg. We had to add one just 2 days after introducing custom rules. Custom rules are a pretty niche case and not worth wrecking our flexibility over.

The question arose whether to support delegation to non-uniquely named rules within a single grammar:

greeting = "hi"
greeting = greeting^ / "hello"

I believe this is a bad idea. It causes the order of rules within a grammar to matter in one special case; this isn't true more generally and could lead to confusion. For instance, if the user views grammar addition as concatenation of the string-based rules, he will be surprised when foo references the second instance of bar rather than the first, when these two are added together:

bar = "heigh"
foo = bar
bar = "ho"

Thus, order of rules will not matter except that later rules of a given name within a single grammar will shadow earlier ones. (Subgrammar rules of a given name will also shadow rules of that name from the supergrammar, except for when ^ is used.)

Delegated-to but otherwise shadowed supergrammar rules remain only as anonymous sub-Expressions of the delegating rule. We strip the name off so the subgrammar's visitor method receives the output of only the new rule. If it likes, the method can delegate to the supergrammar's visitation code as appropriate. (As a bonus, this keeps multiple same-named rules from existing in a grammar, which would be un-dict-like and make reprs inaccurate.)

To implement, we will pass zero or more supergrammars into RuleVisitor at initialization. A visit_delegation() grabs any reference to uncompiled (that is, having references unresolved) supergrammar rules off the supergrammar at visit time and sets their names to empty. (We will need to start keeping uncompiled versions of all unshadowed expressions around in grammars.) Existing reference resolution code then stitches everything together normally.

Custom rules should not be a problem. They either reference other rules lazily, by dereferencing the grammar at parse time (and so are not in danger of accidentally referencing an overridden supergrammar rule), or they contain LazyReferences, which should resolve just as in any other Expression. Like with other Expressions, we will keep uncompiled versions of them around to allow for grammar extension. We might add some facility for custom rules to do delegation: perhaps a DelegatedReference akin to a LazyReference.
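
A rough sketch of how that visitor hook could hang together. Everything named here is a design assumption rather than the library's code: DelegatingRuleVisitor, the supergrammars argument, the uncompiled_rules mapping on grammars, and the shape of the delegation node.

```python
from copy import copy

from parsimonious.nodes import NodeVisitor


class DelegatingRuleVisitor(NodeVisitor):
    """Hypothetical wiring for the rule_name^ delegation described above."""

    def __init__(self, supergrammars=()):
        super().__init__()
        self.supergrammars = list(supergrammars)  # zero or more supergrammars

    def visit_delegation(self, node, visited_children):
        # For "greeting^", take the rule name ahead of the caret (assumed node shape).
        label = node.text.rstrip('^').strip()
        for grammar in self.supergrammars:
            # uncompiled_rules is the assumed store of not-yet-resolved expressions.
            expr = getattr(grammar, 'uncompiled_rules', {}).get(label)
            if expr is not None:
                expr = copy(expr)
                expr.name = ''  # anonymize, so only the subgrammar's rule keeps the name
                return expr     # existing reference resolution stitches it in later
        raise KeyError('No rule named %r in any supergrammar' % label)
```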

@erikrose
Owner Author

@halst, I wouldn't mind your feedback on the above.

@erikrose
Owner Author

Consider that OMeta's delegation syntax is just "super"; that is, it doesn't allow delegation to supergrammar rules other than the one you're currently overriding. Is it useful to be able to do so, or should we simplify and make our syntax shorter?

@keleshev
Contributor

extends=... is super. I was about to implement + for my peg.rb and you made a compelling case not to do that.

Another thing I was thinking: inheritance is good, but maybe there is a place for composition. For example:

version = Grammar('''
    version <- digit+ "." digit+ "." digit+
    digit <- [0-9]''')

semantic_version = Grammar('''
    version <- basic_version ("-" meta_data)? ("+" meta_data)?
    meta_data <- [0-9A-Za-z-]+''',
    basic_version = version)

@erikrose
Owner Author

In that case, you might also like the critique of __add__ that I didn't post:

Syntactically, we'll override __add__. This is mostly because a = b + c is more memorable than something like a = b.extended_with(c), and I like the += shortcut. I could be persuaded otherwise, however. Many properties of addition on other domains are not true over Grammars: grammar addition is not commutative, not divisible (summing all the rules together is different than summing the grammars, which are sort of sums of rules), and perhaps not associative. Hmm. So it turns out I've argued myself to neutrality on that point.
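
For example, non-commutativity falls straight out of the later-rules-shadow-earlier proposal above; the + below is still hypothetical:

```python
from parsimonious.grammar import Grammar

a = Grammar('greeting = "hi"')
b = Grammar('greeting = "hello"')

# Under the proposed shadowing semantics, (a + b) would accept "hello" but not
# "hi" for greeting, while (b + a) would accept "hi" but not "hello". Hence
# grammar addition is not commutative. (Grammar.__add__ doesn't exist; this is
# only the proposal being weighed.)
```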

@keleshev
Contributor

So it turns out I've argued myself to neutrality on that point.

😄

@erikrose
Owner Author

You raise a good point about composition. Sometimes it's nice to overlay the namespaces and override rules; other times, it's nice to compose and let the grammar we're delegating to encapsulate its rules and keep the relationships between them. If nothing else, it would be nice to know that rules I add to my subgrammar won't collide with ones added in the future by the supergrammar. (Heh, most programming languages don't even do anything to help with that in subclassing.) At any rate, this requires more design thought.

@erikrose
Owner Author

You're not going to believe this. It just so happens that, because of the way custom rules are built, we get composition for free. For example, let's compose 2 wiki grammars to make an uber-wiki parser that tries multiple dialects, one after the other, until one works:

Grammar("""
    any_wiki = mediawiki / freakywiki
    """,
    # Now we do, in essence, "from mediawiki_grammar import markup"--and the
    # same for freakywiki:
    custom={'mediawiki': mediawiki_grammar['markup'],
            'freakywiki': freakywiki_grammar['markup']})

The external grammars already have all their references resolved, so there's no danger of _resolve_refs() plowing through them and stitching rules defined in the new grammar into them. We don't have to do anything to the DSL. It's beautiful. Do you see any holes?

(We could even teach custom rules to take Grammars directly and go grab their default rules, meaning we wouldn't need to say ['markup'].)

@erikrose
Owner Author

Also, we should rename custom to extra to take into account this new use case.

@erikrose
Owner Author

Here's something fun. I don't necessarily think it's a good idea.

Grammar(
    any_wiki = "mediawiki / freakywiki",
    a_custom_rule = lambda text, pos: ...,
    an_import = another_grammar,
    something_else = "'hi' / an_import",
    options = {'all': 'the things we might otherwise need specific kwargs for'}
)

It sure makes everything consistent, though I don't like all the extra punctuation (commas and quotes) you have to type and, more importantly, read. I could see making it so you could use strings as custom rules so people could have the option of doing something like this within an extra= kwarg. We'd just have to break the reference resolving off the RuleVisitor so the visitor can emit LazyReference-containing exprs.

@keleshev
Contributor

I don't like an exceptional custom or extra kwarg. Don't we get composition for free if grammars implemented Expression's interface?

@erikrose
Owner Author

Even if grammars implemented the Expression interface, how would we make their entrypoints available in another grammar without a custom= or extra=?

@erikrose
Owner Author

Also, you might want to call out to an expression of a grammar other than the default one.

@keleshev
Contributor

Even if grammars implemented the Expression interface, how would we make their entrypoints available in another grammar without a custom= or extra=?

Not totally sure what you mean. Like this?

Grammar("""
    any_wiki = mediawiki / freakywiki
    """,
    mediawiki = mediawiki_grammar['markup'],
    freakywiki = freakywiki_grammar['markup'])

Also, you might want to call out to an expression of a grammar other than the default one.

I think that grammar['rule'] should return an object which follows the Expression interface.

@erikrose
Owner Author

I think we have a failure to communicate. :-) Some clarifications:

  • grammar['rule'] returns an Expression subclass and always has.
  • Originally, we accepted custom rules as **kwargs. That was cool because it was brief and because it reflected dict's kwarg conventions. In 40d8bbe, I changed that to use a single custom kwarg that takes a dict. (I think we should rename it extra or, better, rules, so it reads better when there is no string blob passed in.) This is meant to spare us a backward incompatibility with user-created grammars each time we add a kwarg to the Grammar constructor; we already have two: default_rule and the extends proposed here. If we didn't do this, we'd have 3 choices: reserve a single options kwarg and nest all future options inside it, reserve some other section of the kwarg namespace for options, or accept the risk that someone's grammar uses, say, cache= as a custom rule and later breaks because we adopt cache as a Grammar constructor option. (See the toy sketch below for that last case.)
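
To make the collision risk concrete, here is a toy, Parsimonious-free sketch; make_grammar and its successor are invented purely for illustration:

```python
# Toy illustration only; make_grammar is a stand-in, not Parsimonious's constructor.
def make_grammar(rules='', **custom_rules):
    """v1: every keyword argument is a custom rule."""
    return {'rules': rules, 'options': {}, 'custom': custom_rules}

g1 = make_grammar("greeting = cache", cache=lambda text, pos: pos)  # "cache" is a rule name here
assert 'cache' in g1['custom']

def make_grammar_v2(rules='', cache=True, **custom_rules):
    """v2: cache= has become a constructor option, silently stealing the name."""
    return {'rules': rules, 'options': {'cache': cache}, 'custom': custom_rules}

g2 = make_grammar_v2("greeting = cache", cache=lambda text, pos: pos)
assert 'cache' not in g2['custom']  # the user's rule quietly turned into an option
```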

Does that clear things up? Do you agree?

@keleshev
Contributor

I wish we could just have kwarg rules; they are very readable, but I understand the problem.

Ok, just brainstorming here...

anywiki_partial_grammar = Grammar.from_string("""
    any_wiki = mediawiki / freakywiki
""")

wikis_mixin_grammar = Grammar.from_expressions(
    mediawiki=mediawiki_grammar['markup'],
    freakywiki=freakywiki_grammar['markup'],
)

Grammar.which_extends(anywiki_partial_grammar,
                      wikis_mixin_grammar)

@erikrose
Owner Author

Do you expect that you'll use custom rules much of the time? I am assuming they'd be fairly rare.

@erikrose
Owner Author

Your multiple constructors idea brought to mind one alternative we hadn't considered: chaining methods.

g = Grammar("""
    a = b / c
    b = c d
    """
    ).more(d=lambda text, pos: pos+5,
           c=lambda text, pos: pos+1
    ).extend(other_grammar)

I think it looks worse than a dedicated kwarg. It's a little shorter but actually no fewer tokens than the kwarg:

g = Grammar("""
    a = b / c
    b = c d
    """,
    rules=dict(d=lambda text, pos: pos+5,
               c=lambda text, pos: pos+1),
    extend=other_grammar)

Plus we'd have to introduce partial grammars (likewise with your from_string example), which adds complexity. We'd have a new thing with different invariants and a multiplicative number of interactions with normal Grammars to decide about (and which users would have to learn).

For those reasons, I still favor the dedicated kwarg—or a spelling convention: we could reserve a certain segment of kwarg namespace for our non-rule kwargs. But no good conventions spring to mind.

I'm still curious whether you expect to need custom rules often; it seems a pretty niche case to optimize for. Let me know; I'm psyched to get something implemented.

@keleshev
Contributor

I don't think custom rules will be used very often.

@erikrose
Owner Author

Realized that abstract supergrammars are inherently incomplete grammars and will need to be representable somehow. Rethinking based on that. + may return.

@erikrose
Owner Author

Okay, let's think about going back to kwarg-style custom rules. They're nice because they (a) have less syntax and (b) are more dict-like. The problem with them is that they smash into future constructor args.

The reasons I can think of to have a constructor arg are...

  1. To input stuff you need to satisfy invariants
  2. To influence how initialization computations are done (e.g., compiler hints)
  3. To change how the whole instance behaves by altering its invariant rules or default behaviors. For example, you could mark a grammar as abstract or turn off by-default caching.

Barring throwing away our basest assumptions, we're clearly not going to add more requirements to instantiate a Grammar, so we're safe from (1). (3) can be addressed by making different flavors of Grammars (like AbstractGrammars) or adding alternate constructors. I don't foresee any of (2) at the moment; we're going to try to keep Parsimonious's optimizer pretty light-touch.

So let's try AbstractGrammar and .default() and Grammar(rules='', **more_rules) and +, and we can at least enjoy that beauty for a while until we are unable to escape a reserved kwarg. Hopefully one of those will show its face before 1.0 if it's going to at all.

Problem: we make a big deal out of having to declare you want an AbstractGrammar, but addition could yield abstract or concrete ones, with no lexically evident clue as to which. This is such a corner case, though; we'll just let it slide. It is pretty Pythonic to defer the error anyway, as raise NotImplementedError does for abstract classes' methods. We'll raise a nice error ("Hey, this is missing 'foo' and 'smoo'") when they try to parse.

The motivation for making users say AbstractGrammar up front is that they get prompt error messages when designing grammars in the usual, non-abstract case. It makes intent clearer, too, when reading the code. AbstractGrammars will have parse() and match() methods that raise errors, and __getitem__ will raise KeyError with a descriptive message as well.
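
A small sketch of how the Grammar(rules='', **more_rules) spelling reads, pulling an Expression out of one grammar and handing it to another as a keyword rule. This kwarg interface is close to what Parsimonious eventually shipped; AbstractGrammar and + remain proposals at this point.

```python
from parsimonious.grammar import Grammar

greetings = Grammar('greeting = "hi" / "hello"')

letter = Grammar("""
    greeting_line = greeting " mom"
    """,
    greeting=greetings['greeting'])  # the keyword arg supplies the otherwise-missing rule

print(letter.parse('hello mom'))
```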

@lucaswiman
Collaborator

Just writing down some misc. thoughts about this issue. I might make a PR implementing something like this. I think it's not really that much code.

My main use case for this at the moment is extending Parsimonious's syntax to allow richer expressions in token grammars. However, I don't think the syntax would make sense in general, just for my particular case. So I'd want to say something like:

"""
...somehow import parsimonious.grammar.rules_syntax...
quantifier = quantifier^ / attrs_expr
attrs_expr = "[" _ "@" identifier "=" atom "]"
"""

Then I'd subclass `RuleVisitor` with an updated `visit_quantifier` and `visit_attrs_expr`, making the new syntax usable for my use case without either complicating the Grammar DSL or forking the library.

The issue is that because of the way references are resolved, there's not a good way to replace `quantifier` in the rules in the "super" grammar that use it. There, I specifically want the "new" version to override the old in all expressions in the super grammar. A few options:

  1. Deepcopy the original grammar and traverse the exprs, replacing every node bearing the given name with the new rule.

  2. Use custom rules:

```python
from parsimonious.grammar import rule_syntax, rule_grammar, Grammar

new_grammar = Grammar(rule_syntax, custom_rules={"quantifier": custom_quantifier_rule, "attrs_expr": attrs_expr_rule})
# ...extend RuleVisitor, etc.
```

This works, but it's a bit clunky. It seems to me the key insight behind Parsimonious's good usability, compared to past parser-generator libraries in Python, is that a flexible string-parsing library can take its own input as strings and doesn't have to rely on the syntax of the host language. So the most natural kind of composition for Parsimonious grammars feels like just composing strings in the ordinary way, possibly with a separator between sections. If grammars just retained their input string (and maybe their custom rules), then this becomes easy. In a system like this, my example above would look like:

from parsimonious.grammar import Grammar, RuleVisitor

new_rule_grammar = Grammar(fr"""
{Grammar.rule_grammar.definition}
------------
quantifier = quantifier^ / attrs_expr
attrs_expr = "[" _ "@" identifier "=" atom "]"
""")
class MyCustomRuleVisitor(RuleVisitor):
    def visit_quantifier(...):
        ...
    def visit_attrs_expr(...):
        ...

class MyCustomGrammar(Grammar):
    visitor_cls = MyCustomRuleVisitor  # requires a small refactor of `_expressions_from_rules`

For the separator, I'm thinking at least two `-` or `=` characters.

This skirts past a lot of the problems with validating "abstract" grammars: they're just strings. To extend an abstract grammar, you do something like:

Grammar(fr"""
{abstract_grammar}  # references but does not define ruleX
ruleX = ...
""")

@lucaswiman lucaswiman mentioned this issue Apr 20, 2022