-
-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pest language evolution #333
Comments
This sounds quite good, specially the part about modules. It would make things like I probably need more time to give you a more complete opinion (as mentioned in gitter, this semester is getting quite hectic in assignments 😿) but overall I think the issues this RFC addresses make a lot of sense! |
Would it not make sense to declare the pest version in the pest file? That keeps the information closer (and I wouldn't have to duplicate the attribute for pest-ast). Draft: pest files MAY start with a The rest of the file is then parsed with the appropriate parser. (Or if only semantics change and the new grammar is a superset, one parser is used and the semantics are set.) |
Due to the way the trivia behaves in an unexpected and non-obvious way from looking at the grammer alone, I feel like you should use these operators instead:
These would communicate behaviour and intent more clearly as well as being a bit more 'obvious' on first look. I think most programmers would expect the non-trivia operation to be the default -- i.e. one thing follows another without surprise and the inclusion of the Keeping the non-trivia operators plain / spare better matches standard regex behaviour which is where the majority of programmers are going to recognise these operators from. |
i agree with @Kroc: |
Other than concern over operators, this spec is a skilful and elegant summary of Pest that aims to increase functionality in critical areas without causing a combinatorial explosion of complexity. |
As a counterpoint: "traditional" parsing that uses a lexer tends to drop trivia and ignore it in the first preprocessing step. e.g. The whole idea of "trivia" in programming language grammars is that the trivia can be ignored; that it can appear anywhere and not interfere with anything. The only time I'd see a "normal" language using strict no-trivia sequences is in the definition of the trivia itself and in tokens. In my "modern parser tool" sketch, every repetition consumed trailing trivia. In the last version of syn that used nom instead of the proc_macro lexer, each "terminal" parser was expected to eat (and ignore) leading trivia. I don't disagree that making it clear is a bad idea, just that no-trivia is an expected default mode of operation. |
Minor note: Currently, repetitions match trailing trivia. This doesn't cause much issue in practice, but in a world where strict sequencing is easier to do we should probably try to avoid that footgun and make |
I think the right choice here would be to simply have the uglier version where it will be used least. To me, that seems to be the non-trivia case, but I think it's best to actually take a look over multiple real-world use cases in order to figure this one out. I guess @CAD97 already has this in nafi. Also, @CAD97, that seems to be a bug! It's caused by the fact that |
I'm still not completely sure whether in-file version definition is entirely possible in the long run. I'm approaching a second RFC for an intermediate-representation that the optimizer and generator would use that is long over-due. The optimizer should have a much cleaner approach that is smarter and less error-prone. |
When compiling or interpreting. But not when e.g. auto-refactoring. |
I'd like to propose a third sequencing operator that requires trivia between two other rules. To keep things symmetric (and because it's already reserved) I like
And have quite a few rules that look something like:
It would be great to just write |
Also, I tend to agree with @Kroc and @flying-sheep regarding repetition operators. It's also worth considering how parameterized rules will shift the trade-offs. For example:
Generally speaking, the expressive power of parameterized rules makes any implicit behaviour of builtin operators much less of a clear win, as it's trivial to explicitly define reusable precise parsing behaviour. |
I missed the "Controlled trivia" section of the RFC when I wrote the above comments. It seems like I might be able to do what I want by redefining
counter proposalIt seems to me that the grammar language could be simplified a lot (without losing much in the way of power) by completely removing automatic trivia and it's associated operators and adopting two straightforward rules.
Given the above, users can trivially define the current behaviour of
To get shorthand operators for repetition, we add a second rule:
I imagine that many grammars are relying heavily on the trivia handling of benefits
trade-offs
|
@grncdr, this is a really good proposal. I think it's fairly safe to say that the transition will not be very hard to do with a small tool. The only issue I have with this approach is that |
@grncdr The biggest problem with "simple adjacency" I actually do like the purity of having "simple adjacency" be "strict sequencing" and Also, the definition |
We could have the following approach:
|
problems with adjacency
I agree that should repetition include trailing productions?There's some difference of opinion here:
It boils down to a subjective choice, repetitions allowing trailing productions are common (e.g. trailing whitespace, comments, and even commas in many languages) but so is strict Parenthesis ambiguity and higher-order rulesI separated this out because it's quite a bit more "hypothetical" than my other comments Regarding the
If Pest generated an error like this:
I think most people would not be confused by the distinction between The only thing that prevents me from proposing this more strongly is the possibility of passing unapplied parameterized rules as a parameters. E.g. should the following grammar should be legal?
If not, then overloading the meaning of parenthesis should be fine. If you do want to allow higher-order parameterized rules, then an exception like "a parameterized rule can be referenced without applying it only when passed as a parameter" would probably work, but I don't (yet?) see much purpose for this degree of indirection in grammars. |
QuestionsI'm familiar with other peg parsers, so there are some things about pest that make me scratch my head. It would help if I understood the motivation of some decisions. The first would be that pest doesn't let you specify code that executes and specifies the return value for each production variant of a rule. Instead, it seems pest just creates its own AST and lets you traverse through that to do your own execution. I'm curious about the advantages of doing things this way. The second would be that pest requires The third would be that pest needs to have SuggestionsI think pest should have versioning at the top of the .pest file. It can default to version 1.0 if none is specified. This would allow future versions of pest to check if it can handle the specified version, and specify which crate version to use otherwise. It could also specify a tool to convert the grammar to a later version. I like the proposal @grncdr put forth of using simple adjacency for concatenation and having The For parametrized rules, I have some questions.
Would it allow quantifiers as arguments?
If so, this would justify the need for a concatenation operator, or alternatively a quantifier operator would be needed. For example doing so with
|
Disclaimer: I joined the party pretty recently, only just before the release of 2.0. I don't speak about the history from a authoritative source, but more from what I've picked up along the way.
The main benefit is purity. Pest's area of focus is as a parsing library, not as an AST. If you want to use something like rowan to handle your syntax tree, then you'll need to adapt your parsing tools' output anyway. When I was using ANTLR5, which allows arbitrary actions to be attached to rules, I personally still found myself mapping the generated parse tree to a custom abstract syntax tree. Another benefit of that purity is that things like the web VM are possible. ANTLR has a interpreter mode, but it chokes on any real grammar because those contain actions rather than everything being sematic in the grammar. Because pest doesn't allow actions, the interpreter works just as perfectly as the compiled version. Pest's ideal is also that the grammar serves as a formal grammar for your language, rather than just a series of instructions on how to generate a parser. Pest's job is to generate the parse tree, and full control over the structure of the parse tree allows it to do some clever things to reduce the required nesting (as in, the pest tree is really just a single vector behind the tree abstraction). It's the same reason that the unified rule typing exists: working with a singular type is more convenient than a hundred or more. In anything much bigger than an example project, you'll probably want any parse tree to convert into a different tree structure so that the rest of your program can ignore information that has to be stored in order to parse. pest-ast has proven the feasibility of building a syntax tree layer on top of pest's parse tree. That said, we are definitely interested in potentially making 3.0's user-exposed API a lot more typed, since that's a common pain point. It's mostly just blocked on @dragostis's work on the optimizing parsing engine that it's planned to be built on.
IIUC, the initial reason for using That said, you still have to consider the atomic rules. The keyword I realize real languages are messy. I personally agree that the conflation of atomic rules with removing implicit trivia is problematic; that's the reason for the proposed trivia handling being
It's just a stylistic choice, though I believe it's one again rooted in pest originally being implemented by Maybe you'd like the draft I have of a syn-compatible parser generator?
I'm not sure what you mean about nested rules. You seem to be implying that indenting rules somehow "scope" them to the prior rule? No, they're just normal rules.
No, arguments must themselves be expressions. |
Yeah, that's what I was thinking. I took a look at the sample in the introduction and was not sure what the indentation was doing, so I got carried away. There have been a few times where I've wanted rule scoping in the past. |
Apart from a few other quality of life additions like raw strings, the 3.0 grammar will probably make braces optional. They were added for historical reasons, as @CAD97 mentioned, and kept for the sake of consistency. With 3.0 we should be able to take more time and refine as much as possible before the final release. I currently have a working parser for 3.0 that I'm experimenting with and the solution I found most intuitive is having overridable sequence operators. Apart from WIP of what I've been playing with. |
Is there a way to track progress of v3? |
@Keats, the somewhat little progress I've made lives here: https://github.com/pest-parser/pest3 I'd like to invest more time, but it would be far easier for me if there was a way to collaborate with someone on this and talk about what needs to be done. |
Is the initial comment up to date with the goals? Maybe the scope could be lowered a bit and not handle multiple versions if it's a lot of work. A pest2 parser will still work and someone can update to pest3 if wanted/needed.
|
Small idea regarding adjacency and parentheses: we could require parameterized rules to be macro-like:
I believe this removes all ambiguity and we can use adjacency for sequencing. |
To make the transition smooth, it would be easy to support two syntaxes for the grammar for a while: the code generation need not change, and we would be able to deprecated the old syntax over a long period of time. No need for any breaking change in fact I think. |
Would COMMENT and WHITESPACE still exist with this proposal? If it does, how would it work with modules? Each module will have it's own definition of them that or not overridable? Or will they be overridable? |
I tried to re-hash this discussion and other ideas from different issues into multiple threads here: #885 |
@Kroc @dragostis any thoughts on this: #1016 (reply in thread) ? |
@jhoobergs in the prototype, those two special builtin rules don't make sense (given one can define the trivia on the operators). I guess it'll make sense if they are scoped to each module? |
Summary
This RFC hopes to address the concerns in #197, #261, #271, and #329 by laying the foundation of pest's evolution and transition.
Motivation
While pest grammars offer an expressive language for building grammars, they lack certain features we've become accustomed with in programming languages which weakens their effectives as expressive and reusable tools. With the growing popularity of the project, more and more discussion has been focused on improving the predictability of pest as a language and a number of needs have been put forth: trivia handling, reusability, expressiveness, and general consistency.
Trivia handling complexity
Probably the hardest concept to grasp when first learning the ropes is how trivia, i.e. whitespace and comments, are handled. pest has an automatic mechanism that simply permits trivia to live between expressions which is controlled by atomicity. Since atomic rules are cascading, it's not immediately obvious if two sequenced expressions
a ~ b
accept trivia—it wholly depends on whether or not the current rule inherits atomicity.The example above illustrates how
confusing
can accept trivia in some cases but not others.Reusability of expressions and rules
While rules can be composed from one another, there is currently no means to parametrize them. Parametrization can be extremely useful in cases where some idioms are often reused, e.g. repeated, separated values. Currently, you need to repeat some form of
e ~ ("," ~ e)*
which is less readable thanseparated(e, ",")
.Though less immediately useful, another addition would be to be able to use rules from different grammars.
Expressiveness
Improving expressiveness is somewhat of a continuously open question. In 2.0 we've added additional stack calls that help recognize indentation-sensitive languages, namely
PEEK_ALL
,POP_ALL
, andDROP
. This conservative design was adopted in order to better understand what exactly is needed in real-world examples.However, legitimate need of more refined localization within the stack has been illustrated in #329. Being able to accurately slice the stack for every one of the
PEEK
,POP
,DROP
calls seems to be required going forward.General consistency
With the introduction of built-in rules, capitalization has been selected as a form of differentiation from user-defined rules. Capitalized are also stack calls, start- and end-of-input calls, and unicode categories. The only way of differentiating between them is to simply know ahead of time what they do.
Guide-level explanation
Versioning
The pest language will be versioned according to the semver guide and grammar language versions will be optionally selected before parsing. This will ensure a smoother transition to 3.0, and beyond, it will be enabling users to opt-in to the newer version early on.
Modules
Akin to Rust's modules, a module can contain rules or other modules. This removes the need for capitalization of built-in rules. They can be part of separate modules.
Parametrizable rules
Rules will have optional arguments. Their definition will be parametrizable with argument names, all of them being valid pest expressions.
Controlled trivia
The infix sequence operator
~
itself will be a user-defined rule:Without any
~
defined,~
,*
,+
, and{}
operators will all run according to their definitions without accepting any trivia between expressions. When it is defined, the repetitions will make use of the sequence operator:In order to be able to have both trivia-accepting and non-trivia-accepting operators working together, separate non-trivia operators will be introduced, namely
-
for sequence and all repetitions preceded by it:~
-
*
-*
+
-+
{n}
-{n}
{n..}
-{n..}
{..n}
-{..n}
{..=n}
-{..=n}
{m..n}
-{m..n}
{m..=n}
-{m..=n}
Stack slicing
Stack slicing will work similarly to Rust slicing with the exception that ranges will accept negative end values, similarly to Python. Slicing will happen from bottom to top such that for a stack
[a, b, c, d, e]
:[1] == a
[-1] == e
[1..4] == [b, c, d]
[1..-1] == [b, c, d]
[1..=-1] == [b, c, d, e]
[..-2] == [a, b, c]
As such,
pest::stack::*
, i.e.peek
,pop
,drop
, can be optionally sliced or indexed, e.g.pest::stack::peek[..-1]
. The indices will be constant with the exception of those relative to the top of the stack since the stack's size is variable.Reference-level explanation
The grammar's version will be selected through the grammar attribute:
#[grammar = "grammar.pest", version = "3.0"]
pest_meta
will handle both grammar language versions during the2.*
transition period, then migrate to 3.0. This will need to be enforced if we want to take advantage of the more concise grammars during optimization and generation.Much of the rest of this RFC is straight-forward:
pest_meta
andpest_generator
)pest_meta
andpest_generator
)Drawbacks
Breaking compatibility so early could be dangerous, but we can offer help for people migrating to 3.0. If need be, we could also offer a pest fix tool that would be able to convert 2.0 to 3.0 grammars.
Some of the syntax introduced in the trivia handling might be a little heavy on the eye and we might want to fine tune it before it's set in stone.
The text was updated successfully, but these errors were encountered: