-
-
Notifications
You must be signed in to change notification settings - Fork 116
Conversation
Hi. I'm playing around with re-implementing the JDL parser using the Chevrotain parsing library. This is related to the technology choice discussion in #141 and the This PR currently contains a lexer for JDL implemented using Chevrotain RegExp based lexer engine. There are some minor changes from the original JDL pegjs implementation documented in the comments. The next step is to try and convert the grammar itself (just syntax, no AST building). |
@bd82 this looks interesting and personally, I like it as it seems more readable than PegJS syntax. |
Thanks @deepu105 . Being an internal DSL the grammar itself is imho a little uglier than pure EBNF style syntax, but still highly readable by using vertical spacing. Example: // comments will be handled outside(after) the parser in this implementation.
$.RULE('entityBody', () => {
$.CONSUME(t.LCURLY);
$.AT_LEAST_ONE(() => {
$.SUBRULE($.fieldDec);
});
$.CONSUME(t.RCURLY);
});
$.RULE('fieldDec', () => {
$.CONSUME(t.NAME);
$.SUBRULE($.type);
// Short form for: "(X(,X)*)?"
$.MANY_SEP({
SEP: t.COMMA,
DEF: () => {
$.SUBRULE($.validation);
}
});
$.CONSUME(t.RCURLY);
}); With pegjs it would look something like: (taken from existing Grammar)
Which is in theory could be prettier, but because many things were added:
The end result is much less readable imho as there is no separation of concerns... But it is not just about readability, it is also about maintainability. |
How a parser implemented using Chevrotain would look like. Next step would be some tests to demonstrate capabilities.
Please have a look at a subset of the grammar implemented in latest commit. The next step would be to add a few tests to demonstrate capabilities on this JDL grammar subset.
Hopefully I will have time to implement some of these tomorrow. |
lib/dsl/chev_grammar.js
Outdated
|
||
// HIGHLIGHT: | ||
// "MIN_MAX_KEYWORD" is an "abstract" token which other concrete tokens inherit from. | ||
// This can be used to reduce verbosity in the parser. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in the validation rule instead of specifying the six different keywords.
https://github.com/jhipster/jhipster-core/pull/142/files#diff-802ee05eaf770a8bbbc2fe7ef13a3efaR233
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the corresponding section in the existing grammar:
jhipster-core/lib/dsl/grammar.txt
Lines 511 to 522 in fd8f712
/ MINLENGTH SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'minlength', value: int }; } | |
/ MINLENGTH SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'minlength', value: constantName, constant: true }; } | |
/ MAXLENGTH SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'maxlength', value: int }; } | |
/ MAXLENGTH SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'maxlength', value: constantName, constant: true }; } | |
/ MINBYTES SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'minbytes', value: int }; } | |
/ MINBYTES SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'minbytes', value: constantName, constant: true }; } | |
/ MAXBYTES SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'maxbytes', value: int }; } | |
/ MAXBYTES SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'maxbytes', value: constantName, constant: true }; } | |
/ MIN SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'min', value: int };} | |
/ MIN SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'min', value: constantName, constant: true }; } | |
/ MAX SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'max', value: int };} | |
/ MAX SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'max', value: constantName, constant: true }; } |
The token inheritance does not have to be used.
It is an example for what is possible and could be considered...
lib/dsl/chev_grammar.js
Outdated
// very important to call this after all the rules have been defined. | ||
// otherwise the parser may not work correctly as it will lack information | ||
// derived during the self analysis phase. | ||
Parser.performSelfAnalysis(this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is during the object's construction-time. All this. Why not having another way?
One file is enough, but putting everything in the constructor isn't really something I look forward to maintaining, even if the improvement of using using Chevrotain over PegJS is obvious. Why not, for instance, use a factory of some sort (a function that calls other functions to build the parser instance)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not having another way?
Answer
The syntax I prefer relies on using class fields ESNext syntax.
https://github.com/tc39/proposal-class-fields
But this is not yet supported afaik (currently stage 3 proposal).
I suppose Babel will support this at some point:
babel/proposals#12
TypeScript has something similar which already works now.
See this example:
This is similar the "official" API I'm aiming for, but may need to wait for ES2018 for that. :(
Alternative
Anyhow as it is all just plain JavaScript you can define it (mostly) however you want...
An extreme example would be this completely different DSL for specifying Chevrotain grammars
https://github.com/kristianmandrup/chevrotain-rule-dsl
I'm am a bit too tired to think a concrete alternative syntax right now.
But I believe one should be possible even with ES6, perhaps you have a suggestion?
The constraints are:
-
Parser.performSelfAnalysis must be called after the rules have been defined.
- As it relies on side effects of creating the rules.
-
The RULE calls must be called in the context of the parser instance (this).
And if it helps normally you only use a single parser instance and reset it's internal state before each use.
Future / Long term.
There is also an open issue for better support of custom APIs for building Chevrotain parsers.
And I'm hoping in the long term to support three different API styles (same as Mocha/Chai have different APIs using the same underlying engine).
- Low Level Hand-Built style.
- Combinator Style, fluent DSL.
- EBNF generator style (Like pegjs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a really quick and dirty factory style hack.
https://github.com/SAP/chevrotain/blob/5235a12da1818aaf2ac075cd4326d46e46da15fc/examples/grammars/json/json.js#L95-L126
And here are the rules defined outside the constructor.
https://github.com/SAP/chevrotain/blob/5235a12da1818aaf2ac075cd4326d46e46da15fc/examples/grammars/json/json.js#L129-L180
I don't think this should be part of Chevrotain's official API
as I would rather wait for class fields proposal, but it can be cleaned up and reused
by end users if needed...
Also note that this factory mixes in the rules, so they could easily be split up
to multiple files for large grammars.
Hope this example demonstrates how due to Chevrotain being a library
instead of a code generator makes it much more malleable to customization. 😄
Wow great work @bd82 and thanks |
Happy to help @deepu105 😄 Latest commit cleaned up a bit and has a small parser test (happy path) which you can debug. |
* Lexer, Parser and APIs in different files. * A single test which parses a simple valid input and outputs a CST.
Wow. Nice! |
* Automatic Error recovery. * Syntatic content assist.
Added some more examples for both syntactic content assist Additionally syntax diagrams can be generated from the grammar. |
81a81dd
to
6df5348
Compare
I think there is now enough content and E2E flows in this POC to be worth discussing and reviewing. |
Added the discussion issue: |
const NAME = chevrotain.createToken({ name: 'NAME', pattern: namePattern }); | ||
|
||
|
||
function createToken(config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two functions could be created and type tests could be avoiided. Like createPatternToken
and createStringToken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is subjective style question, Choose whichever you prefer...
There are actually four/five possible argument types documented in the Docs
In addition using strings is just a convenience style, if you prefer conformity
and no runtime type checks you can replace the strings with regExps, eg:
// both of these are equivalent.
createToken({ name: 'ENTITY', pattern: "entity" });
createToken({ name: 'ENTITY', pattern: /entity/ });
createToken({ name: 'DOT', pattern: '.' }); | ||
|
||
// Imperative the "NAME" token will be added after all the keywords to resolve keywords vs identifier conflict. | ||
tokens.NAME = NAME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If order matters, are there other rules to remember when writing a parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are many rules as with any complex domains.
Most of these rules are not specific to Chevrotain but to general writing of parsers.
E.g: Keywords vs Identifiers in Antlr4
In general Chevrotain tries to detect these issues and provide useful error messages
or even links with detailed instructions[1] [2] on how to resolve those.
But not everything can (currently) be automatically detected (such as keyword vs identifiers).But you just gave me an idea how to automatically detect keywords vs Identifiers!
👍
Awesome work. Though I have no idea how the formatting could be utilized with codeMirror we use in JDL studio, personally formatting is not a requirement at this point but if easy to integrate it would be cool as well |
Neither do I 😄. |
That escalated quickly 😄 |
** DO NOT MERGE WIP**