This repository has been archived by the owner on Mar 21, 2021. It is now read-only.

WIP - POC with Chevrotain Parser. #142

Merged
merged 9 commits into from
Sep 23, 2017

Contributor

@bd82 bd82 commented Aug 28, 2017

** DO NOT MERGE WIP**

@bd82
Contributor Author

bd82 commented Aug 28, 2017

Hi. I'm playing around with re-implementing the JDL parser using the Chevrotain parsing library.

This is related to the technology choice discussion in #141 and the
preceding discussion in jhipster/generator-jhipster#6275
This may need its own separate issue...

This PR currently contains a lexer for JDL implemented using Chevrotain's RegExp-based lexer engine.

There are some minor changes from the original JDL pegjs implementation documented in the comments.

The next step is to try and convert the grammar itself (just syntax, no AST building).

@deepu105
Member

@bd82 this looks interesting and personally, I like it as it seems more readable than PegJS syntax.

@bd82
Contributor Author

bd82 commented Aug 29, 2017

Thanks @deepu105 .

Being an internal DSL the grammar itself is imho a little uglier than pure EBNF style syntax, but still highly readable by using vertical spacing.

Example:

    // comments will be handled outside(after) the parser in this implementation.
    $.RULE('entityBody', () => {
      $.CONSUME(t.LCURLY);
      $.AT_LEAST_ONE(() => {
        $.SUBRULE($.fieldDec);
      });
      $.CONSUME(t.RCURLY);
    });

    $.RULE('fieldDec', () => {
      $.CONSUME(t.NAME);
      $.SUBRULE($.type);
      // Short form for: "(X(,X)*)?"
      $.MANY_SEP({
        SEP: t.COMMA,
        DEF: () => {
          $.SUBRULE($.validation);
        }
      });
    });

With pegjs it would look something like this (taken from the existing grammar),
which is a lot more "horizontal"...

entityBody
  = '{' SPACE* fdl:fieldDeclList SPACE* '}' { return fdl; }
  / '' { return []; }

fieldDeclList
  = SPACE* com:comment? SPACE* f:FIELD_NAME SPACE_WITHOUT_NEWLINE* t:type SPACE_WITHOUT_NEWLINE* vl:validationList? SPACE_WITHOUT_NEWLINE* com2:comment? SPACE_WITHOUT_NEWLINE* ','? SPACE* fdl:fieldDeclList {
    return addUniqueElements([{ name: f, type: t, validations: vl, javadoc: com || com2 }], fdl );
  }
  / '' { return []; }

Which in theory could be prettier, but many things were added:

  1. "SPACE*" everywhere because pegjs cannot ignore tokens.
  2. labels (e.g. vl:validationList?)
  3. JS code snippets to execute (semantic actions).
    • With Chevrotain you can optionally implement the semantic actions outside the grammar.

The end result is much less readable imho as there is no separation of concerns...

But it is not just about readability, it is also about maintainability.
With Chevrotain you can place a breakpoint anywhere in your grammar and just debug it
like any other JavaScript code you write. :)
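
To illustrate the separation of concerns described above, here is a minimal hand-written sketch in plain JavaScript. It is not Chevrotain's API: `tokenize`, `parseEntityBody` and `toAst` are hypothetical names, and the tiny "name Type" field syntax is a simplification of JDL. The parser functions only record structure (a CST), and a separate function applies the semantic actions afterwards.

```javascript
// Hand-written illustration, NOT Chevrotain's API.
// Split a tiny "{ name Type ... }" snippet into tokens.
function tokenize(text) {
  return text.match(/[{}]|[A-Za-z_]\w*/g) || [];
}

// Syntax only: these functions record structure and build no AST.
function parseEntityBody(tokens) {
  let pos = 0;
  const consume = (expected) => {
    const tok = tokens[pos++];
    if (expected !== undefined && tok !== expected) {
      throw new Error(`Expected "${expected}" but found "${tok}"`);
    }
    if (tok === undefined) throw new Error("Unexpected end of input");
    return tok;
  };

  const fieldDec = () => ({
    rule: "fieldDec",
    children: { name: consume(), type: consume() },
  });

  consume("{");
  const fields = [fieldDec()]; // at least one field declaration
  while (tokens[pos] !== "}") {
    fields.push(fieldDec());
  }
  consume("}");
  return { rule: "entityBody", children: { fieldDec: fields } };
}

// Semantic actions live outside the grammar: CST in, AST out.
function toAst(cst) {
  return cst.children.fieldDec.map((field) => ({
    name: field.children.name,
    type: field.children.type,
  }));
}

const cst = parseEntityBody(tokenize("{ title String age Integer }"));
const fields = toAst(cst);
```

Because the grammar functions are ordinary code, a breakpoint inside `fieldDec` or `toAst` works like in any other JavaScript.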

How a parser implemented using Chevrotain would look like.

Next step would be some tests to demonstrate capabilities.
@bd82
Contributor Author

bd82 commented Aug 29, 2017

Please have a look at a subset of the grammar implemented in latest commit.

The next step would be to add a few tests to demonstrate capabilities on this JDL grammar subset.

  • Autocomplete support.
  • Building an AST using external semantic actions (not embedded in the grammar as with pegjs).
  • Linking jsdocs comments back to the AST.
  • Multiple Syntax Errors (for a single input text).
  • Error Recovery.
  • Extracting data required to implement a JDL code formatter.
  • Syntax Diagrams.

Hopefully I will have time to implement some of these tomorrow.


// HIGHLIGHT:
// "MIN_MAX_KEYWORD" is an "abstract" token which other concrete tokens inherit from.
// This can be used to reduce verbosity in the parser.
Member

How so?

Contributor Author

It is used in the validation rule instead of specifying the six different keywords.
https://github.com/jhipster/jhipster-core/pull/142/files#diff-802ee05eaf770a8bbbc2fe7ef13a3efaR233

Contributor Author

Here is the corresponding section in the existing grammar:

/ MINLENGTH SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'minlength', value: int }; }
/ MINLENGTH SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'minlength', value: constantName, constant: true }; }
/ MAXLENGTH SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'maxlength', value: int }; }
/ MAXLENGTH SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'maxlength', value: constantName, constant: true }; }
/ MINBYTES SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'minbytes', value: int }; }
/ MINBYTES SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'minbytes', value: constantName, constant: true }; }
/ MAXBYTES SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'maxbytes', value: int }; }
/ MAXBYTES SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'maxbytes', value: constantName, constant: true }; }
/ MIN SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'min', value: int };}
/ MIN SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'min', value: constantName, constant: true }; }
/ MAX SPACE* '(' SPACE* int:INTEGER SPACE* ')' { return { key: 'max', value: int };}
/ MAX SPACE* '(' SPACE* constantName:CONSTANT_NAME SPACE* ')' { return { key: 'max', value: constantName, constant: true }; }

The token inheritance does not have to be used.
It is an example for what is possible and could be considered...
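
As a rough sketch of why an "abstract" token reduces verbosity, the plain JavaScript below models token types with a parent link, so one rule body can accept any of the six keywords. This is a hand-rolled illustration, not Chevrotain's token API: `tokenType`, `isA`, `minMaxKeywords` and `parseMinMaxValidation` are hypothetical names, and only the INTEGER variant of the rule is covered.

```javascript
// Hand-rolled illustration only; Chevrotain's real token API differs.
function tokenType(name, parent = null) {
  return { name, parent };
}

const MIN_MAX_KEYWORD = tokenType("MIN_MAX_KEYWORD"); // the "abstract" token
const LPAREN = tokenType("LPAREN");
const RPAREN = tokenType("RPAREN");
const INTEGER = tokenType("INTEGER");

// Six concrete keywords, all inheriting from MIN_MAX_KEYWORD.
const minMaxKeywords = Object.fromEntries(
  ["minlength", "maxlength", "minbytes", "maxbytes", "min", "max"].map(
    (name) => [name, tokenType(name, MIN_MAX_KEYWORD)]
  )
);

// A token type matches itself or any of its ancestors.
function isA(type, expected) {
  for (let t = type; t !== null; t = t.parent) {
    if (t === expected) return true;
  }
  return false;
}

// One rule body instead of six alternatives: KEYWORD '(' INTEGER ')'
function parseMinMaxValidation(tokens) {
  const [keyword, lparen, value, rparen] = tokens;
  const ok =
    tokens.length === 4 &&
    isA(keyword.type, MIN_MAX_KEYWORD) &&
    lparen.type === LPAREN &&
    value.type === INTEGER &&
    rparen.type === RPAREN;
  if (!ok) {
    throw new Error("Expected: <min/max keyword> '(' INTEGER ')'");
  }
  return { key: keyword.type.name, value: Number(value.image) };
}

const validation = parseMinMaxValidation([
  { type: minMaxKeywords.maxbytes, image: "maxbytes" },
  { type: LPAREN, image: "(" },
  { type: INTEGER, image: "42" },
  { type: RPAREN, image: ")" },
]);
```

The key is still recovered from the concrete token type, so the result keeps the same `{ key, value }` shape as the pegjs actions above.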

// very important to call this after all the rules have been defined.
// otherwise the parser may not work correctly as it will lack information
// derived during the self analysis phase.
Parser.performSelfAnalysis(this);
Member

This is during the object's construction-time. All this. Why not having another way?
One file is enough, but putting everything in the constructor isn't really something I look forward to maintaining, even if the improvement of using Chevrotain over PegJS is obvious. Why not, for instance, use a factory of some sort (a function that calls other functions to build the parser instance)?

Contributor Author

Why not having another way?

Answer

The syntax I prefer relies on using class fields ESNext syntax.
https://github.com/tc39/proposal-class-fields
But this is not yet supported afaik (currently stage 3 proposal).
I suppose Babel will support this at some point:
babel/proposals#12

TypeScript has something similar which already works now.
See this example:
This is similar to the "official" API I'm aiming for, but it may need to wait for ES2018. :(
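
For readers unfamiliar with the class fields proposal, the sketch below shows the general shape such a syntax could take. It is an assumption-laden illustration in plain JavaScript: `MiniParserBase` is a hypothetical stand-in for a parser base class (not Chevrotain), and it needs a runtime that supports class fields.

```javascript
// Hypothetical stand-in for a parser base class; NOT Chevrotain itself.
class MiniParserBase {
  constructor() {
    this.ruleNames = [];
    this.analyzed = false;
  }

  // Registers a rule and returns the function that invokes it.
  RULE(name, impl) {
    this.ruleNames.push(name);
    return impl;
  }

  performSelfAnalysis() {
    this.analyzed = true;
  }
}

class JdlSketchParser extends MiniParserBase {
  // Class fields are initialized right after super() returns,
  // so `this.RULE` is already usable in these initializers.
  entityBody = this.RULE("entityBody", () => {
    return ["{", this.fieldDec(), "}"];
  });

  fieldDec = this.RULE("fieldDec", () => "fieldDec");

  constructor() {
    super();
    // Runs after all field initializers, i.e. after every rule is defined.
    this.performSelfAnalysis();
  }
}

const sketchParser = new JdlSketchParser();
```

With this style each rule reads as a declaration on the class, and the constructor shrinks to the single self-analysis call.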

Alternative

Anyhow as it is all just plain JavaScript you can define it (mostly) however you want...
An extreme example would be this completely different DSL for specifying Chevrotain grammars
https://github.com/kristianmandrup/chevrotain-rule-dsl

I'm a bit too tired to think of a concrete alternative syntax right now.
But I believe one should be possible even with ES6, perhaps you have a suggestion?
The constraints are:

  1. Parser.performSelfAnalysis must be called after the rules have been defined.

    • As it relies on side effects of creating the rules.
  2. The RULE calls must be called in the context of the parser instance (this).

And if it helps: normally you only use a single parser instance and reset its internal state before each use.
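
The two constraints above can be respected by a small factory. The sketch below is a minimal illustration under stated assumptions: `MiniParser` is a hypothetical stand-in for the parser base class, and `createParser`, `entityRules` and `fieldRules` are made-up names, not part of Chevrotain.

```javascript
// Hypothetical stand-in for a parser base class; NOT Chevrotain itself.
class MiniParser {
  constructor() {
    this.rules = {};
    this.analyzed = false;
  }

  RULE(name, impl) {
    if (this.analyzed) {
      throw new Error(`Rule "${name}" was defined after self analysis`);
    }
    this.rules[name] = impl;
  }

  performSelfAnalysis() {
    this.analyzed = true;
  }
}

// Factory: every "rule module" is a function invoked with `this`
// bound to the parser instance.
function createParser(...ruleModules) {
  const parser = new MiniParser();
  for (const defineRules of ruleModules) {
    defineRules.call(parser); // constraint 2: RULE runs in the parser's context
  }
  parser.performSelfAnalysis(); // constraint 1: only after all rules exist
  return parser;
}

// Rule modules are plain functions, so they could live in separate files.
function entityRules() {
  this.RULE("entityBody", () => "entityBody");
}

function fieldRules() {
  this.RULE("fieldDec", () => "fieldDec");
}

const jdlParser = createParser(entityRules, fieldRules);
```

The factory owns the ordering, so individual rule modules never need to know when the analysis step happens.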

Future / Long term.

There is also an open issue for better support of custom APIs for building Chevrotain parsers.
And I'm hoping in the long term to support three different API styles (same as Mocha/Chai have different APIs using the same underlying engine).

  1. Low Level Hand-Built style.
  2. Combinator Style, fluent DSL.
  3. EBNF generator style (Like pegjs).

Contributor Author

@bd82 bd82 Aug 29, 2017

Here is a really quick and dirty factory style hack.
https://github.com/SAP/chevrotain/blob/5235a12da1818aaf2ac075cd4326d46e46da15fc/examples/grammars/json/json.js#L95-L126

And here are the rules defined outside the constructor.
https://github.com/SAP/chevrotain/blob/5235a12da1818aaf2ac075cd4326d46e46da15fc/examples/grammars/json/json.js#L129-L180

I don't think this should be part of Chevrotain's official API
as I would rather wait for class fields proposal, but it can be cleaned up and reused
by end users if needed...

Also note that this factory mixes in the rules, so they could easily be split up
to multiple files for large grammars.

Hope this example demonstrates how Chevrotain being a library
instead of a code generator makes it much more malleable to customization. 😄

@deepu105
Member

Wow great work @bd82 and thanks

@bd82
Contributor Author

bd82 commented Aug 30, 2017

Happy to help @deepu105 😄

Latest commit cleaned up a bit and has a small parser test (happy path) which you can debug.
I plan to add more scenarios (as specs) tomorrow.

* Lexer, Parser and APIs in different files.
* A single test which parses a simple valid input and outputs a CST.
@MathieuAA
Member

Wow. Nice!

bd82 and others added 2 commits August 31, 2017 19:36
* Automatic Error recovery.
* Syntactic content assist.
@bd82
Contributor Author

bd82 commented Aug 31, 2017

Added some more examples for both syntactic content assist
and for error recovery / fault tolerance.

Additionally syntax diagrams can be generated from the grammar.
This can be useful both for development purposes and as part of a documentation site.
Diagrams of the current sub-grammar

@bd82 bd82 force-pushed the chev branch 4 times, most recently from 81a81dd to 6df5348 Compare September 8, 2017 12:07
@bd82
Contributor Author

bd82 commented Sep 8, 2017

I think there is now enough content and E2E flows in this POC to be worth discussing and reviewing.
I will create a separate Issue for this (tomorrow) with specific highlights and links to the source code
to help review this fairly large number of lines.

@bd82
Contributor Author

bd82 commented Sep 9, 2017

Added the discussion issue:
#144

const NAME = chevrotain.createToken({ name: 'NAME', pattern: namePattern });


function createToken(config) {
Member

Two functions could be created and type tests could be avoided, like createPatternToken and createStringToken.

Contributor Author

This is a subjective style question, choose whichever you prefer...
There are actually four/five possible argument types documented in the Docs

In addition using strings is just a convenience style, if you prefer conformity
and no runtime type checks you can replace the strings with regExps, eg:

// both of these are equivalent.
createToken({ name: 'ENTITY', pattern: "entity" });
createToken({ name: 'ENTITY', pattern: /entity/ });
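
One caveat when swapping strings for RegExps by hand: a string is matched literally, so any special characters must be escaped in the RegExp form. The sketch below is a hypothetical helper (`escapeRegExp` and `toRegExp` are not part of Chevrotain) showing one way to normalize both pattern styles.

```javascript
// Hypothetical helpers, not part of Chevrotain.
// Escape characters that have a special meaning inside a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Normalize a pattern so callers can pass either a string or a RegExp.
function toRegExp(pattern) {
  if (pattern instanceof RegExp) return pattern;
  if (typeof pattern === "string") return new RegExp(escapeRegExp(pattern));
  throw new TypeError("pattern must be a string or a RegExp");
}

// A plain keyword needs no escaping: same source as /entity/.
const entityPattern = toRegExp("entity");

// A literal dot must become /\./, because /./ would match any character.
const dotPattern = toRegExp(".");
```

So `'entity'` and `/entity/` are interchangeable as written, while a string such as `'.'` corresponds to `/\./` rather than `/./`.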

createToken({ name: 'DOT', pattern: '.' });

// Imperative the "NAME" token will be added after all the keywords to resolve keywords vs identifier conflict.
tokens.NAME = NAME;
Member

If order matters, are there other rules to remember when writing a parser?

Contributor Author

There are many rules, as with any complex domain.
Most of these rules are not specific to Chevrotain but to general writing of parsers.
E.g: Keywords vs Identifiers in Antlr4

In general Chevrotain tries to detect these issues and provide useful error messages
or even links with detailed instructions[1] [2] on how to resolve those.

But not everything can (currently) be automatically detected (such as keywords vs identifiers). But you just gave me an idea how to automatically detect keywords vs identifiers!
👍
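
To make the keywords vs identifiers ordering issue concrete, here is a minimal first-match lexer sketch in plain JavaScript. It is hand-written for illustration and is not Chevrotain's lexer: `lex` and the two token definitions are hypothetical, and the only assumption is that the first token type whose pattern matches wins.

```javascript
// Hand-written first-match lexer; NOT Chevrotain's lexer engine.
// At each position, the FIRST token type whose pattern matches wins.
function lex(text, tokenTypes) {
  const tokens = [];
  let rest = text.trimStart();
  while (rest.length > 0) {
    const type = tokenTypes.find((t) => t.pattern.test(rest));
    if (!type) throw new Error(`Unexpected input at: "${rest}"`);
    const image = type.pattern.exec(rest)[0];
    if (image.length === 0) throw new Error(`${type.name} matched empty input`);
    tokens.push({ name: type.name, image });
    rest = rest.slice(image.length).trimStart();
  }
  return tokens;
}

const ENTITY = { name: "ENTITY", pattern: /^entity\b/ };
const NAME = { name: "NAME", pattern: /^[a-zA-Z_]\w*/ };

// Keywords listed before NAME: "entity" is recognized as a keyword.
const keywordsFirst = lex("entity Book", [ENTITY, NAME]).map((t) => t.name);

// NAME listed first: it also matches "entity", so the keyword is never produced.
const nameFirst = lex("entity Book", [NAME, ENTITY]).map((t) => t.name);
```

Here `keywordsFirst` is `["ENTITY", "NAME"]` while `nameFirst` is `["NAME", "NAME"]`, which is why the NAME token has to be registered after all the keywords.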

@deepu105
Member

Awesome work. Though I have no idea how the formatting could be utilized with codeMirror we use in JDL studio. Personally, formatting is not a requirement at this point, but if it is easy to integrate it would be cool as well.

@bd82
Contributor Author

bd82 commented Sep 15, 2017

have no idea how the formatting could be utilized with codeMirror we use in JDL studio.

Neither do I 😄.
The important point is that this approach enables future Editor tooling extensions
by keeping all the syntactic data, and more importantly it enables those without needing to modify the parser.

@MathieuAA MathieuAA merged commit b81bb09 into jhipster:master Sep 23, 2017
@deepu105
Member

That escalated quickly 😄

3 participants