
Have a look at magic-string by way of optimize-js for the jison parser kernel code stripper #2

Closed
GerHobbelt opened this issue Nov 17, 2016 · 2 comments


@GerHobbelt
Owner

For inspiration:

GerHobbelt pushed a commit that referenced this issue Nov 17, 2016
Stop silencing useful information.
@GerHobbelt
Owner Author

Since 0.6.0-191: the lexer has started to use recast/prettier, though there is still no lexer-level (action) code analyzer in place.
jison itself (the parser generator) still uses regexes for code analysis and code stripping/injecting, but that is to change in later builds/releases as recast et al. get introduced there.

I'll be testing these libraries (and optimize-js + magic-string) further, as I haven't found a clear winner yet. AST traversal and code flow analysis are obviously the way forward to produce leaner (and thus faster) code, but the dev cost is pretty high: the current regex-based analysis/strip/inject approach is already dev-cost heavy, and the AST-based alternative is no cheaper.
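For reference, a minimal sketch of what the AST-based route could look like, assuming recast's `parse`/`visit`/`print` API (the function and variable names below are illustrative, not jison internals):

```js
// Illustrative sketch: parse an action code chunk into an AST, inspect it,
// and print it back -- instead of stripping/injecting code with regexes.
const recast = require('recast');

function analyzeActionChunk(actionCode) {
    const ast = recast.parse(actionCode);
    const usedIdentifiers = new Set();

    recast.visit(ast, {
        visitIdentifier(path) {
            usedIdentifiers.add(path.node.name);
            this.traverse(path);
        }
    });

    // recast.print() preserves the original formatting where it can.
    return { code: recast.print(ast).code, usedIdentifiers };
}
```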

@GerHobbelt
Owner Author

Closing this one. It was just one direction I was looking into; the recast/...whatever... work is being done (slowly) and this issue is now just a hint.

GerHobbelt added a commit that referenced this issue Nov 7, 2020
…code to (temporarily) turn the jison generated source code into 'regular JavaScript' so we can pull it through standard babel or similar tools. (The previous attempt was to enhance the babel tokenizer and have the jison identifiers processed that way, but given the structure of babel, it meant tracking a slew of large packages, which turned out way too costly. So we revert to this 'Unicode hack', which employs the JavaScript specification about which Unicode characters are *legal in a JavaScript identifier*.)

TODO: Should write a blog/article about this.

Here are the comments from the horse's mouth:

---

Determine which Unicode NonAsciiIdentifierStart characters
are unused in the given source code and provide a mapping array
from the given (JISON) start/end identifier character sequences
to these.

The purpose of this routine is to deliver a reversible
transform from JISON to plain JavaScript for any action
code chunks.

This is the basic building block which helps us convert
jison variables such as `$id`, `$3`, `$-1` ('negative index' reference),
`@id`, `#id`, `#TOK#` to variable names which can be
parsed by a regular JavaScript parser such as esprima or babylon.

```
function generateMapper4JisonGrammarIdentifiers(input) { ... }
```

IMPORTANT: we only want the single-character (BMP) Unicode code points in here
so we can do this transformation at the 'Char' (UTF-16 code unit) level rather than the 'Code' (code point) level.

```
const IdentifierStart = unicode4IdStart.filter((e) => e.codePointAt(0) < 0xFFFF);
```

As we will be 'encoding' the Jison special characters @ and # into the ID_Start Unicode
range to make JavaScript parsers *not* barf a hairball on Jison action code chunks, we
must consider a few things while doing that:

We CAN use an escape system where we replace a single character with multiple characters,
as JavaScript DOES NOT discern between single characters and multi-character strings: anything
between quotes is a string and there's no such thing as C/C++/C#'s `'c'` vs `"c"` which is
*character* 'c' vs *string* 'c'.

As we can safely escape characters, all we need to do is find a character (or set of characters)
which is in the ID_Start range, is expected to be used rarely, and is clearly identifiable
by humans for ease of debugging of the escaped intermediate values.

The escape scheme is simple and borrowed from ancient serial communication protocols and
the JavaScript string spec alike:

- assume the escape character is A
- then if the original input stream includes an A, we output AA
- if the original input includes a character #, which must be escaped, it is encoded/output as A

This is the same way the backslash escape in JavaScript strings works, and it has a minor issue:
runs of A's with an odd length (e.g. AAA) CAN occur in the output, which might be a little hard to read.
Those are, however, easily machine-decodable, and that's what's most important here.
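In code, the single-escape-character scheme described above boils down to something like this (an illustrative sketch, not the actual jison implementation; `ESC` and `MARKER` are placeholders for the chosen escape character and the Jison marker being hidden):

```js
// Sketch of the single escape character scheme: a literal ESC in the input is
// doubled, while the marker to hide (e.g. '#') is rewritten to a single ESC.
// Decoding reverses this unambiguously.
const ESC = 'A';        // placeholder escape character
const MARKER = '#';     // placeholder Jison marker to hide from the JS parser

function encode(input) {
    return input
        .split(ESC).join(ESC + ESC)     // A -> AA
        .split(MARKER).join(ESC);       // # -> A
}

function decode(encoded) {
    // Walk the string: a doubled ESC is a literal ESC, a lone ESC is the marker.
    let out = '';
    for (let i = 0; i < encoded.length; i++) {
        if (encoded[i] === ESC) {
            if (encoded[i + 1] === ESC) { out += ESC; i++; }
            else { out += MARKER; }
        } else {
            out += encoded[i];
        }
    }
    return out;
}

// encode('A#B') === 'AAAB' -- note the odd-length run of A's mentioned above.
```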

To help with that AAA... issue AND because we need to escape multiple Jison markers, we choose
a slightly tweaked approach: we are going to use a set of 2-char wide escape codes, where the
first character is fixed and the second character is chosen such that the escape code
DOES NOT occur in the original input -- unless someone intentionally feeds nasty input
to the encoder, as we will pick the 2 characters in the escape from 2 utterly different *human languages*:

- the first character is ဩ which is highly visible and allows us to quickly search through a
  source to see if and where there are *any* Jison escapes.
- the second character is taken from the Unicode CANADIAN SYLLABICS range (0x1400-0x1670) as far as
  those are part of ID_Start (0x1401-0x166C or thereabouts) and, unless an attack is attempted on jison,
  we can be pretty sure that this 2-character sequence won't ever occur in real life: even when one
  writes such an escape in the comments to document this system, e.g. 'ဩᐅ', there are still plenty
  of alternatives for the second character left.
- the second character represents the escape type: $-n, $#, #n, @n, #ID#, etc. and each type will
  pick a different base shape from that CANADIAN SYLLABICS charset.
- note that the trailing '#' in Jison's '#TOKEN#' escape will be escaped as a different code to
  signal '#' as a token terminator there.
- meanwhile, only the initial character in the escape needs to be escaped if encountered in the
  original text: ဩ -> ဩဩ, as the 2nd and 3rd characters are only there to *augment* the escape.
  Any CANADIAN SYLLABICS in the original input don't need escaping, as these only have special meaning
  when prefixed with ဩ.
- if the ဩ character is used often in the text, the alternative ℹ இ ண ஐ Ϟ ല ઊ characters MAY be considered
  for the initial escape code, hence we start by analyzing the entire source input to see which
  escapes we'll come up with this time (a sketch of that selection step follows below).
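A sketch of that selection step (hypothetical helper name; the actual logic lives in `generateMapper4JisonGrammarIdentifiers`): prefer a candidate escape character that never occurs in the source, and otherwise take the least-used one, since occurrences of the chosen character must be doubled in the output.

```js
// Hypothetical sketch: choose the initial escape character for this source.
const ESCAPE_CANDIDATES = ['ဩ', 'ℹ', 'இ', 'ண', 'ஐ', 'Ϟ', 'ല', 'ઊ'];

function pickEscapeChar(sourceCode) {
    let best = ESCAPE_CANDIDATES[0];
    let bestCount = Infinity;
    for (const ch of ESCAPE_CANDIDATES) {
        const count = sourceCode.split(ch).length - 1;
        if (count === 0) {
            return ch;              // never occurs: ideal pick
        }
        if (count < bestCount) {
            bestCount = count;
            best = ch;
        }
    }
    return best;                    // least-used candidate; it will be doubled
}
```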

The basic shapes are:

| Range (hex) | Base shape | Escape type |
|-------------|------------|-------------|
| 1401-141B   | ᐁ          | 1           |
| 142F-1448   | ᐯ          | 2           |
| 144C-1465   | ᑌ          | 3           |
| 146B-1482   | ᑫ          | 4           |
| 1489-14A0   | ᒉ          | 5           |
| 14A3-14BA   | ᒣ          | 6           |
| 14C0-14CF   | ᓀ          |             |
| 14D3-14E9   | ᓓ          | 7           |
| 14ED-1504   | ᓭ          | 8           |
| 1510-1524   | ᔐ          | 9           |
| 1526-153D   | ᔦ          |             |
| 1542-154F   | ᕂ          |             |
| 1553-155C   | ᕓ          |             |
| 155E-1569   | ᕞ          |             |
| 15B8-15C3   | ᖸ          |             |
| 15DC-15ED   | ᗜ          | 10          |
| 15F5-1600   | ᗵ          |             |
| 1614-1621   | ᘔ          |             |
| 1622-162D   | ᘢ          |             |
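One way such a base shape row could be used (a hypothetical sketch; the real mapper in `generateMapper4JisonGrammarIdentifiers` may pick differently): for a given escape type, walk its range and take the first code point that does not occur in the source, so the resulting 2-character escape is guaranteed not to clash.

```js
// Hypothetical sketch: pick the second escape character for one escape type
// from its CANADIAN SYLLABICS base shape range, preferring a code point that
// never occurs in the source input.
function pickSecondEscapeChar(sourceCode, rangeStart, rangeEnd) {
    for (let cp = rangeStart; cp <= rangeEnd; cp++) {
        const ch = String.fromCodePoint(cp);
        if (!sourceCode.includes(ch)) {
            return ch;
        }
    }
    // Extremely unlikely: every code point of this shape occurs in the input.
    return String.fromCodePoint(rangeStart);
}

// e.g. escape type 1 uses the 0x1401-0x141B (ᐁ) shape:
// const escType1 = pickEscapeChar(src) + pickSecondEscapeChar(src, 0x1401, 0x141B);
```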

## JISON identifier formats ##

- direct symbol references, e.g. `#NUMBER#` when there's a `%token NUMBER` for your grammar.
  These represent the token ID number.

  -> (1+2) start-# + end-#

- alias/token value references, e.g. `$token`, `$2`

  -> $ is an accepted starter, so no encoding required

- alias/token location reference, e.g. `@token`, `@2`

  -> (6) single-@

- alias/token id numbers, e.g. `#token`, `#2`

  -> (3) single-#

- alias/token stack indexes, e.g. `##token`, `##2`

  -> (4) double-#

- result value reference `$$`

  -> $ is an accepted starter, so no encoding required

- result location reference `@$`

  -> (6) single-@

- rule id number `#$`

  -> (3) single-#

- result stack index `##$`

  -> (4) double-#

- 'negative index' value references, e.g. `$-2`

  -> (8) single-negative-$

- 'negative index' location reference, e.g. `@-2`

  -> (7) single-negative-@

- 'negative index' stack indexes, e.g. `##-2`

  -> (5) double-negative-#
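Putting the list above together, the encoding pass amounts to a set of ordered replacements, roughly along these lines (a hypothetical sketch; the real mapper builds the escape strings from the chosen Canadian Syllabics shapes rather than taking them as a callback):

```js
// Hypothetical sketch of the encoding pass for the identifier formats above.
// `esc(n)` is assumed to return the 2-character escape for type n, i.e. the
// picked initial character plus the picked shape character for that type.
function encodeJisonIdentifiers(actionCode, esc) {
    return actionCode
        // order matters: the longest / most specific markers go first
        .replace(/#([A-Za-z_][A-Za-z0-9_]*)#/g, (m, id) => esc(1) + id + esc(2)) // #TOKEN#
        .replace(/##(-\d+)/g, (m, n) => esc(5) + n.slice(1))                     // ##-2
        .replace(/##(\$|[A-Za-z0-9_]+)/g, (m, id) => esc(4) + id)                // ##token, ##$
        .replace(/@(-\d+)/g, (m, n) => esc(7) + n.slice(1))                      // @-2
        .replace(/@(\$|[A-Za-z0-9_]+)/g, (m, id) => esc(6) + id)                 // @token, @$
        .replace(/\$(-\d+)/g, (m, n) => esc(8) + n.slice(1))                     // $-2
        .replace(/#(\$|[A-Za-z0-9_]+)/g, (m, id) => esc(3) + id);                // #token, #$
        // $token and $$ are already legal JavaScript starters: no encoding needed.
}
```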