have look at magic-strings by way of optimize-js for the jison parser kernel code stripper #2
Comments
Stop silencing useful information.
Since 0.6.0-191: the lexer has started to use recast/prettier, though there is still no lexer-level (action) code analyzer in place. I'll be testing these libraries (and optimize-js + magic-string) further, as I haven't found a clear winner yet. AST traversal and code flow analysis are obviously the way forward to produce leaner (and thus faster) code, but the dev cost is pretty high compared to the current regex-based analysis/strip/inject approach: I find the current method dev-cost heavy, but so is the AST-based alternative.
Closing this one. It was just one direction I was looking in; the recast/...whatever... work is being done (slowly) and this is now just a hint.
…code to (temporarily) turn the jison generated source code into 'regular javascript' so we can pull it through standard babel or similar tools.

(The previous attempt was to enhance the babel tokenizer and have the jison identifiers processed that way, but given the structure of babel, it meant tracking a slew of large packages, which turned out way too costly.)

So we revert to this 'Unicode hack', which employs the JavaScript specification about which Unicode characters are *legal in a JavaScript identifier*.

TODO: Should write a blog/article about this.

Here are the comments from the horse's mouth:

---

Determine which Unicode NonAsciiIdentifierStart characters are unused in the given source code and provide a mapping array from the given (JISON) start/end identifier character sequences to these.

The purpose of this routine is to deliver a reversible transform from JISON to plain JavaScript for any action code chunks.

This is the basic building block which helps us convert jison variables such as `$id`, `$3`, `$-1` ('negative index' reference), `@id`, `#id`, `#TOK#` to variable names which can be parsed by a regular JavaScript parser such as esprima or babylon.

```
function generateMapper4JisonGrammarIdentifiers(input) { ... }
```

IMPORTANT: we only want the single char Unicodes in here so we can do this transformation at 'Char'-word rather than 'Code'-codepoint level.

```
const IdentifierStart = unicode4IdStart.filter((e) => e.codePointAt(0) < 0xFFFF);
```

As we will be 'encoding' the Jison special characters @ and # into the ID_Start Unicode range to make JavaScript parsers *not* barf a hairball on Jison action code chunks, we must consider a few things while doing that:

We CAN use an escape system where we replace a single character with multiple characters, as JavaScript DOES NOT discern between single characters and multi-character strings: anything between quotes is a string and there's no such thing as C/C++/C#'s `'c'` vs `"c"`, which is *character* 'c' vs *string* 'c'.

As we can safely escape characters, all we need to do is find a character (or set of characters) which is in the ID_Start range and is expected to be used rarely, while being clearly identifiable by humans for ease of debugging of the escaped intermediate values.

The escape scheme is simple and borrowed from ancient serial communication protocols and the JavaScript string spec alike:

- assume the escape character is A
- then if the original input stream includes an A, we output AA
- if the original input includes a character #, which must be escaped, it is encoded/output as A

This is the same as the way the backslash escape in JavaScript strings works and has a minor issue: sequences of AAA with an odd number of A's CAN occur in the output, which might be a little hard to read. Those are, however, easily machine-decodable and that's what's most important here.

To help with that AAA... issue AND because we need to escape multiple Jison markers, we choose a slightly tweaked approach: we are going to use a set of 2-char wide escape codes, where the first character is fixed and the second character is chosen such that the escape code DOES NOT occur in the original input -- unless someone would have intentionally fed nasty input to the encoder, as we will pick the 2 characters in the escape from 2 utterly different *human languages*:

- the first character is ဩ, which is highly visible and allows us to quickly search through a source to see if and where there are *any* Jison escapes.
- the second character is taken from the Unicode CANADIAN SYLLABICS range (0x1400-0x1670) as far as those are part of ID_Start (0x1401-0x166C or there-abouts) and, unless an attack is attempted at jison, we can be pretty sure that this 2-character sequence won't ever occur in real life: even when one writes such an escape in the comments to document this system, e.g. 'ဩᐅ', there are still plenty of alternatives left for the second character.
- the second character represents the escape type: $-n, $#, #n, @n, #ID#, etc., and each type will pick a different base shape from that CANADIAN SYLLABICS charset.
- note that the trailing '#' in Jison's '#TOKEN#' escape will be escaped as a different code to signal '#' as a token terminator there.
- meanwhile, only the initial character in the escape needs to be escaped if encountered in the original text: ဩ -> ဩဩ, as the 2nd and 3rd character are only there to *augment* the escape. Any CANADIAN SYLLABICS in the original input don't need escaping, as these only have special meaning when prefixed with ဩ.
- if the ဩ character is used often in the text, the alternative ℹ இ ண ஐ Ϟ ല ઊ characters MAY be considered for the initial escape code, hence we start with analyzing the entire source input to see which escapes we'll come up with this time.
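
The two-character escape scheme described above can be sketched roughly as follows. This is a minimal illustration, not jison's actual implementation: the names `buildEscapes`, `encode` and `decode` are hypothetical, only the single-`@` and single-`#` markers are handled (negative indexes, `##` and the trailing `#` of `#TOKEN#` are left out), the two syllabics ranges are assumed to be the ones used for those markers, and the marker regexes assume ASCII names and are applied to the raw text without regard for strings or comments.

```javascript
// Hypothetical sketch of the 2-char escape scheme; not jison's actual code.

// Candidate first characters, in order of preference (all single UTF-16 units).
const FIRST_CHAR_CANDIDATES = ['ဩ', 'ℹ', 'இ', 'ண', 'ஐ', 'Ϟ', 'ല', 'ઊ'];

// Pick a second character from [lo, hi] such that `first + second`
// does not occur anywhere in the input.
function pickSecondChar(input, first, lo, hi) {
  for (let cp = lo; cp <= hi; cp++) {
    const ch = String.fromCharCode(cp);
    if (!input.includes(first + ch)) return ch;
  }
  throw new Error('no unused escape code available in this range');
}

// Analyze the whole input first, then decide which escapes to use.
function buildEscapes(input) {
  const first =
    FIRST_CHAR_CANDIDATES.find((c) => !input.includes(c)) || FIRST_CHAR_CANDIDATES[0];
  return {
    first,
    at: first + pickSecondChar(input, first, 0x14a3, 0x14ba),   // assumed range for '@'
    hash: first + pickSecondChar(input, first, 0x144c, 0x1465), // assumed range for '#'
  };
}

function encode(input, esc) {
  return input
    .split(esc.first).join(esc.first + esc.first) // a literal escape char is doubled
    .replace(/@(?=[A-Za-z_$\d])/g, esc.at)        // '@' directly before a name, digit or '$'
    .replace(/#(?=[A-Za-z_$\d])/g, esc.hash);     // '#' directly before a name, digit or '$'
}

function decode(encoded, esc) {
  const table = {
    [esc.first + esc.first]: esc.first,
    [esc.at]: '@',
    [esc.hash]: '#',
  };
  // One left-to-right pass keeps doubled escape chars and real escapes apart.
  const re = new RegExp(Object.keys(table).join('|'), 'g');
  return encoded.replace(re, (m) => table[m]);
}
```

With `const esc = buildEscapes(src)`, the intent is that `decode(encode(src, esc), esc) === src` holds, and that the encoded text no longer contains an `@` or `#` directly in front of an identifier, so an ordinary JavaScript parser can read it.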
The basic shapes are:

- 1401-141B: ᐁ 1
- 142F-1448: ᐯ 2
- 144C-1465: ᑌ 3
- 146B-1482: ᑫ 4
- 1489-14A0: ᒉ 5
- 14A3-14BA: ᒣ 6
- 14C0-14CF: ᓀ
- 14D3-14E9: ᓓ 7
- 14ED-1504: ᓭ 8
- 1510-1524: ᔐ 9
- 1526-153D: ᔦ
- 1542-154F: ᕂ
- 1553-155C: ᕓ
- 155E-1569: ᕞ
- 15B8-15C3: ᖸ
- 15DC-15ED: ᗜ 10
- 15F5-1600: ᗵ
- 1614-1621: ᘔ
- 1622-162D: ᘢ

## JISON identifier formats ##

- direct symbol references, e.g. `#NUMBER#` when there's a `%token NUMBER` for your grammar. These represent the token ID number.

  -> (1+2) start-# + end-#

- alias/token value references, e.g. `$token`, `$2`

  -> $ is an accepted starter, so no encoding required

- alias/token location reference, e.g. `@token`, `@2`

  -> (6) single-@

- alias/token id numbers, e.g. `#token`, `#2`

  -> (3) single-#

- alias/token stack indexes, e.g. `##token`, `##2`

  -> (4) double-#

- result value reference `$$`

  -> $ is an accepted starter, so no encoding required

- result location reference `@$`

  -> (6) single-@

- rule id number `#$`

  -> (3) single-#

- result stack index `##$`

  -> (4) double-#

- 'negative index' value references, e.g. `$-2`

  -> (8) single-negative-$

- 'negative index' location reference, e.g. `@-2`

  -> (7) single-negative-@

- 'negative index' stack indexes, e.g. `##-2`

  -> (5) double-negative-#
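
To make the list of formats above concrete, here is a rough classifier that maps a single reference string to the encoding class named in that list. It is only an illustration: `classifyJisonReference` and the `JISON_FORMATS` table are hypothetical names, the regexes assume ASCII-only names (an assumption; jison may accept more), and the `shape` values simply copy the parenthesized numbers from the list, which appear to refer to the numbered basic shapes.

```javascript
// Hypothetical classifier for the jison identifier formats listed above.
// NAME is an assumed, ASCII-only approximation of a jison alias/token name.
const NAME = '[A-Za-z_]\\w*';
const REF = `(?:\\$|\\d+|${NAME})`; // '$' (result), a numeric index, or a name

const JISON_FORMATS = [
  { kind: 'start-# + end-#',      shape: '1+2', re: new RegExp(`^#${NAME}#$`) },
  { kind: 'double-negative-#',    shape: 5,     re: /^##-\d+$/ },
  { kind: 'double-#',             shape: 4,     re: new RegExp(`^##${REF}$`) },
  { kind: 'single-#',             shape: 3,     re: new RegExp(`^#${REF}$`) },
  { kind: 'single-negative-@',    shape: 7,     re: /^@-\d+$/ },
  { kind: 'single-@',             shape: 6,     re: new RegExp(`^@${REF}$`) },
  { kind: 'single-negative-$',    shape: 8,     re: /^\$-\d+$/ },
  // '$' is already legal in a JavaScript identifier, so nothing to encode.
  { kind: 'no encoding required', shape: null,  re: new RegExp(`^\\$${REF}$`) },
];

// Returns { kind, shape } for a recognized reference, or null otherwise.
function classifyJisonReference(ref) {
  const hit = JISON_FORMATS.find((f) => f.re.test(ref));
  return hit ? { kind: hit.kind, shape: hit.shape } : null;
}
```

For example, `classifyJisonReference('@2')` yields the `single-@` class and `classifyJisonReference('##-2')` yields `double-negative-#`, while a plain identifier such as `foo` yields `null`.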
For inspiration: