This repository has been archived by the owner on Sep 2, 2023. It is now read-only.

Tooling: Brainstorming ideas that can lead to efficient loader-oriented designs #203

Closed
SMotaal opened this issue Oct 17, 2018 · 93 comments
Labels
brainstorming Safe place to discuss ideas and provide constructive feedback cjs discussion interoperability

Comments

@SMotaal

SMotaal commented Oct 17, 2018

Having both ecmascript-modules and @jkrems hackable loader has opened up tremendous scope for experimentation.

Note: This thread does not make claims for or against existing tooling, some of which have stood the test of time, evolved, and are fixtures of the ecosystem. The intent is simply to consider different perspectives being explored in experimental efforts.

As far as things go, the broad range of tooling that applies to loaders basically iterates over productions in each source, irrespective of the specifics of implementation or operations.

Most tools are designed to be used for much more complex applications than merely loading. To that effect, they often avoid the use of new language features that would prevent them from working on older platforms. They can also avoid new features which may have been prematurely associated with inefficiencies in early stages. Some are also built with infrastructures or features that are not ideal or not optimized specifically for loading, like using workers, verbose error checking (ie as a language service)... etc.

I would like to dedicate this thread to brainstorming experimental or just different ideas to implement related patterns for loader-first designs.

Brainstorming: A safe place to discuss ideas and provide constructive feedback

How to contribute

Please avoid emoting that can be confusing (especially if it can be construed as passive-aggressive)

😄 Indication
👍 To indicate a "Yes" response
👎 To indicate a "No" response
🎉 To indicate an "Aha" moment

Read the Digest

The following is a set of ideas or conclusions curated from the discussions:

Syntax Detection (CJS vs ESM)

  • Safely using RegExp — @SMotaal

    • requires guarding against string hijacking — @devsnek
    • recommend using acorn instead — @devsnek
  • Fallback for ESM without import and export — @targos

    • shouldn't use import(…) to resolve ambiguity — @bmeck
    • can use import.meta — @bmeck
  • Dual parsing a module was deemed inefficient — @MylesBorins

Syntax Identification (CJS vs ESM)

  • Mime type meta data via something like webpackage — @jkrems

  • Magic bytes — @jkrems

Wrapping CJS in an ESM module system

@SMotaal
Author

SMotaal commented Oct 17, 2018

ECMAScript modules syntax can arguably be detected using a RegExp which bails on first match.

Does anyone have ideas for cjs vs esm syntax detection?

@SMotaal SMotaal changed the title Tooling: Using new language features to design efficient loader extensions Tooling: Using new language features to design efficient loader-first extensions Oct 17, 2018
@devsnek
Member

devsnek commented Oct 17, 2018

@SMotaal you can't use regexp to parse js grammar (you can always make a pattern of string literals or whatever to confuse the regexp) and the differences between valid cjs and valid esm are ambiguous and can't be reliably detected by just looking at the code.

@SMotaal SMotaal added the brainstorming Safe place to discuss ideas and provide constructive feedback label Oct 17, 2018
@SMotaal
Author

SMotaal commented Oct 17, 2018

you can always make a pattern of string literals or whatever to confuse the regexp

So, can we constructively say that so long as you guard against string hijacking (maybe there is a better term for this), only then can you safely use RegExp?
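A minimal sketch of the hijacking concern, assuming a naive line-anchored pattern (the pattern and both sample sources are made up for illustration). The second source is a valid CommonJS file whose template literal contains a line that looks like an import statement:

```javascript
// Naive detector: any line that starts with `import` or `export`.
const naiveEsmPattern = /^\s*(?:import|export)\b/m;

// A real module: the match is correct.
const realModule = 'import fs from "fs";\nexport default fs;';

// A CommonJS file whose template literal contains an import-looking line.
const hijacked = [
  'const text = `',
  'import nothing from "nowhere";',
  '`;',
  'module.exports = text;',
].join('\n');

const matchesRealModule = naiveEsmPattern.test(realModule);
// False positive: the "import" is inside a string, not a statement.
const matchesHijacked = naiveEsmPattern.test(hijacked);
```

Both tests report a match, so the pattern alone cannot tell the two files apart. Any RegExp-based approach would first have to skip over strings, template literals, and comments.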

@devsnek
Member

devsnek commented Oct 17, 2018

@SMotaal I would just use acorn

@SMotaal
Author

SMotaal commented Oct 17, 2018

@devsnek humor me in this effort, consider this both an idea-gathering as well as a team-building exercise. Acorn is obviously a great solution, but I am trying to create opportunities for people to talk about the aspects that make this and others such great tools. The notion here is that people might just have some evolving ideas that they might want to bounce around. How we connect the dots, like you pointing out the hijacking limitation, can potentially inspire untapped solutions to existing problems.

Sounds fair?

@targos
Member

targos commented Oct 17, 2018

@SMotaal You could say that a file with import or export syntax is probably an ES Module (the syntax is invalid in Script mode). However, the problem is that files without import and export could be either Script or Module, and depending on how they are written, could have different behaviour in Script vs Module mode.

For example:

test = 42;

In Script mode, this creates the property test on the global object.
In Module mode, this throws a ReferenceError.
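The difference can be reproduced without a module loader. This is a sketch that uses a strict function body as a stand-in for Module mode (module code is always strict); the global names are invented for the demo:

```javascript
// Sloppy function body: assigning to an undeclared name creates a global property.
const runSloppy = new Function(
  'issue203Sloppy = 42; return globalThis.issue203Sloppy;'
);

// Strict function body, standing in for Module mode.
const runStrict = new Function('"use strict"; issue203Strict = 42;');

const sloppyValue = runSloppy();
delete globalThis.issue203Sloppy; // clean up the leaked global

let strictError = null;
try {
  runStrict();
} catch (error) {
  strictError = error; // a ReferenceError: the name was never declared
}
```

The same source text therefore succeeds or fails depending only on the mode it is evaluated in, which is why a wrong guess is observable.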

@benjamingr
Member

@targos does the issue get any better if we say that such a loader always imports CJS in strict mode regardless of an explicit "use strict"?

@devsnek
Member

devsnek commented Oct 17, 2018

are we trying to come up with use cases for loaders or something else?

if you're using a resolve loader hook you'll always be able to read the contents of whatever you're resolving, at which point you can regex or acorn or whatever it as you see fit.

@targos
Member

targos commented Oct 17, 2018

I'm having trouble seeing the relation between "cjs vs esm syntax detection" and the OP. Maybe I don't really understand what this thread is about, sorry.

@SMotaal
Author

SMotaal commented Oct 17, 2018

@targos Actually, I think you are hitting the nail with pointing out that:

without import and export [a file] could be either Script or Module

So would it be possible to say that when dealing with ambiguous code, syntax-based detection is possible for ECMAScript Modules (ie having those explicit syntaxes import and export) as long as there is a mechanism to fallback on when those features are not present.

Sounds right?

@devsnek
Member

devsnek commented Oct 17, 2018

@SMotaal you could always fall back to your own opinions of what the file should be but its impossible to know the author's intent.

i agree with targos that i have no idea what this thread is for.

@SMotaal
Author

SMotaal commented Oct 17, 2018

@devsnek The ideas you are all expressing here are extremely valuable, they allow others to actually learn or at least consider a different perspective. It also makes it easier for people to be able to better appreciate and understand intent in future discussions. I think that the biggest problem is not that people disagree, this is actually not bad, but more so that sometimes we tend to agree but end up arguing in two separate directions due to miscommunication and misunderstanding.

@SMotaal
Author

SMotaal commented Oct 17, 2018

@benjamingr I might be mistaken, but I believe that it is possible to evaluate non-strict code. While I am not certain how --experimental-modules handles it, I believe that if the wrapper function expression is evaluated in a non-strict context, it will only be strict if "use strict" is in the body of the wrapped module. I played around with this a bit when experimenting with realms, which is stage 2 and still actively being updated.

@SMotaal
Author

SMotaal commented Oct 17, 2018

I'm having trouble seeing the relation between "cjs vs esm syntax detection" and the OP. Maybe I don't really understand what this thread is about, sorry.

@targos Until we actually figure out how the Modules WG will handle source ambiguity, it can be helpful to explore (maybe even POC) the various ways to achieve it. Thinking of any of those ideas as either core vs extensions is premature, but that should not discourage efforts of reasoning about it and trying to find ways to refine them irrespective of where those aspects end up.

@MylesBorins
Contributor

MylesBorins commented Oct 17, 2018 via email

@SMotaal SMotaal changed the title Tooling: Using new language features to design efficient loader-first extensions Tooling: Brainstorming ideas that can lead to efficient loader-oriented designs Oct 17, 2018
@devsnek
Member

devsnek commented Oct 17, 2018

@SMotaal so we're discussing how the default loader should handle source ambiguity?

@SMotaal
Author

SMotaal commented Oct 17, 2018

@MylesBorins I think it may be important to know more about the dual parsing approach. I state this not to suggest that dual parsing is or is not a solution, but rather to see if a different parsing approach may be something worth exploring.

From my own research (which I know is limited relative to that of other folks in this space), I often find the common pattern of tokenizing into ASTs, which in many cases seems to be an eagerly contiguous process. That makes sense for many things, especially for transforms. In contrast, loader-first tokenization (AST or not) may be more efficient if it bails out on the first conclusively deterministic feature, and more so if it is possible to have a non-binary intent which would allow a single scan to be used.

Can you shed some light on the methodology? (maybe a link to follow-up)

@SMotaal
Author

SMotaal commented Oct 17, 2018

@SMotaal so we're discussing how the default loader should handle source ambiguity?

@devsnek I think of this as a parallel discussion altogether, not intended to directly affect other discussions that deal with the specifics of the default loader... etc. That said, there is no harm if we end up drawing some conclusions that positively influence our process in general.

@bmeck
Member

bmeck commented Oct 17, 2018

@MylesBorins the inefficiency is tolerable as @jdalton shows with a top level parse, which would be much faster though if v8 directly supported such mechanisms. However, as the language increases, there are a few concerns:

Some heuristics may fail/be unreliable as features get added to different modes:

  • Some Modules may only use import(), which is available in both goals. What should we do with this?
  • Currently import.meta is only in Module, but certainly could be proposed to come to Script. If it gets added to Script would that mean that a Source Text would change from a Module to Script because the language added a feature to Script?
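A rough way to see which forms are accepted outside the Module goal is to ask the engine to parse them as a function body. This is only an approximation of the Script goal (a function body also allows `return`, for example), and `parsesAsFunctionBody` is a made-up helper name:

```javascript
// Returns true when the engine accepts `source` as a (sloppy) function body.
// new Function only parses here; the body is never executed.
function parsesAsFunctionBody(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // e.g. code generation disabled by policy
  }
}

const results = {
  dynamicImport: parsesAsFunctionBody('import("./x.mjs");'),
  staticImport: parsesAsFunctionBody('import x from "./x.mjs";'),
  staticExport: parsesAsFunctionBody('export const a = 1;'),
  importMeta: parsesAsFunctionBody('import.meta.url;'),
};
```

In current engines only `dynamicImport` should come back `true`, which matches the first bullet: `import()` by itself says nothing about the goal. The `importMeta` result is the one that could flip if the language ever allowed `import.meta` in Script.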

We could probably think of more as we desire, but the idea of what to do in ambiguous cases seems a bit beyond scope of tooling itself, these would need to be definitive answers that we can provide a direct answer to as they come up.

@SMotaal Regarding the question of how the current loader loads non-strict code: It uses multiple Source Texts, it does not create a single string that has both Module and CJS code. You cannot inline a sloppy source text into ESM without using Function which would not have direct access to local variables and they would need to be passed in. Ideally, we could avoid using Function, since it means parsing the same string twice and conflicts with people wishing to prevent JS-based codegen for security reasons (see things like CSP or v8's SetAllowCodeGenerationFromStringsCallback), but someone may think of a reason why it would be useful to keep.
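The point about Function and local variables can be sketched as follows. `evaluateSloppy` and `demo` are hypothetical names, and this is not how the Node loader is implemented; it only shows that bindings have to be passed in as parameters:

```javascript
// Evaluate a sloppy source text with explicitly passed bindings.
// Functions created with `new Function` only see globals and their own
// parameters, never the locals of the code that created them.
function evaluateSloppy(sourceText, bindings) {
  const names = Object.keys(bindings);
  const wrapper = new Function(...names, sourceText);
  return wrapper(...names.map((name) => bindings[name]));
}

function demo() {
  const hiddenLocal203 = 'only visible inside demo()';
  return {
    // The local is not reachable from the generated function...
    withoutPassing: new Function('return typeof hiddenLocal203;')(),
    // ...unless it is handed over as a parameter.
    withPassing: evaluateSloppy('return typeof hiddenLocal203;', { hiddenLocal203 }),
  };
}

const visibility = demo();
```

This has the same shape as a CommonJS-style wrapper that receives `exports`, `require`, and `module` as arguments, and it is also the kind of string-based code generation that CSP-like policies are meant to block.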

@jkrems
Contributor

jkrems commented Oct 17, 2018

Things that are possible:

  1. Mime type meta data via something like webpackage - e.g. moving away from files on disk for dependencies. This should also allow for faster loading because you're not trying to load millions of tiny files individually.
  2. Magic bytes. This works for WASM, to a lesser degree for JSON, not really for JS (script vs. CJS vs. module).
  3. A magical bridge protocol, query/hash param, or some other import-site mechanism. Downside: This introduces n bugs for n imports of the same file.
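To illustrate the magic-bytes option from the list above, here is a small sketch. The WASM preamble (`\0asm`) is well defined; the JSON check is only a weak guess, and nothing comparable exists for script vs. CJS vs. module. `sniffFormat` and its return values are invented for the example:

```javascript
const WASM_MAGIC = [0x00, 0x61, 0x73, 0x6d]; // "\0asm"
const WHITESPACE = [0x20, 0x09, 0x0a, 0x0d];

// Guess a format from leading bytes; returns null when nothing can be said.
function sniffFormat(bytes) {
  if (WASM_MAGIC.every((byte, index) => bytes[index] === byte)) {
    return 'application/wasm';
  }
  const first = bytes.find((byte) => !WHITESPACE.includes(byte));
  // "{" or "[" might start JSON, but "{" can also open a JS block.
  if (first === 0x7b || first === 0x5b) return 'maybe-json';
  return null;
}

const encoder = new TextEncoder();
const wasmGuess = sniffFormat(new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]));
const jsonGuess = sniffFormat(encoder.encode('  {"name": "pkg"}'));
const jsGuess = sniffFormat(encoder.encode('module.exports = 1;'));
```

JavaScript sources fall through to `null`, which is the limitation noted in item 2.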

Agreed with above - there's things that are really hard to figure out and "run CJS with implicit strict" might work for app code but not for dependencies. We tried. It breaks with things like:

if (cond) {
  /* [...] */

  function myHelper(el) { /* [...] */ }
  someArr.forEach(myHelper);
}

The above will throw in strict mode IIRC and this pattern does appear in real (popular) npm modules.
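A sketch of how block-level function declarations differ between modes in ES2015+ engines, again using function bodies in place of whole files. As far as I can tell, the exact snippet above runs in both modes in such engines because the helper is used inside its own block; the difference appears when the function is referenced outside the block:

```javascript
// Sloppy mode: web-compat semantics (Annex B) also expose the function
// outside the block once the block has been evaluated.
const sloppyProbe = new Function(`
  if (true) {
    function myHelper() { return 'ok'; }
  }
  return typeof myHelper;
`);

// Strict mode: the declaration is scoped to the block only.
const strictProbe = new Function(`
  'use strict';
  if (true) {
    function myHelper() { return 'ok'; }
  }
  return typeof myHelper;
`);

const sloppyVisibility = sloppyProbe(); // 'function'
const strictVisibility = strictProbe(); // 'undefined'
```

A dependency that relies on the sloppy behaviour would hit a ReferenceError at the first call outside the block if it were forced into strict mode.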

@SMotaal
Author

SMotaal commented Oct 17, 2018

It uses multiple Source Texts, it does not create a single string that has both Module and CJS code. You cannot inline a sloppy source text into ESM without using Function which would not have direct access to local variables and they would need to be passed in.

@bmeck Absolutely… I was inspired by this approach in the early days of --experimental-modules and found a lot of uses for it beyond CJS in a more general sense.

@benjamingr does that align with the concerns you raised?

@GeoffreyBooth
Member

GeoffreyBooth commented Oct 17, 2018

@SMotaal if you search this repo for “unambiguous grammar” or “unambiguous syntax” you'll find lots of discussion on this topic.

The webpackage idea from @jkrems does give me one idea though: what if an import statement of a file always imports as ESM, and it's only importing of packages where importing of CommonJS is possible? The package.json is a metadata file about the package, capable of holding properties like module parse goal. It's much more capable as a metadata repository than a file extension is. And if someone wants to import a loose CommonJS file into an ESM module, well, we built createRequireFromUrl for that.

@jkrems
Contributor

jkrems commented Oct 17, 2018

Side note: I also dislike that a single --loader array means there's now an order dependence for when exactly which loader needs to be passed. In a world where there's phases, they could be passed in any order.

@bmeck
Member

bmeck commented Oct 17, 2018

@jkrems

Side note: I also dislike that a single --loader array means there's now an order dependence for when exactly which loader needs to be passed. In a world where there's phases, they could be passed in any order.

The load order is still required even with phases: if one phase loader always guesses the type to be text/javascript and doesn't properly delegate to another that would guess it to be application/wasm, flipping the order of those loaders would still mean a change in behavior. Phases do not fix load ordering, we must rely on users to properly configure things.
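The ordering concern can be sketched with a toy chain. The hook shape `(url, next)` is an assumption made for this example and is not the actual loader API:

```javascript
// Compose type-guessing hooks; each hook may answer or delegate via next().
function chain(loaders) {
  return function guessType(url) {
    const run = (index) =>
      index < loaders.length ? loaders[index](url, () => run(index + 1)) : null;
    return run(0);
  };
}

// Never delegates: claims everything is JavaScript.
const alwaysJs = (url, next) => 'text/javascript';

// Delegates whenever it has no opinion.
const wasmByExtension = (url, next) =>
  url.endsWith('.wasm') ? 'application/wasm' : next();

const wasmFirst = chain([wasmByExtension, alwaysJs])('file:///lib/add.wasm');
const jsFirst = chain([alwaysJs, wasmByExtension])('file:///lib/add.wasm');
```

`wasmFirst` is `application/wasm` while `jsFirst` is `text/javascript`: the same two loaders give different answers depending only on their order.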

Yes, but the question was about adding support for WASM. Or HTML. Or binary AST. Or anything else that isn't a simple compilation into an equivalent ESM source text. The resource fetch can get the data but that isn't the actually interesting bit for those. The interesting bit is taking the data and turning it into something that can be linked into the module graph (in the above: an init hook).

I don't understand how this relates, like I said, any supported format that Node can link into a graph works. This is unrelated and doesn't need a separate phase. Even with an init phase like you propose, if ESM linking cannot directly integrate with a WASM module because the host doesn't provide a way, you still must create a facade in your proposal.

Also, the webpackage example should start with http://some-url-in-the-package. In your example - what would the webpackage loader receive?

I don't understand this. Webpackage could support file: last I saw, I can update the strings to have file: in them I guess in the example.

@jkrems
Contributor

jkrems commented Oct 17, 2018

I don't understand this. Webpackage could support file: last I saw, I can update the strings to have file: in them I guess in the example.

But why would it be limited to file: URLs? Especially since those would risk conflicting with real on-file URLs. A portable webpackage provided by a registry should either use HTTPS URLs (that could even resolve potentially) or a custom scheme. Reusing file: would mean that you'd end up hitting the disk for every file first and worst case even load something.

@jkrems
Contributor

jkrems commented Oct 17, 2018

Phases do not fix load ordering, we must rely on users to properly configure things.

They do fix ordering for unrelated concerns, like fetching a resource and actually interpreting it.

any supported format that Node can link into a graph works.

So the disagreement is if init should be exposed or not, not if it is a separate phase. Because if init is hard-coded to a well-known list of supported module types, then it's still there, just not configurable.

@bmeck
Member

bmeck commented Oct 18, 2018

So the disagreement is if init should be exposed or not, not if it is a separate phase. Because if init is hard-coded to a well-known list of supported module types, then it's still there, just not configurable.

I'm saying init doesn't make sense as you are explaining it, you can't make the VM accept new unknown Module types into the graph. Same way, you can't just make new Module types work in Node. There is always a minimal set. CoffeeScript modules could compile to WASM or JS, it doesn't matter, but we can't suddenly make V8 accept something like JVM bytecode and have it act as a Module without turning it into a supported Module type.

@bmeck
Member

bmeck commented Oct 18, 2018

But why would it be limited to file: URLs? Especially since those would risk conflicting with real on-file URLs. A portable webpackage provided by a registry should either use HTTPS URLs (that could even resolve potentially) or a custom scheme. Reusing file: would mean that you'd end up hitting the disk for every file first and worst case even load something.

It isn't? It accepts the full specifier and id of the module loading some dependency, I would expect there to be no constraints except that the id should be unique, and the specifier is a string.

@jkrems
Contributor

jkrems commented Oct 19, 2018

It isn't?

Ah, I misread your example code. My bad.

but we can't suddenly make V8 accept something like JVM bytecode and have it act as a Module without turning it into a supported Module type.

Yes, but turning it into a supported Module type doesn't necessarily mean turning it into a supported module type source code. E.g. for WASM (or for the JVM bytecode example actually), you would realistically analyze/compile the resource content first to determine the interface, then generate a facade, and then expose the compilation result inside of the module. Trying to inline the original bytes in the source text and then recompiling on execution would be fairly inefficient and in some cases not practical. The only alternative I can think of is globals and unique ids but that's not really a proper solution.
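A sketch of the "analyze first, then generate a facade" flow for WASM, using the standard WebAssembly JS API. The bytes are a tiny hand-assembled module exporting `add(i32, i32)`; `facadeSource` and the registry key are hypothetical, and the generated text deliberately uses the globals-and-ids workaround mentioned above to show what it looks like:

```javascript
// Minimal module: (func (export "add") (param i32 i32) (result i32) ...)
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section
  0x03, 0x02, 0x01, 0x00, // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // code
]);

// Step 1: compile once and read the interface without instantiating.
const wasmModule = new WebAssembly.Module(wasmBytes);
const exportNames = WebAssembly.Module.exports(wasmModule).map((entry) => entry.name);

// Step 2: generate facade source text (assumes names are valid identifiers).
function facadeSource(names, registryKey) {
  const key = JSON.stringify(registryKey);
  return names
    .map((name) => `export const ${name} = globalThis[${key}].${name};`)
    .join('\n');
}

// Step 3: the compiled result is reused; nothing is recompiled from inlined bytes.
const wasmInstance = new WebAssembly.Instance(wasmModule, {});
const facadeText = facadeSource(exportNames, 'wasm:add#1');
```

Here `facadeText` is `export const add = globalThis["wasm:add#1"].add;`. The awkward part is exactly the global registry lookup, which is what an exposed initialization step would avoid.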

For me CoffeeScript isn't the target I'd want to optimize for. If what you're loading can easily be converted into self-contained JS code on the fly, it might just as well have been compiled ahead of time. The same isn't true for things that do not compile to JS and have different execution semantics. One example would be importing a DLL.

@bmeck
Member

bmeck commented Oct 19, 2018

Yes, but turning it into a supported Module type doesn't necessarily mean turning it into a supported module type source code. E.g. for WASM (or for the JVM bytecode example actually), you would realistically analyze/compile the resource content first to determine the interface, then generate a facade, and then expose the compilation result inside of the module. Trying to inline the original bytes in the source text and then recompiling on execution would be fairly inefficient and in some cases not practical.

We already have an example that doesn't do inline based transformation, currently our CJS translator is creating a separate module record and just loading in the CJS without inlining it. It doesn't recompile on execution at all currently.

The only alternative I can think of is globals and unique ids but that's not really a proper solution.

Modules will need unique ids anyway in order to ensure the (module, specifier) pair is unique. I'm not sure how any other solution could be "proper" since without unique ids that makes the pair unable to correctly have a 1-1 relationship with an import.

For me CoffeeScript isn't the target I'd want to optimize for. If what you're loading can easily be converted into self-contained JS code on the fly, it might just as well have been compiled ahead of time. The same isn't true for things that do not compile to JS and have different execution semantics. One example would be importing a DLL.

It isn't just CoffeeScript that does JS compilation; historically code coverage has done this (no longer!!!), eslint certainly could be useful to enforce at boot time, development runs without having 2 commands for build vs run, etc.

It certainly isn't the only thing we should optimize for, but it is part of it. If the concern is mostly around avoiding duplicate parse/eval phases that is something we can design around, but I don't see how init solves this in any new way.

@jkrems
Contributor

jkrems commented Oct 19, 2018

init allows us to officially support initializing a module using "real" APIs. E.g. module.setLazyDynamicExports(exportLists, getExports) or whatever the final API could look like. With loader hooks that can just return bytes, this will always be somewhat awkward and indirect. Afaik our CJS translator isn't implemented as a loader hook that just spits out bytes..?

@bmeck
Member

bmeck commented Oct 19, 2018

@jkrems

Afaik our CJS translator isn't implemented as a loader hook that just spits out bytes..?

Correct. It currently doesn't, but if we wanted to we could rewrite it to do so. I'm not sure if that information is for or against anything given that.

init allows us to officially support initializing a module using "real" APIs. E.g. module.setLazyDynamicExports(exportLists, getExports) or whatever the final API could look like.

If you follow the Realms proposal there are fewer JS APIs being considered and most interactions for things are being moved to be purely string based. I'm not sure what APIs are being talked about here.

@SMotaal
Author

SMotaal commented Oct 23, 2018

As I catch up on this thread, I am appreciating how everyone tries to follow a more brainstorming approach to allow everyone to pose ideas to see how they materialize (or not) later on.

I think this type of discussion helps people with very diverse backgrounds, experiences, and extents of familiarity with the intricacies of ESM and CJS to mutually share and gain insights that are sometimes missed during goal-oriented debates.

@SMotaal
Author

SMotaal commented Oct 23, 2018

On the idea of top-level parsing to disambiguate JS sources. I took some time to put together an experiment to roughly demonstrate the relative costs associated with different parsing strategies.

The gist of it is that a parser would bail out at the first occurrence of a particular syntax, and would otherwise parse through the entire file length, using as few grammars as possible for a safe parse. The current experiment does not bail out, it simply identifies escapable entities that can be used for hijacking, contextualizing symbols, and the set of keywords that would satisfy the condition.

I added new parsing modes to the experimental parser "esm", "cjs" and "esx". In "esm", the parser will operate in strictly top-level and only look for the keywords import, export, from, as (for completeness). In "cjs", it will parse deep and look for keywords module and exports (though they are really not keywords, still working on that). In "esx", it will parse deep and also look for the combined set of keywords, with the intent to consider a single differential parse versus multiple binary parses. The "es" mode is an incomplete mode intended for full source analysis.

The demo page is served from http://smotaal.github.io/experimental/markup/markup using the ordered parametric notation: #[url]![mode]*[replicates]**[iterations]. If mode is omitted, it is inferred from content-type. If iterations are specified and ≥ 1, a separate loop will run the tokenizer on the same code without rendering it (average time of loops will be shown, for sampling purposes). Replicates are not needed for this demo; if ≥ 2, the source text is repeated, so it will parse and render that many repeats of the original text as a single source text. However, if you are working with really large sources (like babel), try *0**[iterations] to eliminate rendering overhead, which can crash some browsers.
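One possible reading of that notation, as a sketch. The regular expression, the `parseDemoHash` name, and the example.com URL are assumptions; the demo page's real parsing may differ:

```javascript
// Parse "#[url]![mode]*[replicates]**[iterations]"; every part after the
// url is optional, and missing parts are reported as null.
function parseDemoHash(hash) {
  const match = /^#?(.*?)(?:!([\w-]+))?(?:\*(\d+))?(?:\*\*(\d+))?$/.exec(hash);
  if (match === null) return null;
  const [, url, mode, replicates, iterations] = match;
  return {
    url,
    mode: mode ?? null, // null means "infer from content-type"
    replicates: replicates === undefined ? null : Number(replicates),
    iterations: iterations === undefined ? null : Number(iterations),
  };
}

const full = parseDemoHash('#https://example.com/acorn.mjs!esm*0**10');
const timingOnly = parseDemoHash('#https://example.com/acorn.js**5');
```

With this reading, `full` has mode `esm`, zero replicates (no rendering), and ten timing iterations, while `timingOnly` leaves the mode to be inferred.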

Demo: acorn.mjs

Demo: acorn.js

Note: This experimental code works in the latest Chrome, Safari, and Firefox Nightly with varying performance. If you try this on a slower device, use a smaller source and change the ** iterations value as needed. All parsing happens in the main thread.

Obviously this does not address disambiguation of ambiguous source texts. If relative performance gains can be further improved or optimized, then disambiguation (loader or not) by source text is something that will likely be favoured by some down the road.

@GeoffreyBooth
Member

@SMotaal That’s a great start for something that I can see as a loader. For your CommonJS detection I would add a check to look for globally-referenced require.

Perhaps it would be good to start compiling a list somewhere of things that people might want to see as loaders. Besides this case, off the top of my head there’s transpilers, automatic completion of file extensions/folder root files, configuration of module loading behavior based on file extension, and general backward compatibility to bridge the gap between what will be possible in ESM in Node and what is/was possible/allowed in Babel and other transpiled/built versions of ESM.

@SMotaal
Author

SMotaal commented Oct 25, 2018

@GeoffreyBooth obviously my efforts are geared towards complementing any potential implementations for loaders once the design process matures, but for this particular experiment, I decided to isolate from any such efforts and instead tried to focus on some proxy problems. In my effort, thinking of syntax highlighting was a great way to visually solve parsing challenges, and parsing in the main thread without dependencies was a great way to address performance issues, adhering to generators was a great way to force a stream-like approach... the list goes on.

Regardless, it was just perfectly timed to use it to demo relative performance gains compared to the more common "AST all the things, then do one small thing" approach, which would be rather expensive for esm vs cjs (or my proposed esx) detection in my best estimate (but still needs real-world benchmarks).

@SMotaal
Author

SMotaal commented Oct 25, 2018

ESX parsing currently scans the full length of source text, but the intent is to actually keep reference of enclosing ranges and not analyze them unless there are no signals of ESM syntax on top-level, then finally scan enclosures to find the first cjs hint or not, this makes it possible to report ESM, CJS, or still ambiguous so use the default based on out-of-band settings... etc.
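The three-way outcome can be sketched roughly like this. It is not the experimental parser: it masks comments and string-like literals with a single RegExp, ignores regex literals and nested template interpolation, and every name in it is made up for illustration:

```javascript
// Comments, quoted strings, and template literals (no nesting support).
const LITERALS =
  /\/\*[\s\S]*?\*\/|\/\/[^\n]*|'(?:\\[\s\S]|[^'\\\n])*'|"(?:\\[\s\S]|[^"\\\n])*"|`(?:\\[\s\S]|[^`\\])*`/g;

// Blank out literal contents (keeping newlines) so they cannot hijack the scan.
const mask = (source) =>
  source.replace(LITERALS, (literal) => literal.replace(/[^\n]/g, ' '));

// Keep only top-level text; enclosures are blanked except their outer brackets.
function topLevelOnly(masked) {
  let depth = 0;
  let out = '';
  for (const ch of masked) {
    if ('{(['.includes(ch)) {
      out += depth === 0 ? ch : ' ';
      depth++;
    } else if ('})]'.includes(ch)) {
      depth = Math.max(0, depth - 1);
      out += depth === 0 ? ch : ' ';
    } else {
      out += depth === 0 || ch === '\n' ? ch : ' ';
    }
  }
  return out;
}

// Static import/export at statement start; import(...) and import.meta excluded.
const ESM_HINT = /(?:^|[;\n}])\s*(?:import\b(?!\s*[(.])|export\b)/;
// CJS hints anywhere, including inside enclosures.
const CJS_HINT = /(?<![.$\w])(?:require\s*\(|module\s*\.\s*exports\b|exports\s*[.\[])/;

function classify(source) {
  const masked = mask(source);
  if (ESM_HINT.test(topLevelOnly(masked))) return 'esm';
  if (CJS_HINT.test(masked)) return 'cjs';
  return 'ambiguous'; // caller falls back to out-of-band settings
}
```

For example, `classify('import fs from "fs";')` gives `'esm'`, a file using `require(...)` gives `'cjs'`, and both `test = 42;` and a file that only calls `import()` stay `'ambiguous'`.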

@ljharb
Member

ljharb commented Oct 25, 2018

If we have out of band settings, and that info conflicts with a parsed result, I’d expect it to throw - the two shouldn’t be in disagreement.
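That rule could be sketched as below, assuming a detector that reports `'esm'`, `'cjs'`, or `'ambiguous'` and an optional declared format from out-of-band settings (both the function and the value names are hypothetical):

```javascript
// Combine a parsed result with an out-of-band declaration.
// Agreement or ambiguity defers to the declaration; disagreement throws.
function resolveFormat(detected, declared) {
  if (declared == null) return detected;
  if (detected === 'ambiguous' || detected === declared) return declared;
  throw new TypeError(
    `Source looks like ${detected} but was declared as ${declared}`
  );
}

const agreed = resolveFormat('esm', 'esm');           // 'esm'
const defaulted = resolveFormat('ambiguous', 'cjs');  // 'cjs'

let conflict = null;
try {
  resolveFormat('esm', 'cjs');
} catch (error) {
  conflict = error; // the two sources of truth disagree
}
```

Throwing early keeps a misconfigured package from being silently evaluated in the wrong mode.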

@SMotaal
Author

SMotaal commented Oct 26, 2018

That’s actually a very important aspect, because in my rushed vision of eliminating parsing errors, which are handled normally by the runtime, I have not given thought to certain errors that belong specifically to the intent at hand.

More insights of this kind can go a long way down the road when making decisions. Awesome 🙂

@SMotaal
Author

SMotaal commented Oct 26, 2018

When considering the case of parsing, I was having trouble mentally placing the metadata communicated between two loaders for instance.

In this case, it is in-band (imo) but it is not "directly" from source, it is inferred and attributed to the source text, and is triggered (or bypassed) and responds to out-of-band (one-to-many) and out-of-source (one-to-one) aspects or settings.

Can I propose the following complementary pairs: (examples in brackets)

  1. "out-of-band" — setting that trickles down to one or more resolved specifiers (flag, ext, mime…)
  2. "in-band" — settings determined from resolved source features (pragma, this parse…).
  1. "from-source" — settings declared in the source text (pragma, shebang…)
  2. "out-of-source" — settings inferred or attributed to a source text (some in- and out-of-band)

Can anyone find a more practical breakdown of such information regarding a source text's journey?

These are all crude thoughts; they need magic from the group. I feel that the distinction between what maps to sources and what is specific to a source but not baked right into the body is essential.

@SMotaal
Author

SMotaal commented Oct 28, 2018

I finally updated the README and pushed the revisions made last week. Timing is more accurate now. I also converted the rendering pipeline to async APIs. Tokenization APIs remain sync but use generators so they yield and return as needed. I improved the modes for esm, cjs, esx, and added the missing alias es for the regular javascript syntax mode.

I am really interested to hear some feedback on the three modes (esm, cjs, esx) with various sources, especially if you find a source that breaks or chokes in one of those modes.

@devsnek
Member

devsnek commented Oct 28, 2018

@SMotaal its cool i guess? i don't really understand why we have an issue open for it though.

@SMotaal
Author

SMotaal commented Oct 28, 2018

@devsnek This thread is about ideas in general separate from implementation. As we move closer to loaders and defaults, those discussions and demos can be helpful, at the very least, they can serve as a reference for those who need to find more about them.

@SMotaal
Author

SMotaal commented Nov 5, 2018

@jdalton Can you pitch in on the idea of syntax detection relating to top-level parse? I tried to find a way to model this to the benefit of everyone in the group and was able to show a 200% (theoretical) increase in performance relative to parsing the full ES grammar with the same method, as an AST-based approach would.

This was done avoiding the conventional all-or-nothing AST approach, using half-way optimized RegExps addressing usual concerns like hijacking.

Ideas like dual-parsing (@MylesBorins) and your top-level parse (@bmeck) made me think of a single-parse limited to the minimal subset of both grammars and it was roughly capped at 175% depending on nested complexity but on average better than 150%.

Since we're trying to find the first clue to determine syntax, the expectation is that such clues will often materialize early on in a text, making it reasonable to bail or delay the rest of the parsing (if at all needed).
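The bail-out idea can be sketched with a generator, so tokens are produced only as long as the consumer keeps asking. This toy tokenizer ignores strings and comments on purpose (it only demonstrates the early exit), and `tokens` and `firstClue` are made-up names:

```javascript
// Lazily yield identifier-like words and single punctuation characters.
function* tokens(source) {
  const pattern = /[A-Za-z_$][\w$]*|\S/g;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    yield { text: match[0], index: match.index };
  }
}

// Stop at the first decisive token and report how much work was done.
function firstClue(source, clues = new Set(['import', 'export'])) {
  let scanned = 0;
  for (const token of tokens(source)) {
    scanned++;
    if (clues.has(token.text)) return { clue: token.text, scanned };
  }
  return { clue: null, scanned };
}

const filler = 'x = x + 1;\n'.repeat(1000);
const early = firstClue('import a from "b";\n' + filler);
const never = firstClue(filler);
```

`early` stops after a single token regardless of file size, while `never` has to walk every token, which is the worst case the out-of-band default would have to cover.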

Can we hash out pseudo code for syntax determination based on your initial thoughts on top-level parse?


About this thread…

I'm trying to brainstorm ideas parallel to our implementation efforts that make it possible for our broadly diverse members to appreciate the various technical challenges associated with decisions we are making.

This is based on an early digest of this discussion, which I took the liberty to summarize at the top. I tried to pick ideas which seemed to create rifts in discussions elsewhere, mainly on the topics of syntax detection and interoperability.

@GeoffreyBooth
Member

@SMotaal This is impressive . . . just to understand what you’ve done here, is your goal to determine parse goal by analyzing the syntax? A.k.a. a real implementation of the “unambiguous syntax”/grammar that we’ve been discussing?

If so, and assuming that you find an algorithm that works, have you thought about how to address the related concerns listed in #150 (comment)?

@devsnek
Member

devsnek commented Nov 5, 2018

to be clear, it's just a lightweight way of parsing js. this doesn't make the ambiguity go away.

@ljharb
Member

ljharb commented Nov 5, 2018

Confirmed; there does not exist any approach based on parsing that is unambiguous in all cases, absent a language spec change.

@SMotaal
Author

SMotaal commented Nov 5, 2018

Yeah, while I would love to be the one that can solve ambiguity of source text and other sources, this is really nothing more than a very modest effort to model different parsing methods separate from the usual tools.

My gut feeling tells me that while implementing solutions is best served by employing tried and tested tools, coming up with optimal solutions may not always share in those benefits. So in other words, AST's have a way about them that force looking at problems in certain ways, so modeling the problem without is a way to avoid restricting ourselves to the givens of using them.

So this is far from a solution, just an attempt to provide a way to explore solutions, and the bottom line holds, ambiguity is ultimately a source problem, and if it is, then the only way to resolve it is out of band.

@SMotaal
Author

SMotaal commented Nov 6, 2018

@devsnek the underlying motivation behind my markup experiment in general is not restricted to JS, in fact, I was interested to find different ways for efficient and responsive multi-syntax parsing without the pitfalls of conventional methods. And on that, I think I am ready to dare make the claim that it can be done with virtually no switching overhead, using less popular features like generators and regexps: html (and script tags)

9 participants