Tooling: Brainstorming ideas that can lead to efficient loader-oriented designs #203
Comments
ECMAScript modules syntax can arguably be detected using a RegExp which bails on the first match. Does anyone have ideas for CJS vs ESM syntax detection? |
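For concreteness, a minimal sketch of the kind of bail-on-first-match check being described; the pattern and its weaknesses are illustrative only, not a proposal:

```js
// Rough sketch: match a line-leading `import`/`export` keyword and stop at the
// first hit. It does not understand strings or comments, which is exactly the
// "hijacking" weakness raised below.
const ESM_HINT = /^\s*(?:import[\s('"]|export\s)/m;

function looksLikeESM(source) {
  return ESM_HINT.test(source); // bails on the first match
}

looksLikeESM('export default 42;');                 // true
looksLikeESM('module.exports = 42;');               // false
looksLikeESM('const s = `\nimport x from "y"`;');   // true: fooled by a template literal
```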
@SMotaal you can't use regexp to parse js grammar (you can always make a pattern of string literals or whatever to confuse the regexp) and the differences between valid cjs and valid esm are ambiguous and can't be reliably detected by just looking at the code. |
So, can we constructively say that so long as you guard against string hijacking (maybe there is a better term for this), you can safely use a RegExp? |
@SMotaal I would just use acorn |
@devsnek humor me in this effort, consider this both an idea-gathering as well as a team-building exercise. Acorn is obviously a great solution, but I am trying to create opportunities for people to talk about the aspects that make this and other such great tools. The notion here is that people might just have some evolving ideas that they might want to bounce around. How we connect the dots, like you pointing out the hijacking limitation, can potentially inspire untapped solutions to existing problems. Sounds fair? |
@SMotaal You could say that a file without `import` or `export` can be valid under both parse goals yet behave differently. For example: `test = 42;` In Script mode, this creates the property `test` on the global object; in Module mode (which is always strict), it throws a ReferenceError. |
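To make that ambiguity concrete, a minimal illustration of the point above:

```js
// The same source text parses under both goals but behaves differently.
test = 42;
// Script (sloppy) goal: creates a `test` property on the global object.
// Module goal (always strict): throws ReferenceError: test is not defined.
```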
@targos does the issue get any better if we say that such a loader always imports CJS in strict mode regardless of an explicit "use strict"? |
are we trying to come up with use cases for loaders or something else? if you're using a resolve loader hook you'll always be able to read the contents of whatever you're resolving, at which point you can run a regex or acorn or whatever over it as you see fit. |
I'm having trouble seeing the relation between "CJS vs ESM syntax detection" and the OP. Maybe I don't really understand what this thread is about, sorry. |
@targos Actually, I think you are hitting the nail on the head by pointing out that the same source can be valid under both parse goals yet behave differently.
So would it be possible to say that when dealing with ambiguous code, syntax-based detection is possible for ECMAScript Modules (i.e. those having the explicit syntaxes like `import` and `export`)? Sounds right? |
@SMotaal you could always fall back to your own opinions of what the file should be but it's impossible to know the author's intent. i agree with targos that i have no idea what this thread is for. |
@devsnek The ideas you are all expressing here are extremely valuable; they allow others to actually learn, or at least consider a different perspective. It also makes it easier for people to better appreciate and understand intent in future discussions. I think the biggest problem is not that people disagree (that is actually not bad), but that sometimes we largely agree yet end up arguing in two separate directions due to miscommunication and misunderstanding. |
@benjamingr I might be mistaken, but I believe that it is possible to evaluate non-strict code, though I am not certain how the current loader handles this. |
@targos Until we actually figure out how the Modules WG will handle source ambiguity, it can be helpful to explore (maybe even POC) the various ways to achieve it. Thinking of any of those ideas as either core vs extensions is premature, but that should not discourage efforts of reasoning about it and trying to find ways to refine them irrespective of where those aspects end up. |
Try to keep in mind performance when exploring options for handling ambiguity, as well as the fact that this space has been explored extensively during the prior EPS process. Dual parsing a module was deemed inefficient.
|
@SMotaal so we're discussing how the default loader should handle source ambiguity? |
@MylesBorins I think it may be important to know more about the dual parsing approach. I state this not to suggest that dual parsing is or is not a solution, but rather to see if a different parsing approach may be worth exploring. From my own research (which I know is relatively limited compared to other folks in this space), I often find the common pattern of tokenizing into ASTs, which in many cases seems to be an eagerly contiguous process; this makes sense for many things, especially for transforms. In contrast, loader-first tokenization (AST or not) may be more efficient if it bails out on the first conclusively deterministic feature, and more so if it is possible to have a non-binary intent which would allow a single scan to be used. Can you shed some light on the methodology? (maybe a link to follow up) |
@devsnek I think of this as a parallel discussion altogether, not intended to directly affect other discussions that deal with the specifics of the default loader... etc. That said, there is no harm if we end up drawing some conclusions that positively influence our process in general. |
@MylesBorins the inefficiency is tolerable, as @jdalton shows with a top-level parse, which would be much faster if V8 directly supported such mechanisms. However, as the language grows, there are a few concerns. Some heuristics may fail or be unreliable as features get added to the different modes:
We could probably think of more as we desire, but the question of what to do in ambiguous cases seems a bit beyond the scope of tooling itself; these would need definitive answers that we can point to as they come up. @SMotaal Per the question about how the current loader loads non-strict code: it uses multiple Source Texts; it does not create a single string that has both Module and CJS code. You cannot inline a sloppy source text into ESM without using something like `eval`. |
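A toy illustration of why inlining cannot work (my own example, not how the actual CJS translator is written): ESM source text is always strict, so sloppy-only constructs that are legal in a real CJS file break once pasted into a module.

```js
// Valid in a sloppy CJS file:
//
//   with (Math) { module.exports = PI; }
//
// Inlined into an ESM wrapper, the same body fails at parse time, because
// `with` is a SyntaxError in strict mode; the facade has to be rewritten
// (or the CJS kept as its own source text) rather than inlined.
export default (function () {
  // with (Math) { return PI; }   // SyntaxError if uncommented
  return Math.PI;
})();
```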
Things that are possible:
Agreed with above - there's things that are really hard to figure out and "run CJS with implicit strict" might work for app code but not for dependencies. We tried. It breaks with things like:

```js
if (cond) {
  /* [...] */
  function myHelper(el) { /* [...] */ }
  someArr.forEach(myHelper);
}
```

The above will throw in strict mode IIRC and this pattern does appear in real (popular) npm modules. |
@bmeck Absolutely… I was inspired by this approach in the early days of @benjamingr does that align with the concerns you raised? |
@SMotaal if you search this repo for “unambiguous grammar” or “unambiguous syntax” you'll find lots of discussion on this topic. The webpackage idea from @jkrems does give me one idea though: what if an |
Side note: I also dislike that a single |
The load order is still required even with phases, if one phase loader always guesses the type to be
I don't understand how this relates, like I said, any supported format that Node can link into a graph works. This is unrelated and doesn't need a separate phase. Even with an
I don't understand this. Webpackage could support |
But why would it be limited to |
They do fix ordering for unrelated concerns, like fetching a resource and actually interpreting it.
So the disagreement is if |
I'm saying |
It isn't? It accepts the full specifier and id of the module loading some dependency, I would expect there to be no constraints except that the id should be unique, and the specifier is a string. |
Ah, I misread your example code. My bad.
Yes, but turning it into a supported Module type doesn't necessarily mean turning it into source text of a supported module type. E.g. for WASM (or for the JVM bytecode example, actually), you would realistically analyze/compile the resource content first to determine the interface, then generate a facade, and then expose the compilation result inside of the module. Trying to inline the original bytes in the source text and then recompiling on execution would be fairly inefficient and in some cases not practical. The only alternative I can think of is globals and unique ids, but that's not really a proper solution. For me CoffeeScript isn't the target I'd want to optimize for. If what you're loading can easily be converted into self-contained JS code on the fly, it might just as well have been compiled ahead of time. The same isn't true for things that do not compile to JS and have different execution semantics. One example would be importing a DLL. |
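For the WASM case, the "analyze first, then generate a facade" flow might look roughly like the sketch below; the function name and overall shape are assumptions for illustration, not any agreed-on loader API.

```js
// Compile the bytes once, read the export interface, and emit a tiny ESM
// facade per discovered export, instead of inlining the bytes as JS source.
import { readFileSync } from 'fs';

function describeWasm(path) {
  const bytes = readFileSync(path);
  const compiled = new WebAssembly.Module(bytes);                  // up-front compile
  const names = WebAssembly.Module.exports(compiled).map((e) => e.name);

  // The facade only declares the bindings; a loader would wire them up to the
  // instantiated exports when the module record is evaluated.
  const facadeSource = names.map((name) => `export let ${name};`).join('\n');

  return { compiled, facadeSource };
}
```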
We already have an example that doesn't do inline-based transformation: currently our CJS translator creates a separate module record and just loads in the CJS without inlining it. It doesn't recompile on execution at all currently.
Modules will need unique ids anyway in order to ensure the
It isn't just CoffeeScript that does JS compilation; historically code coverage has done this (no longer!!!), eslint certainly could be useful to enforce at boot time, development runs without having 2 commands for build vs run, etc. It certainly isn't the only thing we should optimize for, but it is part of it. If the concern is mostly around avoiding duplicate parse/eval phases that is something we can design around, but I don't see how |
|
Correct. It currently doesn't, but if we wanted to we could rewrite it to do so. I'm not sure if that information is for or against anything given that.
If you follow the Realms proposal there are fewer JS APIs being considered and most interactions for things are being moved to be purely string based. I'm not sure what APIs are being talked about here. |
As I catch up on this thread, I am appreciating how everyone tries to follow a more brainstorming approach to allow everyone to pose ideas to see how they materialize (or not) later on. I think this type of discussion helps people with very diverse backgrounds, experiences, and extents of familiarity with the intricacies of ESM and CJS to mutually share and gain insights that are sometimes missed during goal-oriented debates. |
On the idea of top-level parsing to disambiguate JS sources: I took some time to put together an experiment to roughly demonstrate the relative costs associated with different parsing strategies. The gist of it is that a parser would bail out at the first occurrence of a particular syntax, and otherwise parse through the entire file, using as few grammars as possible for a safe parse. The current experiment does not bail out; it simply identifies escapable entities that can be used for hijacking, contextualizing symbols, and the set of keywords that would satisfy the condition. I added new parsing modes to the experimental parser: "esm", "cjs" and "esx". In "esm", the parser operates strictly at the top level and only looks for the `import` and `export` keywords. The demo page is served from the links below.
Demo: acorn.mjs
Demo: acorn.js
Obviously this does not address disambiguation of ambiguous source texts. If relative performance gains can be further improved or optimized, then disambiguation by source text (loader or not) will likely be favoured by some down the road. |
@SMotaal That’s a great start for something that I can see as a loader. For your CommonJS detection I would add a check to look for globally-referenced CommonJS variables.
Perhaps it would be good to start compiling a list somewhere of things that people might want to see as loaders. Besides this case, off the top of my head there are transpilers, automatic completion of file extensions/folder root files, configuration of module loading behavior based on file extension, and general backward compatibility to bridge the gap between what will be possible in ESM in Node and what is/was possible/allowed in Babel and other transpiled/built versions of ESM. |
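A rough complement to the ESM check above, assuming the globals in question are the CJS wrapper names (`require`, `module`, `exports`); it shares the same string/comment caveats as any RegExp approach:

```js
// Flag references to the CommonJS wrapper "globals"; word boundaries only,
// so `export default` does not trip the `exports.` case.
const CJS_HINT = /\brequire\s*\(|\bmodule\.exports\b|\bexports\.[A-Za-z_$]/;

function looksLikeCJS(source) {
  return CJS_HINT.test(source);
}

looksLikeCJS('module.exports = {};');   // true
looksLikeCJS('export default 42;');     // false
```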
@GeoffreyBooth obviously my efforts are geared towards complementing any potential loader implementations once the design process matures, but for this particular experiment I decided to isolate it from any such efforts and instead focus on some proxy problems. In my effort, thinking of syntax highlighting was a great way to visually solve parsing challenges, parsing in the main thread without dependencies was a great way to address performance issues, and adhering to generators was a great way to force a stream-like approach... the list goes on. Regardless, it was just perfectly timed to use it to demo relative performance gains compared to the more common "AST all the things, then do one small thing" approach, which would be rather expensive for esm vs cjs (or my proposed esx) detection in my best estimate (but still needs real-world benchmarks). |
ESX parsing currently scans the full length of the source text, but the intent is to keep references to enclosure ranges and not analyze them unless there are no signals of ESM syntax at the top level, then finally scan the enclosures for a first CJS hint (or none). This makes it possible to report ESM, CJS, or still-ambiguous, and in the last case fall back to a default based on out-of-band settings... etc. |
If we have out of band settings, and that info conflicts with a parsed result, I’d expect it to throw - the two shouldn’t be in disagreement. |
That’s actually a very important aspect: in my rushed vision of eliminating parsing errors (which are normally handled by the runtime), I had not given thought to certain errors that belong specifically to the intent at hand. More insights of this kind can go a long way down the road when making decisions. Awesome 🙂 |
This comment has been minimized.
This comment has been minimized.
When considering the case of parsing, I was having trouble mentally placing the metadata communicated between two loaders for instance. In this case, it is in-band (imo) but it is not "directly" from source, it is inferred and attributed to the source text, and is triggered (or bypassed) and responds to out-of-band (one-to-many) and out-of-source (one-to-one) aspects or settings. Can I propose the following complementary pairs: (examples in brackets)
Can anyone find a more practical breakdown of such information regarding a source text's journey? These are all crude thoughts; they need magic from the group. I feel that distinguishing between what maps to sources and what is specific to a source but not baked right into its body is essential. |
I finally updated the README and pushed the revisions made last week. Timing is more accurate now. I also converted the rendering pipeline to async APIs. Tokenization APIs remain sync but use generators so they yield and return as needed. I also improved the parsing modes. I am really interested to hear some feedback on the three modes ("esm", "cjs", and "esx"). |
@SMotaal its cool i guess? i don't really understand why we have an issue open for it though. |
@devsnek This thread is about ideas in general, separate from implementation. As we move closer to loaders and defaults, those discussions and demos can be helpful; at the very least, they can serve as a reference for those who need to find out more about them. |
@jdalton Can you pitch in on the idea of syntax detection relating to top-level parsing? I tried to find a way to model this to the benefit of everyone in the group and was able to show a 200% (theoretical) performance gain relative to full ES grammar parsing of the kind ASTs require. This was done while avoiding the conventional all-or-nothing AST approach, using half-way optimized RegExps that address the usual concerns like hijacking. Ideas like dual parsing (@MylesBorins) and your top-level parse (@bmeck) made me think of a single parse limited to the minimal subset of both grammars; it was roughly capped at 175% depending on nesting complexity but on average better than 150%. Since we're trying to find the first clue that determines the syntax, the expectation is that such clues will often materialize early in a text, making it reasonable to bail or delay the rest of the parsing (if needed at all). Can we hash out pseudo-code for syntax determination based on your initial thoughts on top-level parsing? About this thread… I'm trying to brainstorm ideas in parallel with our implementation efforts that make it possible for our broadly diverse members to appreciate the various technical challenges associated with the decisions we are making. Based on an early digest of this discussion (which I took the liberty of summarizing at the top), I tried to pick ideas which seemed to create rifts in discussions elsewhere, mainly on the topics of syntax detection and interoperability. |
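As a starting point for that pseudo-code, a rough sketch of a single-pass, top-level-only scan (an illustration, not @jdalton's actual approach): it skips strings, comments, and nested braces so that only top-level keywords count, and bails at the first conclusive signal. A real implementation needs more care around regex literals, template interpolation, and identifiers after `.`.

```js
function detectGoal(source) {
  let i = 0;
  let depth = 0;
  const n = source.length;
  while (i < n) {
    const ch = source[i];
    if (ch === '"' || ch === "'" || ch === '`') {            // skip strings/templates
      const quote = ch;
      for (i += 1; i < n && source[i] !== quote; i += 1) {
        if (source[i] === '\\') i += 1;                       // skip escaped characters
      }
      i += 1;
      continue;
    }
    if (ch === '/' && source[i + 1] === '/') {                // line comment
      while (i < n && source[i] !== '\n') i += 1;
      continue;
    }
    if (ch === '/' && source[i + 1] === '*') {                // block comment
      i = source.indexOf('*/', i + 2);
      if (i === -1) return 'ambiguous';
      i += 2;
      continue;
    }
    if (ch === '{') { depth += 1; i += 1; continue; }
    if (ch === '}') { depth -= 1; i += 1; continue; }
    if (depth === 0 && /[A-Za-z_$]/.test(ch)) {               // read a top-level word
      let j = i;
      while (j < n && /[A-Za-z0-9_$]/.test(source[j])) j += 1;
      const word = source.slice(i, j);
      if (word === 'import' || word === 'export') return 'esm'; // conclusive: bail
      if (word === 'require') return 'cjs';                      // heuristic only
      i = j;
      continue;
    }
    i += 1;
  }
  return 'ambiguous';                                          // defer to out-of-band settings
}
```

The non-binary result ('esm', 'cjs', or 'ambiguous') is what would let a single scan defer ambiguous sources to out-of-band settings instead of forcing a second parse.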
@SMotaal This is impressive . . . just to understand what you’ve done here, is your goal to determine parse goal by analyzing the syntax? A.k.a. a real implementation of the “unambiguous syntax”/grammar that we’ve been discussing? If so, and assuming that you find an algorithm that works, have you thought about how to address the related concerns listed in #150 (comment)? |
to be clear, it's just a lightweight way of parsing js. this doesn't make the ambiguity go away. |
Confirmed; there does not exist any approach based on parsing that is unambiguous in all cases, absent a language spec change. |
Yeah, while I would love to be the one to solve ambiguity of source text and other sources, this is really nothing more than a very modest effort to model different parsing methods apart from the usual tools. My gut feeling tells me that while implementing solutions is best served by employing tried and tested tools, coming up with optimal solutions may not always share in those benefits. In other words, ASTs have a way of forcing you to look at problems in certain ways, so modeling the problem without them is a way to avoid restricting ourselves to their givens. So this is far from a solution, just an attempt to provide a way to explore solutions, and the bottom line holds: ambiguity is ultimately a source problem, and if it is, then the only way to resolve it is out of band. |
@devsnek the underlying motivation behind my markup experiment in general is not restricted to JS; in fact, I was interested in finding different ways of doing efficient and responsive multi-syntax parsing without the pitfalls of conventional methods. And on that, I think I am ready to dare make the claim that it can be done with virtually no switching overhead, using less popular features like generators and RegExps: html (and script tags)
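A stripped-down sketch of the generator-plus-sticky-RegExp tokenizing style being described; the token names and grammar subset are illustrative assumptions, not the experiment's actual code:

```js
// A sticky RegExp drives a generator; consumers pull tokens lazily and can
// bail as soon as they have what they need, so there is no up-front AST cost.
function* tokenize(source) {
  const matcher = /(\s+)|(\/\/[^\n]*|\/\*[\s\S]*?\*\/)|("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|`(?:\\.|[^`\\])*`)|([A-Za-z_$][\w$]*)|(.)/gy;
  let match;
  while ((match = matcher.exec(source))) {
    const [text, space, comment, string, word] = match;
    if (space) continue;
    yield {
      type: comment ? 'comment' : string ? 'string' : word ? 'word' : 'punctuator',
      text,
    };
  }
}

// Example: stop at the first ESM keyword without scanning the rest.
for (const token of tokenize('export const answer = 42; /* … */')) {
  if (token.type === 'word' && (token.text === 'import' || token.text === 'export')) break;
}
```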
Having both ecmascript-modules and @jkrems' hackable loader has opened up tremendous scope for experimentation.
Note: This thread does not make claims for or against existing tooling, some of which have stood the test of time, evolved, and are fixtures of the ecosystem. The intent is simply to consider different perspectives being explored in experimental efforts.
As far as things go, the broad range of tooling that applies to loaders basically iterates over productions in each source, irrespective of the specifics of implementation or operations.
Most tools are designed to be used for much more complex applications than merely loading. To that effect, they often avoid the use of new language features that would prevent them from working on older platforms. They can also avoid new features which may have been prematurely associated with inefficiencies in early stages. Some are also built with infrastructures or features that are not ideal or not optimized specifically for loading, like using workers, verbose error checking (i.e. as a language service), etc.
I would like to dedicate this thread to brainstorming experimental or just different ideas to implement related patterns for loader-first designs.
How to contribute
Please avoid emoting that can be confusing (especially if it can be construed as passive-aggressive)
Read the Digest
The following is a set of ideas or conclusions curated from the discussions:
Syntax Detection (CJS vs ESM)
- Safely using RegExp — @SMotaal
- Fallback for ESM without `import` and `export` — @targos
- `import(…)` to resolve ambiguity — @bmeck
- `import.meta` — @bmeck
- Dual parsing a module was deemed inefficient — @MylesBorins
Syntax Identification (CJS vs ESM)
- Mime type metadata via something like webpackage — @jkrems
- `package.json` — @GeoffreyBooth
- Magic bytes — @jkrems
Wrapping CJS in an ESM module system