Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esm: support loading data: URLs #28614

Closed
wants to merge 1 commit into from
Closed

esm: support loading data: URLs #28614

wants to merge 1 commit into from

Conversation

bmeck
Copy link
Member

@bmeck bmeck commented Jul 9, 2019

This PR allows loading some formats of modules as data: URLs. This matches the web spec and has some concerns as we grow what can be loaded into ESM as per the open discussion at nodejs/security-wg#520 . I'm opening this with expectation of some discussion to take place around MIME parsing (which is left in #21128) which this PR doesn't properly do and what to do in the cases where module formats are not supported outside of file contexts. Currently I simply didn't include the CJS MIME or the C++ Addon MIME for what can be loaded via data: URLs. We could expose those but likely things would be awkward for things like __filename and that C++ addons would have to be written to disk first for dlopen to work. Once we resolve that we can write up docs on the decision and move this PR forward.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@bmeck bmeck added the esm Issues and PRs related to the ECMAScript Modules implementation. label Jul 9, 2019
@bmeck
Copy link
Member Author

bmeck commented Jul 9, 2019

cc: @nodejs/modules

Copy link
Contributor

@jkrems jkrems left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I think this would be great to have.

Currently I simply didn't include the CJS MIME or the C++ Addon MIME for what can be loaded via data: URLs.

I think that's reasonable, especially at first. The require loader isn't protocol aware so it would be
awkward to support file types that are "require-native" here.

lib/internal/modules/esm/default_resolve.js Outdated Show resolved Hide resolved
lib/internal/modules/esm/translators.js Outdated Show resolved Hide resolved
@mscdex mscdex added the wip Issues and PRs that are still a work in progress. label Jul 9, 2019
@devsnek
Copy link
Member

devsnek commented Jul 9, 2019

this is very exciting :) i had been planning to open something like this when mime parsing landed. as long as this doesn't use proper parsing, it should probably be flagged.

if (parsed.protocol === 'data:') {
const [ , mime ] = /^([^/]+\/[^;,]+)(;base64)?,/.exec(parsed.pathname) || [ null, null, null ];
const format = ({
'text/javascript': 'module',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text/javascript doesn't necessarily mean Module, it could also mean Script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under no JS loading spec does Script get checked against the MIME text/javascript. Script is effectively without a MIME and this table matches web standards.

Copy link
Contributor

@jkrems jkrems Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we support (browser-style) Script anywhere, so it would feel weird to do that here. A CommonJS script would be something like text/vnd.node.js according to nodejs/TSC#371.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then can application/node be added to this object?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expression though — /^([^/]+\/[^;,]+)(;base64)?,/ seems to assume the presence of mime type before the [;,] — regardless of if it is mandatory, it might make sense to test it against long and malformed urls to potentially refine it if necessary.

Sorry for not wanting to muddy this with a bad attempt to wing it here, but I will try to locate the ones I worked on a while back for that very same purpose if it helps.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SMotaal we could, but i can't think of much we could do except limiting the size? Right now this lacks a variety of things, including MIME parameter parsing and the PR for parsing MIMEs is stuck.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is anchored, it is certainly possible to add efficient guards in the current expression. I'd like to take on exploring how we can do that here, which is mainly just to carve a limited allowed chars when delimited per spec (I did this a while back just need to dig).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmeck I looked into the various options for the expression and recommend:

/^(?:((?:text|application)\/(?:[A-Z][-.0-9A-Z]*)?[A-Z]+)((?:;[A-Z][!%'()*\-.0-9A-Z_~]*=[!%'()*\-.0-9A-Z_~]*)*)(;base64)?),/i

This would match any text/ and application/ subtype, along with the attribute-value parameters like charset= (to be parsed separately), and optional base64 (captured separately from previous parameters).

For now, simply being more restrictive of the character ranges for greedy * and + captures is likely all we need to avoid unpredictable performance hazards with very long crafted/malformed strings.

See gist for more details.

Please let me know how to proceed, if this is worth incorporating.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just post a suggestion change and that looks fine

@benjamingr
Copy link
Member

I am +1 on this change and code looks fine. I am not LGTMing this because tests are missing and two LGTMs mean this could land.

I'd also like to echo Jan's statement (about the CJS and addon mime types):

I think that's reasonable, especially at first. The require loader isn't protocol aware so it would be

@guybedford
Copy link
Contributor

guybedford commented Jul 10, 2019

To try to understand how import resolves within data: URLs, does this seem right:

  • data:text/javascript,import './x' would always fail.
  • data:text/javascript,import 'x' would always fail.
  • data:text/javascript,import 'fs' works.
  • data:text/javascript,import '/path/to/file.js' would always fail.
  • data:text/javascript,import 'file:///path/to/file.js' works.

I must admit I'm a little worried about the security model implications, especially around systems that might want to restrict import access. Eg if we have packages on npm that use a import('data:text/javascript,import "' + pathToFileURL(resolve('./dep.js')) + '"') approach then we may struggle to do any type of import restriction in future while retaining working code. Also how does this interact with policy integrity?

In addition, what are the driving use cases for this work?

@bmeck
Copy link
Member Author

bmeck commented Jul 10, 2019

data:text/javascript,import './x' would always fail.

yep. ./x relative to the data URL doesn't make sense and fails since data is not a special scheme. Same as running new URL('./x', 'data:...').

data:text/javascript,import 'x' would always fail.

yep. x has no relative lookup space (node_modules is on a different scheme/context) to resolve its bare name. we could provide ways to populate the lookup space in a way similar to import maps or using loaders.

data:text/javascript,import 'fs' works.

yep. even though we could ban these, it doesn't seem like a big difference given how many evaluators there are with the ability to get fs etc.

data:text/javascript,import '/path/to/file.js' would always fail.

/path/to/file.js is probably incorrect, /path/to/file.js doesn't point to files on the data: scheme, same as if you tried to use that specifier on http it wouldn't point to a file on disk. This mimics running new URL('/path/to/file.js', 'data:...').

data:text/javascript,import 'file:///path/to/file.js' works.

yup, that can be resolved since the specifier is an absolute URL.

I must admit I'm a little worried about the security model implications, especially around systems that might want to restrict import access. Eg if we have packages on npm that use a import('data:text/javascript,import "' + pathToFileURL(resolve('./dep.js')) + '"') approach then we may struggle to do any type of import restriction in future while retaining working code. Also how does this interact with policy integrity?

Policy integrities are how people should restrict import access. We have a variety of evaluators that can access powerful APIs: eval, Function, AsyncFunction, vm.*, worker_threads, etc. . This is just another form of evaluator, the only way to prevent evaluators from accessing powerful APIs would be to introduce mechanisms to prohibit access to those APIs.

The policy files currently are meant to prohibit loading resources that are not whitelisted through module loaders. A user would need to whitelist the data URL and the file URL to dep.js for that to work. Exceptions allowing universal importing are not currently supported by --experimental-policies, nor is the ability to interactively whitelist resources while running. Both would alleviate usability in cases where users are sure they want to allow some resources to have that level of power.

In addition, what are the driving use cases for this work?

Cross environment compatibility and runtime generation of modules in the main module map (with limits similar to browsers). A variety of uses of data URLs are possible, including but not limited to creation of shared module namespaces keyed by strings allowing things like modules to share a communication channel without directly needing to know where the other is.

@jkrems
Copy link
Contributor

jkrems commented Jul 10, 2019

This is just another form of evaluator, the only way to prevent evaluators from accessing powerful APIs would be to introduce mechanisms to prohibit access to those APIs.

That is actually an interesting point - should a policy that prevents eval also prevent import of data URLs? Or is "an untrusted string is passed into dynamic import" already considered insecure enough to make this not matter?

EDIT: I assume that integrity checks or other "allowed module sources" would be the more appropriate policy as opposed to throwing this in with eval.

@bmeck
Copy link
Member Author

bmeck commented Jul 10, 2019

@jkrems alternative works like Trusted Types are being looked at for ways to label "strings" as trusted for evaluators and has some agenda items at this month's TC39 meeting. We could add whatever policies people want to --experimental-policies but I do not consider that in scope for this PR. I would note that if we push things unflagged though, the default values for policies will have to deal with backwards compatibility constraints.

doc/api/esm.md Outdated Show resolved Hide resolved
@jkrems jkrems added the modules-agenda Issues and PRs to discuss during the meetings of the Modules team. label Jul 15, 2019
@bmeck bmeck changed the title DO NOT MERGE esm: support loading data: URLs esm: support loading data: URLs Jul 15, 2019
Copy link
Contributor

@guybedford guybedford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a -0 on this to be completely honest. I think data: URIs are slow and have lots of unfortunate edge cases including security / instancing concerns, and given that loaders don't provide any source hooking mechanism currently, will effectively be the only way to do that providing a pretty bad user experience on that front as well.

At the same time I can appreciate having more web features too.

I'm hopeful we can get more movement on the loader side though soon.

if (parsed.protocol === 'file:') {
return readFileAsync(parsed);
} else if (parsed.protocol === 'data:') {
const [ , base64, body ] = DATA_URL_PATTERN.exec(parsed.pathname);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the case where the regex here fails parsing? (eg data:/)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would error, that invalid URL would have to be returned by a loader though and would already be error checked presumably by the attempted parse in
https://github.com/nodejs/node/pull/28614/files/95eab8ae6b9f5c13dbba10b326ba0e41c12b2274#diff-a7c0a5f3e4fc8503fefebbe82071bc38R48

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the .exec returns null, the destructuring will throw I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll add some checks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in df4dab6

lib/internal/modules/esm/translators.js Outdated Show resolved Hide resolved
@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

I don't see much requests for changes and i think most of the implementation stuff is ironed out so I'm removing WIP

@bmeck bmeck removed the wip Issues and PRs that are still a work in progress. label Jul 22, 2019
@GeoffreyBooth
Copy link
Member

To Guy's point, what's the use case for this other than parity with the Web? I'm familiar with data URIs for inserting images into the DOM and things like that. Expanding the scope of JavaScript libraries that can run in both Node and browsers is its own justification, of course, but does this enable any concrete use cases specific to Node?

@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

@GeoffreyBooth

Thats a pretty open ended question, and data: is a kind of content addressable storage. A variety of possible things could happen if Node were to expand support for data: including things like support from the command line that unlike -e and -p does mimic the environment from a module instead of altering various variables to be special cased. Support for directly loading in various APIs like Workers without eval: true which also is a special cased scenario. These URLs in theory could also be supported by the fs API since fs accepts URL objects, but I doubt we would do so. Other cases like generating source texts on the fly for loading are discussed above by @jkrems . Other cases such as creating a shared global store are also mentioned above. We have a variety of use cases that might be appealing in addition to mere browser support. I however am only PR'ing the browser support level of implementation in this PR, but others can be free to add support for more use cases easily if we have some level of support for data: already.

@ljharb
Copy link
Member

ljharb commented Jul 22, 2019

None of the use cases this PR enables seems to be a new capability, and it seems to me like they can all be better accomplished with existing means. If there’s capabilities to add in the future, perhaps we should wait until then to add the basic support? Perhaps there’s a better way to add those capabilities?

Adding in a new source of eval to node merely because browsers - which don’t have the power and flexibility of a filesystem available to them, and thus have no better option - doesn’t seem like a good addition to the platform to me.

@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

None of the use cases this PR enables seems to be a new capability, and it seems to me like they can all be better accomplished with existing means. If there’s capabilities to add in the future, perhaps we should wait until then to add the basic support? Perhaps there’s a better way to add those capabilities?

These are a different set of capabilities around location being accessed/addressed by content. That on its own is a new capability. In addition this adds compatibility and I do not see a claim of this as being a negative. This is a standardized way of achieving a feature, adding features ad-hoc for specialized circumstances might be useful, but that is not what this PR seeks. It is adding the basic feature that is suitable for a variety of means in a min-max scenario of support just focused on non-specialized loading. If other means can solve use cases with additional APIs that are more tailored to the use case, that seems fine in other PRs.

Adding in a new source of eval to node merely because browsers - which don’t have the power and flexibility of a filesystem available to them, and thus have no better option - doesn’t seem like a good addition to the platform to me.

New sources of evaluation are being added consistently over time to JS the language, Node's runtime, and through new sources like WASM, or even by people implementing them in the ecosystem. I do not understand the argument here vs all those other places that continue to introduce them. If you can explain why having compatibility is a net negative without placing a claim that allowing evaluators are needing to be stopped as a whole needs to be mitigated that would be helpful. As it stands I don't see a clear opinion on why adding this feature is seen as harmful compared to others we continue to add either by choice such as through Worker or by languages such as JS and WASM adding them.

@ljharb
Copy link
Member

ljharb commented Jul 22, 2019

Evaluating a file (a lintable, testable, statically verifiable file) is an entirely different thing than evaluating a string.

@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

@ljharb Worker allows string based eval, as does WASM allow eval through buffers of bytes, as does new JS like AsyncGeneratorFunction, as do a variety of modules off npm. I don't see how this is different from those cases and others that continue to be added.

@ljharb
Copy link
Member

ljharb commented Jul 22, 2019

Just because mistakes are being made elsewhere doesn’t mean we have to repeat them.

@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

@ljharb the claim that others are making mistakes is not one held by all here. I do not see them as mistakes. Evaluators on their own are not problematic is a generally held and stated stance by TC39 meetings when discussing things, if you could explain why this is a problem that should be prevented here in a way that does not call it a vague mistake that would be helpful. In particular, the ability to access powerful APIs that might grant authority that is unwanted is not prevented by preventing evaluators is often a point made as to why evaluators themselves are not problematic. Access to powerful APIs is possible without evaluators and evaluators can be recreated in userland as we saw with various technologies like ASM.js and now WASM becoming popular ways to do so.

Copy link
Contributor

@sam-github sam-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look because of the request for sec-wg review. I'm not sure what the data: scheme does, and the PR and docs seem to require pre-existing familiarity. There is only one example, and its an example of something that doesn't work. Some examples of what does work and a description of what the purpose
and function of data: would be helpful.

doc/api/esm.md Outdated Show resolved Hide resolved
doc/api/esm.md Outdated Show resolved Hide resolved
@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

@sam-github

I'm not sure what the data: scheme does, and the PR and docs seem to require pre-existing familiarity.

data: URLs were originally specified in RFC2397 as a means of specifying the location of a resource by its content + format. In modern tools such as browsers, SVG viewers, etc. they are used for similar purpose generally/increasingly adhering to WHATWG Fetch standardization of a few differences from the original RFC. Do you think we should link to those 2 pages?

Of note, this scheme by being content-addressed does allow for creating resources at runtime. This combined with the nature of ES Modules being a shared global (per V8 Context/Realm) namespace of modules means that recreating a data URL is tantamount to giving a communication channel between modules. This is creation of shared communication is already true for file: and is used for singleton module but of somewhat a different mechanism since it has to go through the fs the first time it is loaded.

Some examples of what does work and a description of what the purpose
and function of data: would be helpful.

Embedding practical examples into the docs might be a bit unruly, perhaps I can explain them in english for now and we can discuss them as examples that way. Due to encoding and needing to be a string many things are not suitable for small scale examples.

All examples would roughly take the form of:

const txt = 'some.javascript()'; // could also be a big buffer of WASM
const src = Buffer.from(txt).toString('base64');
const mime = `application/wasm` || `text/javascript`; // pick based upon txt format
const url = new URL(`data:${mime};base64,${src}`)
import(url);

For example purposes I will just be showing the txt as if it were a file to be encoded by the above.

  1. Avoiding extraneous loading

When loading some things from memory, such as larger files as WASM avoiding creating a second WebAssembly.Module can be achieved through data: in a way that is not present to WebAssembly. instantiate. While data is not the most effective means of achieving this, it is one way.

This same purpose can be applied to some forms of JS singletons. Take a generic log sink.

export const log = [];

Then could be imported in various places, pushed to, and drained without ever coordinating file paths. Using eval/Function/etc. would create new Arrays and would not fill this use case.

  1. Creating shared namespaces

By avoiding extraneous loading we create a singleton of a Module based upon its content. Some polyfilling/framework mechanisms wish to avoid having incompatible versions/duplicates and could use this as a communications channel to see if multiple versions are loaded if they agree on a content-addressed store to check things. Other use cases like modules not wishing to rely on things like environment variables this could be used for configuration purposes.

The example above with a log is an example of this.

  1. Avoiding fs

Hitting the fs can be a costly thing, inlining modules where it makes sense can see some potential gains (as seen in Yarn2, tink, some cloud function based runtimes, etc.). This can also be achieved to some extent by using tools for bundling as well. data: is not a full solution for this use case as it cannot represent some things like cycles on its own.

  1. Ability to refer to evaluable source text

Currently the evaluators often used are not able to be given a unique location, which makes them ill suited to begin to apply things like policies to. By using an absolute location we can achieve the ability to configure these URLs in a manner that is not suitable for things like eval or Function. You could imagine adding policies to constrain APIs within these URLs via mechanisms like #28767

@sam-github
Copy link
Contributor

The example is very useful, as is the enumeration of use-cases, thank you.

Are the supported MIME types doced elsewhere?

If I understand you correctly, in your some.javascript() example, that would be a module that does something at import, but doesn't export anything? And its a singleton, so if a different module did import() again, some.javascript() would not occur?

And thus your const export log = [] example, if that was the JS content, any module importing textually identical data would be able to share the same log array?

The comparison I guess is by exact bytes of uninterpreted data?

I think minimal docs would an example, and a link to the relevant specs so users can find more. We don't need to inline all the JS standard docs into our own, but if our docs don't point to more information they can become basically incomprehensible.

The use-cases are perhaps too specfic to add to the docs and people can follow the YAML annotations to find the PR and this discussion. But then again, if there are things people should know specific to using this mechanism in node (perhaps security pitfalls to avoid?) then its good to document it.

Shouldn't there be YAML change annotations added, since its a new feature? I know these are not exactly APIs and the doc format is different from the rest of the API docs, but knowing what Node.js version data: is available from would be helpful, I think.

@bmeck
Copy link
Member Author

bmeck commented Jul 22, 2019

Are the supported MIME types doced elsewhere?

Not currently, I can add them. This PR supports text/javascript (as Module), application/json, and application/wasm.

If I understand you correctly, in your some.javascript() example, that would be a module that does something at import, but doesn't export anything? And its a singleton, so if a different module did import() again, some.javascript() would not occur?

Correct, only 1 evaluation of the module would occur.

And thus your const export log = [] example, if that was the JS content, any module importing textually identical data would be able to share the same log array?

Correct.

The comparison I guess is by exact bytes of uninterpreted data?

Yes, in the URL, not in the body of the source text. Changing a MIME would change the "location".

I think minimal docs would an example, and a link to the relevant specs so users can find more. We don't need to inline all the JS standard docs into our own, but if our docs don't point to more information they can become basically incomprehensible.

I agree to some extent, I could link to MDN which is more user friendly than the specs though. I'd prefer that.

But then again, if there are things people should know specific to using this mechanism in node (perhaps security pitfalls to avoid?) then its good to document it.

There isn't anything specific to Node that I know of at this time.

Shouldn't there be YAML change annotations added, since its a new feature? I know these are not exactly APIs and the doc format is different from the rest of the API docs, but knowing what Node.js version data: is available from would be helpful, I think.

Can do.

@sam-github
Copy link
Contributor

Thanks, and I agree, MDN is a better link destination.

@bmeck bmeck added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Aug 13, 2019
@Trott
Copy link
Member

Trott commented Aug 13, 2019

This needs a rebase and a CI run.

@Trott Trott removed the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Aug 13, 2019
@bmeck
Copy link
Member Author

bmeck commented Aug 19, 2019

rebased

@nodejs-github-bot
Copy link
Collaborator

@bmeck bmeck added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Aug 19, 2019
@bmeck
Copy link
Member Author

bmeck commented Aug 19, 2019

None of the CI errors seem related to this PR.

@nodejs-github-bot
Copy link
Collaborator

@Trott
Copy link
Member

Trott commented Aug 20, 2019

None of the CI errors seem related to this PR.

You can use "Resume Build" on the left hand side to only re-run the platforms that failed so you can check. (I just clicked it, so it should post here any second now....)

Co-Authored-By: Jan Olaf Krems <jan.krems@gmail.com>
bmeck added a commit that referenced this pull request Aug 20, 2019
Co-Authored-By: Jan Olaf Krems <jan.krems@gmail.com>

PR-URL: #28614
Reviewed-By: Jan Krems <jan.krems@gmail.com>
@bmeck
Copy link
Member Author

bmeck commented Aug 20, 2019

Landed in 9fd9efa

@bmeck bmeck closed this Aug 20, 2019
BridgeAR pushed a commit that referenced this pull request Sep 3, 2019
Co-Authored-By: Jan Olaf Krems <jan.krems@gmail.com>

PR-URL: #28614
Reviewed-By: Jan Krems <jan.krems@gmail.com>
@BridgeAR BridgeAR mentioned this pull request Sep 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. esm Issues and PRs related to the ECMAScript Modules implementation. modules-agenda Issues and PRs to discuss during the meetings of the Modules team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.