Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document the complete list of requirements for the VFS #28

Merged
merged 18 commits into from
Sep 12, 2022

Conversation

RaisinTen
Copy link
Contributor

@RaisinTen RaisinTen commented Aug 24, 2022

This change attempts to exhaustively document all the requirements of
the VFS. If you disagree with a point or want to add more requirements
to this list, you are more than welcome!

Fixes: #21
Fixes: #18
Signed-off-by: Darshan Sen raisinten@gmail.com

@cspotcode
Copy link

Adding the list of scenarios to evaluate potential VFS implementations against, copied from #18, as requested here #18 (reply in thread)

This list will give us a better idea of which modules can be bundled into an SPA as-is, and which use coding techniques that will need to be modified or special-cased to support the SPA runtime.

Static import
static require
dynamic import
dynamic require
passing argv values to FS calls
working directory is / (crops up in docker workflows)
being able to manipulate VFS paths as strings vs URL objects
executable is responsible for globbing, for example, it is a test runner

Globbing scenario

If I run ./globbing-tool './**/*', will it list the contents of the VFS? If the VFS exists at ./globbing-tool then ./globbing-tool/__vfs__/bin.js does match the glob.

If I run ./globbing-tool './**/* | bash-stuff then bash-stuff might try to read ./globbing-tool/__vfs__/bin.js and fail.

#18 (reply in thread)

If code within the SPA does naive globbing -- using an off-the-shelf glob library -- then it might end up traversing the VFS. This depends where the VFS is mounted and if require('fs') can see the VFS. If require('fs') cannot see the VFS, then that causes other problems or limitations which can certainly be documented.

README.md Outdated Show resolved Hide resolved
Copy link
Member

@ovflowd ovflowd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's just one section I feel it is missing... Something regarding ensuring that the VFS respects the paths defined within the application.

What I mean by this is that by having a piece of code that wants to access a file from the VFS (e.g., A simple JSON file), when SEA is building the binary, it needs to ensure that these paths are transformed correctly.

Not sure if this is covered on "No interference with valid paths in the file system", but I want to ensure that if the scenario where a file exists both in the VFS and real-file-system for the same path (e.g.: ./my-file.txt and if the current working directory also has a my-file.txt) that the VFS is smart enough to distinguish what should be accessed.

Another essential point that I feel is not covered here are FS permissions. Like, how we deal with scenarios where we don't have access to the actual file system or how we deal with insufficient permissions...

Finally, something mentioned here is how we deal with large VFSs as we don't want to deal with the application being slow on its bootstrap and execution because of random file accesses and the need to unarchive the whole VFS...

README.md Outdated Show resolved Hide resolved
@ovflowd
Copy link
Member

ovflowd commented Aug 26, 2022

BTW, may be worth adding a benchmarking sub-section to the "supported" area, as we should investigate the performance under numerous circumstances and file systems and architectures of the performance and reliability of some of the VFSs we want to support. (Either the ones we're creating, or the ones we think about supporting)

@cspotcode
Copy link

Not sure if this is covered on "No interference with valid paths in the file system"

I believe this is what we're thinking about in arcanis's explanation for how they mount zip files and avoid accidental deep directory traversal, and the thread about why mounting at /snapshot is a bad idea.

We could break "no interference" into multiple sub-requirements? Pick scenarios from the list here #28 (comment) appended with both "on the real filesystem" and "on the VFS"

For example:
dynamic require can execute a file on the VFS
dynamic require can execute a file on the real filesystem
readFileSync can read any file on the real filesystem
readFileSync can read any file on the VFS

when SEA is building the binary, it needs to ensure that these paths are transformed correctly.

To me, this implies that all VFS paths are known at build-time and somehow transformed. So dynamic readFileSync, dynamic import, dynamic require might fail at runtime. I've been operating under the assumption that we're not talking about a build-time transformation. If other people are assuming a build-time transformation, e.g. somehow rewriting source-code to transform readFile('./foo') into readFile(new URL('nodevfs://foo)), then we can assign names to the two ideas to make them easier to talk about.

@cspotcode
Copy link

In the spirit of #28 (comment):

Should we also add a doc that explains how each VFS works internally? E.g. a primer explaining the characteristics of a Zip file: table of contents at the end for random access, compression is optional but supported, compression can be enabled per-file, can store symlinks.

Then the same for the other VFS implementations being considered.

Not sure if this already exists and I missed it, or if it should be a separate PR.

@ljharb
Copy link
Member

ljharb commented Aug 26, 2022

They don’t have to be known at build-time - the fs wrapper can provide privileged access inside the SEA to dynamic paths.

@ovflowd
Copy link
Member

ovflowd commented Aug 26, 2022

To me, this implies that all VFS paths are known at build-time and somehow transformed. So dynamic readFileSync, dynamic import, dynamic require might fail at runtime. I've been operating under the assumption that we're not talking about a build-time transformation. If other people are assuming a build-time transformation, e.g. somehow rewriting source-code to transform readFile('./foo') into readFile(new URL('nodevfs://foo)), then we can assign names to the two ideas to make them easier to talk about.

I don't think they need to be known at build-time, I was just thinking something like a Proxy or something that transparently wraps around fs so it operates during runtime resolution. But to be honest, it would be neat to transform static imports.

@ovflowd
Copy link
Member

ovflowd commented Aug 26, 2022

Should we also add a doc that explains how each VFS works internally? E.g. a primer explaining the characteristics of a Zip file: table of contents at the end for random access, compression is optional but supported, compression can be enabled per-file, can store symlinks.

+1

@jviotti
Copy link
Member

jviotti commented Aug 29, 2022

Another essential point that I feel is not covered here are FS permissions. Like, how we deal with scenarios where we don't have access to the actual file system or how we deal with insufficient permissions...

Permissions are usually tied to the group/owner, etc distinction that is specific to one instance of the file system and won't translate to other file systems with a different user setup.

That said, I think there is a subset of permissions that can be preserved. Execution is the main one that comes to mind. Tools like ASAR omit all permissions but the executable bit for this reason. The other aspect would be whether a file is readable or writable, but based on various threads so far, I think we agreed that the VFS is read-only, so we can probably ignore this characteristic.

BTW, may be worth adding a benchmarking sub-section to the "supported" area, as we should investigate the performance under numerous circumstances and file systems and architectures of the performance and reliability of some of the VFSs we want to support. (Either the ones we're creating, or the ones we think about supporting)

Yeah, this is exactly what I'm hoping for. I wanted to collect all the requirements we want first (without being biased too much by existing solutions), and then use that as the basis for checking which VFS out there, if any, checks the boxes or not.

If other people are assuming a build-time transformation, e.g. somehow rewriting source-code to transform readFile('./foo') into readFile(new URL('nodevfs://foo)), then we can assign names to the two ideas to make them easier to talk about.

This sounds very hard to do right, mainly with dynamic requires (which we should support). I didn't study this aspect much, but PKG seems to do it very well, so we should at least try to copy that approach cc @jesec

Should we also add a doc that explains how each VFS works internally? E.g. a primer explaining the characteristics of a Zip file: table of contents at the end for random access, compression is optional but supported, compression can be enabled per-file, can store symlinks.

That would be amazing. Can you open a GitHub Issue for this?

Copy link
Member

@jviotti jviotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. We can merge this and send further PRs to clarify later

This change attempts to exhaustively document all the requirements of
the VFS. If you disagree with a point or want to add more requirements
to this list, you are more than welcome!

Fixes: #21
Fixes: #18
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
@RaisinTen
Copy link
Contributor Author

@cspotcode

Static import
static require
dynamic import
dynamic require

I have added a section about this to the docs but I'm not sure what problems we might face while implementing that. Might be better if you could share that info.

passing argv values to FS calls

Could you expand on that?

working directory is / (crops up in docker workflows)

I believe that one is already covered in the "No interference with valid paths in the file system" section.

being able to manipulate VFS paths as strings vs URL objects

Added that in but would like to know if there are any gotchas.

executable is responsible for globbing, for example, it is a test runner

Added a "Globbing" section.

@ovflowd

Another essential point that I feel is not covered here are FS permissions. Like, how we deal with scenarios where we don't have access to the actual file system or how we deal with insufficient permissions...

Are you referring to the implications our VFS might have when nodejs/node#44004 lands?

@ovflowd
Copy link
Member

ovflowd commented Aug 31, 2022

Are you referring to the implications our VFS might have when nodejs/node#44004 lands?

Yea. 😅 (That PR is def a big break-change).

Copy link
Member

@ovflowd ovflowd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@cspotcode
Copy link

@RaisinTen

Thanks, I meant the list as a checklist to evaluate potential VFS implementations. Some are not problems with the current VFS proposals but might be problems with future VFS proposals.

dynamic import
dynamic require

I suppose dynamic is more important than static. If an SPA does require(require.resolve('./tool.config.js')) and wants to execute a file from the real filesystem, it should be able to. And if it tries to require(require.resolve('bundled-internal-dependency')) to execute something from the VFS, then that should also work. Mostly wanted to include this as a reason not to rely on build-time resolution the way that bundlers do.

being able to manipulate VFS paths as strings vs URL objects

If someone proposes that the VFS exist at a vfs-file:// prefix, then this might become an issue. fs APIs accept URL objects, but this means code in (transitive) dependencies which assumes all native paths are strings may fail when passed URL objects. Perhaps a (transitive) dependency uses require.resolve. I'm not sure if require.resolve() is meant to work with URLs.

I can imagine someone proposing vfs-file:// as a solution for "No interference with valid paths in the file system."

passing argv values to FS calls

We can have a couple examples of (contrived) corner-cases such as ./bundled-spa-formatter fmt ./bundled-spa/snapshot/index.js.
Internally, I imagine the formatter will probably do something akin to fs.readFileSync(process.argv[3]), format it, and echo the results to stdout. In this example, process.argv[3] is a path in the VFS. I don't think this is a bad thing -- the example is contrived -- but it might be worth documenting as a well-known quirk of SPAs.

working directory is / (crops up in docker workflows)

Thanks, yeah looks like it's covered.


If the bundled files in the VFS correspond to certain paths that already exist
in the real file system, that will be very confusing, so it should use such
paths that cannot be used by existing files.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More than being confusing, this would outright break certain use-cases.

@jviotti
Copy link
Member

jviotti commented Aug 31, 2022

If someone proposes that the VFS exist at a vfs-file:// prefix, then this might become an issue.

I think we all agree that we can officially drop that idea, as introducing a prefix would break a huge amount of things.

If an SPA does require(require.resolve('./tool.config.js')) and wants to execute a file from the real filesystem, it should be able to. And if it tries to require(require.resolve('bundled-internal-dependency')) to execute something from the VFS, then that should also work.

I think this is the best summary of the entire problem so far. We need a clear answer to this problem. The scenarios that we need to handle are:

  1. The given file path exists in both the VFS and the outer file system
  2. The given file path exists in the VFS but not in the outer file system
  3. The given file path does not exist in the VFS but exists in the outer file system
  4. The given file path doesn't exist anywhere

All the cases except from the first one are trivial. For the first one, a potentially good-enough way of solving the dilemma is a precedence rule. We i.e. prefer the VFS location over the outer file system or vice-versa.

However, which one we prefer is still a tricky problem, and I can see how it would give us pros and cons both ways.

What do you think? Is there an approach other than precedence we would consider, that does not involve inventing an arbitrary prefix for the VFS?

@cspotcode
Copy link

cspotcode commented Aug 31, 2022

I think we should do what yarn does for zips: the VFS is available on sub-paths of the executable, as if the executable itself were a directory even though it is not.

For example, if the executable is at /foo/packaged-cli-tool then fs.readFileSync('/foo/packaged-cli-tool/nested/path/into/the/vfs/index.js') will work.

The executable's own path is the only location on the entire filesystem which is guaranteed not to contain any directories or files.

@jviotti
Copy link
Member

jviotti commented Aug 31, 2022

Ah, that's brilliant. I love it. I can't think of any drawback to that. @RaisinTen shall we document that approach?

Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
@RaisinTen
Copy link
Contributor Author

@RaisinTen shall we document that approach?

@jviotti does this look okay

It might be better to use the single executable path as the base path for the
files in the VFS, i.e., if the executable has `/a/b/sea` as the path and the VFS
contains a file named `file.txt`, it would be accessible by the application
using `/a/b/sea/file.txt`. This approach is similar to how Electron's [ASAR][]
works, i.e., if the application asar is placed in `/a/b/app.asar`, the
embedded `file.txt` file would use `/a/b/app.asar/file.txt` as the path.
?

@RaisinTen
Copy link
Contributor Author

@cspotcode I think I've addressed all your points too


If a single-executable formatter is run with an argument that is a path to a
file inside the VFS, it should be able to use the `fs` APIs to read, format and
print the formatted contents to `stdout`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section should be rewritten. The goal is not to allow this. The goal is to document this as a known gotcha, so authors of SEAs understand that this might happen unexpectedly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the decision we need to make here is: can files within the VFS be referenced from outside the VFS, or only internally?

If we have an approach that disambiguates files inside and outside of the VFS (like the base path approach), then I don't see why we should not allow this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those fundamental questions, let's try to document them in the README. I think it will be useful to have a concrete trail of fundamental questions & answers somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll start GitHub Discussions on a separate category to track these down, and start a PR documenting the results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess most people in #18 were in favor of keeping the paths inside the VFS transparent? I have also changed the globbing part to better reflect that.

@jviotti
Copy link
Member

jviotti commented Sep 8, 2022

@RaisinTen Looks good. My only comment is that rather than saying "It might be better...", say "A possible solution...", etc

Signed-off-by: Darshan Sen <raisinten@gmail.com>
Signed-off-by: Darshan Sen <raisinten@gmail.com>
@RaisinTen
Copy link
Contributor Author

@jviotti done, PTAL

Signed-off-by: Darshan Sen <raisinten@gmail.com>
@RaisinTen
Copy link
Contributor Author

@addaleax I have also moved the data compression part to the optional section since you mentioned that during the meeting, PTAL.

@jviotti
Copy link
Member

jviotti commented Sep 9, 2022

@RaisinTen Let's merge it. If there are further comments, we can send further PRs :)

@RaisinTen RaisinTen merged commit eac8775 into main Sep 12, 2022
@RaisinTen RaisinTen deleted the vfs-requirements branch September 12, 2022 05:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants