deterministic source archives #2948
@Ericson2314 and I discussed this on IRC, and figured that the actual metadata needed is probably (at most) the executable bit. As a result, the v7 tar format (the oldest one, and supported absolutely everywhere) would be a viable option. For determinism, we'd want to constrain:
And sort the file names in a locale-independent manner. (Note: UID zero would not be much of an obstacle to users extracting them, as by default GNU tar does not preserve UIDs at extraction time when executed by an unprivileged user.)

However, v7 has a 100-character limit on path length. The limit can be raised to 255 by moving to the "ustar" format, but that just postpones the issue. For unbounded filenames, we'd need to move to the "pax" format, which is considerably more complex. However, we could say that pax is only used for crates that actually possess such long paths, and thus postpone the issue until such crates exist, while still being 100% deterministic (as a crate will either have only short paths, and thus be v7, or at least one long path, and thus must be pax). Another possibility is to use pax from the start, but that will require more thought on how to make it deterministic.

Also, it may be best to exclude the executable bit - nothing cargo does natively needs it, and build scripts can either set it before executing things, or specify the interpreter for scripts. Putting it in the archive would then be unnecessary.
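To make the idea concrete, here is a minimal sketch using Python's standard `tarfile` module (not Cargo's actual packaging code). The fixed values chosen here (UID/GID 0, mtime 0, empty owner names, modes 0644/0755) are assumptions for illustration; the thread only names UID zero explicitly. `build_archive` is a hypothetical helper name.

```python
import io
import tarfile


def build_archive(files):
    """Build a ustar archive from {path: (data, executable)} and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        # Sort by encoded bytes so the order never depends on the locale.
        for path in sorted(files, key=lambda p: p.encode("utf-8")):
            data, executable = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            # Assumed fixed values: only the executable bit varies.
            info.mode = 0o755 if executable else 0o644
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With every variable field pinned and the entries sorted, the same set of files should produce the same bytes regardless of the order in which they are supplied.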
I'd be down for this! I'd also prefer to stick with tarballs if we can; I don't think there's any reason per se they have to be nondeterministic. @eternaleye note that all those tar formats are currently supported by the tar-rs crate, and Cargo currently uses the ustar format for backwards compatibility, but it's perhaps been long enough now that we can switch to gnu (which is easier to write than pax right now). In any case though it should be easy enough to configure the header there to have whatever data we need, or add a
Well, I'd actually prefer pax to gnu (actually a standard), ustar to either (older standard with more buy-in, simplicity is a benefit in validation, not just implementation), and v7 if we could get away with it (dead simplest).
We can't do v7 because ustar supports longer paths (which is what Cargo supports today). Sticking with ustar, which is what Cargo currently does, is fine.
Makes sense - however, ustar does raise the additional concern of exactly how the path is split between the path and path-prefix fields.
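To illustrate the concern: a ustar header stores a path as a 155-byte prefix plus a 100-byte name, joined by a `/` on extraction, so the same path can often be encoded in more than one way. The sketch below enumerates the valid encodings and then picks one by a fixed rule. The rule used here (shortest prefix that fits) is only an assumption for illustration, not something the thread has agreed on, and both function names are hypothetical.

```python
NAME_MAX = 100    # size of the ustar "name" field in bytes
PREFIX_MAX = 155  # size of the ustar "prefix" field in bytes


def valid_splits(path):
    """Return every (prefix, name) pair that a ustar header could store for path."""
    raw = path.encode("utf-8")
    splits = []
    if len(raw) <= NAME_MAX:
        # Short paths can live entirely in the name field.
        splits.append((b"", raw))
    for i, byte in enumerate(raw):
        if byte == ord("/"):
            prefix, name = raw[:i], raw[i + 1:]
            if 0 < len(prefix) <= PREFIX_MAX and 0 < len(name) <= NAME_MAX:
                splits.append((prefix, name))
    return splits


def canonical_split(path):
    """Pick one split by a fixed rule (assumed here: the shortest prefix that fits)."""
    splits = valid_splits(path)
    if not splits:
        raise ValueError("path does not fit in a ustar header")
    return min(splits, key=lambda s: len(s[0]))
```

Any fixed rule would do for determinism; what matters is that every writer applies the same one, so that two archives of the same files do not differ only in where their paths were cut.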
Well, the idea would be to use v7 where possible, otherwise ustar (and maybe in the future, for really long names, pax). But just using ustar should be fine. On the cargo end, removing the

And should we normalize existing tarballs?
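A rough sketch of that selection rule, assuming the limits quoted earlier in the thread (100 bytes for v7, a 155-byte prefix plus a 100-byte name for ustar). `choose_format` and `fits_ustar` are hypothetical names; the point is only that the format is a pure function of the set of paths, so the choice stays deterministic.

```python
def fits_ustar(raw):
    """True if the encoded path fits a ustar header, with or without a prefix."""
    if len(raw) <= 100:
        return True
    # Otherwise it must be cut at some "/" (0x2F) into prefix and name.
    return any(
        raw[i] == 0x2F and 0 < i <= 155 and 0 < len(raw) - i - 1 <= 100
        for i in range(len(raw))
    )


def choose_format(paths):
    """Return the simplest format able to hold every path in the crate."""
    encoded = [p.encode("utf-8") for p in paths]
    if all(len(raw) <= 100 for raw in encoded):
        return "v7"
    if all(fits_ustar(raw) for raw in encoded):
        return "ustar"
    return "pax"
```

For example, `choose_format(["Cargo.toml", "src/lib.rs"])` picks v7, while a single path too long for ustar moves the whole archive to pax.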
I'm personally in favor of doing so, as otherwise any integrity system we deploy will work differently for crates uploaded before/after the change.

EDIT: Sadly, #2857 already got merged - as a result, this would be disruptive for anyone who has used local mirrors, because it adds just such an integrity system. Is there any chance that could be backed out, or not included in "stable" cargo releases until this is addressed?
Tarballs contain more information than we need (e.g. users, groups, fine-grained permissions, timestamps), and also allow representing the same information in multiple ways (e.g. order of directory contents, files defined twice). The basic problem this creates is that files cannot be deterministically assembled into an archive. In practice this means:
None of these is terribly pressing on its own, but hopefully they are worthy of a solution in aggregate.
The solution is first carefully deciding which metadata we wish to support (that is, the information our archives will contain), and then picking a canonical form for every possible archive containing that information. A thornier question is whether existing uploads should be normalized according to the chosen schema.
For backwards compatibility, it is probably best to stick with some subset of tar. This is what Debian does. Where an extraneous field cannot be elided, it should be constrained to some fixed value. Either the most expressive POSIX tar variant could be used, or the most minimal format that supports the information in question.
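As an illustration of what normalizing an existing upload might look like, here is a minimal sketch using Python's standard `tarfile` module. It is not a proposal for the actual schema: keeping only regular files, preserving just the executable bit, letting a later duplicate entry win, and pinning the other fields to zero or empty are all assumptions, and `normalize` is a hypothetical name.

```python
import io
import tarfile


def normalize(archive_bytes):
    """Re-emit a tar archive in one canonical form, keeping only regular files."""
    entries = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as src:
        for member in src:
            if member.isfile():
                executable = bool(member.mode & 0o100)
                # A file defined twice keeps its last definition, as on extraction.
                entries[member.name] = (src.extractfile(member).read(), executable)
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w", format=tarfile.USTAR_FORMAT) as dst:
        # Byte-wise sort gives a locale-independent entry order.
        for name in sorted(entries, key=lambda n: n.encode("utf-8")):
            data, executable = entries[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mode = 0o755 if executable else 0o644
            # Extraneous fields are constrained to fixed values.
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            dst.addfile(info, io.BytesIO(data))
    return out.getvalue()
```

Two archives that differ only in entry order, ownership, or timestamps should normalize to identical bytes, and normalizing an already-normalized archive should leave it unchanged.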
Other options might be git's tree objects or Nix's NAR. The Merkle DAG used by the former can lead to better error messages and free dedup, but SHA1 is dubiously secure. The latter can be hashed however we like, but still runs into backwards-compatibility problems.
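For readers unfamiliar with the Merkle DAG idea, here is a small sketch of hashing a source tree so that each directory's hash depends only on its entries' names and hashes. This is not git's actual object encoding; the `blob`/`tree` tags and length-prefixed names are assumptions made for the example, and SHA-256 is swapped in for SHA-1 given the security concern above.

```python
import hashlib


def tree_hash(node):
    """Hash a file (bytes) or a directory (dict mapping name -> node)."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"blob\0" + node).digest()
    h = hashlib.sha256(b"tree\0")
    # Byte-wise sort keeps the result independent of insertion order and locale.
    for name in sorted(node, key=lambda n: n.encode("utf-8")):
        raw = name.encode("utf-8")
        # Length-prefix the name so entry boundaries are unambiguous.
        h.update(len(raw).to_bytes(8, "big") + raw + tree_hash(node[name]))
    return h.digest()
```

Because identical subtrees hash identically, they can be stored once, and a mismatch can be narrowed down to the specific directory or file whose hash differs.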
CC @eternaleye