deterministic source archives #2948

Ericson2314 · 2016-08-02T08:14:35Z

Tarballs contain more information than we need (e.g. users, groups, fine-grained permissions, timestamps), and also allow the same information to be represented in multiple ways (e.g. order of directory contents, files defined twice). The basic problem this creates is that files cannot be deterministically assembled into an archive. In practice this means:

Directory registries cannot be verified against lockfiles as well
Packages may accidentally depend on permissions only supported on some platforms
Sources besides registries cannot be mirrored (distinct from what sorts of sources can serve as mirrors)
Users may unintentionally leak information about their current system when publishing packages.

None of these is terribly pressing on its own, but hopefully they are worthy of a solution in aggregate.

The solution is to first carefully decide which metadata we wish to support (the information our archives will contain), and then pick a canonical form for every possible archive containing that information. A thornier question is whether existing uploads should be normalized according to the chosen schema.

For backwards compatibility, it is probably best to stick with some subset of tar. This is what Debian does. Where an extraneous field cannot be elided, it should be constrained to some fixed value. Either the most expressive POSIX tar variant could be used, or the most minimal format that supports the information in question.

Other options might be git's tree objects or Nix's NAR. The Merkle DAG used by the former can lead to better error messages and free dedup, but SHA-1 is dubiously secure. The latter can be hashed however we like, but still runs into backwards-compatibility problems.
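To illustrate why a Merkle-style layout gives deduplication and localized error reporting, here is a minimal sketch. It is not git's actual object format or NAR; the `tree_hash` function, the `file`/`dir` tags, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def tree_hash(node):
    """Hash a source tree given as nested dicts (directories) and bytes (files).

    Hypothetical encoding: each directory hash covers the sorted names of its
    children together with the children's own hashes.
    """
    if isinstance(node, bytes):
        return hashlib.sha256(b"file\0" + node).hexdigest()
    h = hashlib.sha256(b"dir\0")
    # Sort by UTF-8 bytes so the result does not depend on insertion order
    # or on the current locale.
    for name in sorted(node, key=lambda n: n.encode("utf-8")):
        h.update(name.encode("utf-8") + b"\0")
        h.update(bytes.fromhex(tree_hash(node[name])))
    return h.hexdigest()
```

Because every subtree has its own hash, two identical directories hash to the same value wherever they appear, and a mismatch can be narrowed down by comparing child hashes level by level.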

CC @eternaleye

eternaleye · 2016-08-02T15:11:36Z

@Ericson2314 and I discussed this on IRC, and figured that the actual metadata needed is probably (at most) the executable bit. As a result, the v7 tar format (the oldest one, and supported absolutely everywhere) would be a viable option. For determinism, we'd want to constrain:

mode bits 0755 or 0644 (if executable bit not needed, 0644 can be hardcoded)
uid to 0
gid to 0
timestamp to 0
"link indicator" to 0 (normal file)

And sort the file names in a locale-independent manner.

(Note: UID zero would not be much of an obstacle to users extracting the archives, as by default GNU tar does not preserve UIDs at extraction time when executed by an unprivileged user.)
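The constraints listed above can be sketched with Python's standard `tarfile` module. This is only an illustration, not what Cargo or tar-rs does: `build_archive` and its parameters are hypothetical names, and because `tarfile` cannot write the v7 format, the sketch writes ustar instead.

```python
import io
import tarfile


def build_archive(files, executable=frozenset()):
    """Pack {path: bytes} into a ustar archive with every variable field pinned."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        # Sorting by the UTF-8 bytes of each path is locale-independent.
        for name in sorted(files, key=lambda n: n.encode("utf-8")):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mode = 0o755 if name in executable else 0o644
            info.uid = info.gid = 0          # fixed owner
            info.uname = info.gname = ""     # no user or group names
            info.mtime = 0                   # fixed timestamp
            info.type = tarfile.REGTYPE      # normal file
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()
```

Calling `build_archive` twice with the same files in a different insertion order yields byte-identical output, which is the property a checksum in a lockfile would rely on.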

However, it has a 100-character limit on path length. The limit can be raised to 255 by moving to the "ustar" format, but that just postpones the issue. For unbounded filenames, we'd need to move to the "pax" format, which is considerably more complex.

However, we could say that pax is only used for crates that actually possess such long paths, and thus postpone the issue until such crates exist, while still being 100% deterministic (as a crate will either have only short paths, and thus be v7, or at least one long path, and thus must be pax).
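The selection rule described here can be sketched as a pure function of the path list. The name `choose_format` is hypothetical, the 100 limit is the figure quoted in this comment, and measuring length in UTF-8 bytes is an assumption.

```python
V7_PATH_LIMIT = 100  # path limit for v7 as quoted in the discussion


def choose_format(paths):
    """Return "v7" if every path is short enough, otherwise "pax"."""
    longest = max((len(p.encode("utf-8")) for p in paths), default=0)
    return "v7" if longest <= V7_PATH_LIMIT else "pax"
```

Since the result depends only on the set of paths in the crate, the same crate contents always map to the same format, so the choice does not reintroduce nondeterminism.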

Another possibility is to use pax from the start, but that will require more thought on how to make it deterministic.

Also, it may be best to exclude the executable bit - nothing cargo does natively needs it, and build scripts can either set it before executing things, or specify the interpreter for scripts. Putting it in the archive would then be unnecessary.

alexcrichton · 2016-08-02T16:08:17Z

I'd be down for this! I'd also prefer to stick with tarballs if we can; I don't think there's any reason per se that they have to be nondeterministic.

@eternaleye note that all those tar formats are currently supported by the tar-rs crate, and we currently use the ustar format for backwards compatibility, but it has perhaps been long enough now that we can switch to gnu (which is easier to write than pax right now). In any case, it should be easy enough to configure the header there to have whatever data we need, or to add a .deterministic() method to headers in tar-rs.

eternaleye · 2016-08-02T16:31:38Z

Well, I'd actually prefer pax to gnu (actually a standard), ustar to either (older standard with more buy-in, simplicity is a benefit in validation, not just implementation), and v7 if we could get away with it (dead simplest).

alexcrichton · 2016-08-02T16:33:46Z

We can't do v7 because ustar supports longer paths (which Cargo supports today). Sticking with ustar, which is what Cargo currently does, is fine.

eternaleye · 2016-08-02T16:37:35Z

Makes sense - however, ustar does raise the additional concern of exactly how the path is split between the path and path-prefix fields.
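To make the concern concrete: a ustar header stores a path in a 100-byte name field plus a 155-byte prefix field, and a long path can often be split at more than one "/". A deterministic format has to pick one rule. The sketch below uses one possible rule (empty prefix when the path fits, otherwise the leftmost split that fits); both the rule and the name `split_ustar_path` are assumptions, not something the thread settled on.

```python
NAME_MAX = 100    # size of the ustar name field, in bytes
PREFIX_MAX = 155  # size of the ustar prefix field, in bytes


def split_ustar_path(path):
    """Return (prefix, name) as bytes under a fixed, hypothetical splitting rule."""
    raw = path.encode("utf-8")
    if len(raw) <= NAME_MAX:
        return b"", raw  # short paths never use the prefix field
    # Scan the "/" separators left to right and take the first split that fits.
    for i, byte in enumerate(raw):
        if byte != ord("/"):
            continue
        prefix, name = raw[:i], raw[i + 1:]
        if 0 < len(prefix) <= PREFIX_MAX and 0 < len(name) <= NAME_MAX:
            return prefix, name
    raise ValueError("path does not fit in a ustar header")
```

A writer that instead preferred the rightmost valid split would produce a different header for the same path, which is exactly the kind of variation the format definition would need to rule out.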

Ericson2314 · 2016-08-02T18:03:49Z

Well, the idea would be to use v7 where possible, otherwise ustar (and maybe in the future, for really long names, pax). But just using ustar should be fine.

On the cargo end, removing the set_metadata call might make things work in practice, but we should be careful to define our format so that future versions of tar-rs don't inadvertently change it. More interesting to me is sanitization on the crates.io side. @alexcrichton, do you know where the code for that should go?

Ericson2314 · 2016-08-02T18:29:36Z

Well, the idea would be to use v7 where possible, otherwise ustar (and maybe in the future, for really long names, pax). But just using ustar should be fine.

On the cargo end, removing the set_metadata call might make things work in practice, but we should be careful to define our format so that future versions of tar-rs don't inadvertently change it. More interesting to me is sanitization on the crates.io side. @alexcrichton, do you know where the code for that should go?

And should we normalize existing tarballs?
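A server-side check of the kind discussed here might look roughly like the following. This is a sketch under stated assumptions, not crates.io code: `find_violations` is a hypothetical helper, it takes an already decompressed tar stream, and it only checks the fields proposed earlier in the thread (mode, uid, gid, timestamp, file type, ordering).

```python
import io
import tarfile


def find_violations(archive):
    """List the ways an uncompressed tar archive departs from the proposed rules."""
    problems = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        members = tar.getmembers()
    names = [m.name.encode("utf-8", "surrogateescape") for m in members]
    if names != sorted(names):
        problems.append("entries are not sorted by path bytes")
    if len(set(names)) != len(names):
        problems.append("a path is defined more than once")
    for m in members:
        if not m.isreg():
            problems.append(f"{m.name}: not a regular file")
        if m.uid != 0 or m.gid != 0:
            problems.append(f"{m.name}: uid/gid is {m.uid}/{m.gid}, expected 0/0")
        if m.mtime != 0:
            problems.append(f"{m.name}: mtime is {m.mtime}, expected 0")
        if m.mode not in (0o644, 0o755):
            problems.append(f"{m.name}: mode is {m.mode:o}, expected 644 or 755")
    return problems
```

An empty result means the archive satisfies these field-level rules. It does not prove the archive is byte-for-byte canonical, since details such as padding or the ustar name/prefix split are not inspected.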

eternaleye · 2016-08-02T18:30:40Z

I'm personally in favor of doing so, as otherwise any integrity system we deploy will work differently for crates uploaded before/after the change.

EDIT: Sadly, #2857 already got merged - as a result, this would be disruptive for anyone who has used local mirrors, because it adds just such an integrity system. Is there any chance that could be backed out, or not included in "stable" cargo releases until this is addressed?

carols10cents added the C-feature-request and Command-package labels on Sep 26, 2017

This was referenced Oct 8, 2022

Problematic file permissions in crates.io tar archive rust-lang/pin-utils#37

Closed

Cargo unpacks files with too restrictive mode (breaking multi-user shared cargo registry) #3442

Open

epage added the S-triage label on Sep 28, 2023

fenollp mentioned this issue Aug 9, 2024

Compile a crate from its source archive directly rust-lang/rust#128884

Open
