
Enable zstd:chunked support in containers/image #775

Merged: 4 commits, May 14, 2021

Conversation

@giuseppe (Member) commented Nov 19, 2020

This adds a bunch of new APIs to enable zstd:chunked support in containers/image.

It adds a new custom variant of zstd compression that permits retrieving each file separately.

The idea is based on CRFS and its stargz format, which makes tarballs seekable and indexable.
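
To make that possible, the compressed layer carries a table of contents describing every file. As a rough illustration, a per-file entry could look like the sketch below (the struct and field names are illustrative, not necessarily the exact schema used by this PR):

```go
package main

// fileMetadata sketches the kind of per-file TOC entry needed so that
// a client can locate and verify a single file inside the blob.
type fileMetadata struct {
	Type      string `json:"type"`                // "reg", "dir", "symlink", ...
	Name      string `json:"name"`                // path within the layer
	Size      int64  `json:"size,omitempty"`      // uncompressed size in bytes
	Digest    string `json:"digest,omitempty"`    // checksum of the file contents
	Offset    int64  `json:"offset,omitempty"`    // start of the compressed payload in the blob
	EndOffset int64  `json:"endOffset,omitempty"` // end of the compressed payload in the blob
}
```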

[diagram: zstd_chunked]

One disadvantage of the stargz format is that a custom file is added to the tarball to store the layer metadata, and the metadata file is part of the image itself. Clients that are not aware of the stargz format will propagate the metadata file into the containers.

The zstd compression supports embedding additional data as part of the stream that the zstd decompressor will ignore (skippable frame), so the issue above with CRFS can be solved directly within the zstd compression format.
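
As a sketch of the mechanism (relying only on the public zstd frame format, which reserves the magic numbers 0x184D2A50 through 0x184D2A5F for skippable frames; the helper name here is invented for this example):

```go
package main

import (
	"encoding/binary"
	"io"
)

// skippableFrameMagic is the first value of the magic-number range the
// zstd format reserves for skippable frames.
const skippableFrameMagic uint32 = 0x184D2A50

// appendSkippableFrame writes metadata wrapped in a skippable frame:
// a 4-byte magic, a 4-byte little-endian length, then the raw payload.
// Conforming zstd decompressors skip the whole frame during decoding.
func appendSkippableFrame(w io.Writer, metadata []byte) error {
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:4], skippableFrameMagic)
	binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(metadata)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(metadata)
	return err
}
```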

Besides this minor advantage, zstd is much faster and compresses better than gzip, so this is a good opportunity to push the zstd format further.

The zstd compression format has been supported by the OCI image spec since August 2019 (opencontainers/image-spec#788), and by containers/image since then.

Clients that are not aware of the zstd:chunked format won't notice any difference when handling a blob that uses the variant.

Since access to host files is required for the deduplication, the untar doesn't happen in a separate process running in a chroot; it happens in the original mount namespace. To mitigate possible breakages, openat2(2) is used where available.
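
A minimal sketch of that kind of hardened lookup with golang.org/x/sys/unix (the helper and the flag selection are illustrative, not necessarily what this PR does):

```go
package main

import "golang.org/x/sys/unix"

// openUnderRoot opens path relative to rootFd while asking the kernel
// to keep resolution confined: RESOLVE_IN_ROOT treats rootFd as "/" for
// absolute paths and "..", and RESOLVE_NO_SYMLINKS rejects any symlink.
// On kernels without openat2(2) the call fails with ENOSYS, so callers
// still need a fallback path.
func openUnderRoot(rootFd int, path string) (int, error) {
	how := unix.OpenHow{
		Flags:   unix.O_RDONLY,
		Resolve: unix.RESOLVE_IN_ROOT | unix.RESOLVE_NO_SYMLINKS,
	}
	return unix.Openat2(rootFd, path, &how)
}
```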

More details here: containers/image#1084
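
Since the embedded TOC records per-file offsets within the blob, a client can fetch only the chunks it is missing using standard HTTP range requests against the registry. A minimal sketch (the helper name and wiring are illustrative, not taken from this PR):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchChunk retrieves the byte range [offset, endOffset) of a blob,
// e.g. the compressed payload of a single file as recorded in the TOC.
func fetchChunk(blobURL string, offset, endOffset int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return nil, err
	}
	// HTTP ranges are inclusive, hence endOffset-1.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, endOffset-1))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("range request not honored: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```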

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

@rhatdan (Member) commented Nov 19, 2020

Lint is not happy.

@giuseppe force-pushed the zstd-chunked branch 5 times, most recently from eb34310 to 69ca881 on November 23, 2020
@giuseppe force-pushed the zstd-chunked branch 15 times, most recently from fe91364 to 66a8934 on December 1, 2020
@giuseppe (Member, Author) commented Dec 1, 2020

Could I get some feedback on the containers/storage side (especially about the new interfaces)?

The current implementation is feature complete, with some parts that can be added later, e.g. a cache for host files.

One thing to address is how to store the layer metadata. I think it would be better to use a separate file. [Addressed in the last version]

@giuseppe force-pushed the zstd-chunked branch 6 times, most recently from b8cfe6b to 264b1a2 on December 4, 2020
@cgwalters (Contributor) commented:

Honestly I think cyphar just went off the rails a bit here - it's likely all the goals could be achieved with a canonical tar representation and hence be compatible with all existing tar parsers.

@mtrmac (Collaborator) commented May 6, 2021

> But one thing that IMO is so key about both ostree and git is that the format is reproducible - there is exactly one canonical representation of a git commit and an ostree commit

That seems hard to do with compressed artifacts (we want to allow for bug fixes or improvements to the compression implementation, and for image creators to use different programming languages = different implementations — also the implementations might not produce reproducible data just because it wasn’t a goal), if we don’t keep the compressed variant around (like I guess Git does?). We don’t currently keep it around, and the disk space impact could be pretty significant in absolute terms, easily tens of gigabytes (I have no idea what it is relative vs. available disk space).


> If we're going to create effectively a new container format I think it's worth considering addressing that at the same time. Which would here probably end up as something like a "canonical tar format" - and e.g. require that files in the input stream are sorted by name in UTF-8 order, etc.

Yeah, this pair of PRs is big enough and invasive enough that it does warrant some thinking about whether we could be getting more features for a similar amount of complexity.

Intuitively I don’t see reproducibility (from pulled and expanded image stores) as that much of an end goal; users shouldn’t be copying images by pull&push in any case, it’s going to be less efficient than a straight registry-to-registry copy no matter what we do. (Reproducibility is a useful property for a build system, which is a more constrained problem, e.g. it can choose to use a reproducible compression implementation or freeze an implementation version in a way the whole ecosystem can’t.)

It seems interesting to me to think about the size/speed aspects of the incremental pull, to see if we can improve those. (If it turns out that we need a canonical uncompressed tar format for that, sure, let’s; it just doesn’t look like a natural place to start.) At least my life would be easier if we had only a small number of delta/incremental-update mechanisms, instead of giving the user a menu of 5 and having to worry about the 2^5 possible code paths in the worst case; so a format that allows best-of-all-worlds is a nice dream :)

OTOH that’s all entirely theoretical until there’s an idea for how to do that, and interest+time to implement it; and this PR shouldn’t be blocked on an unachievable hypothetical.


[In a sense, the uncompressed tar format is reproducible already, due to tar-split. (That‘s the other kind of reproducible, from a pulled image, not from a filesystem during a build process.) I don’t know whether that helps because I’m not sure what we need reproducibility/canonical form for. Just for more layer reuse across separately-built images?]

@cgwalters (Contributor) commented May 6, 2021

> That seems hard to do with compressed artifacts

Yes; we can really only care about the uncompressed version. Clients would need to verify the uncompressed checksum and not the compressed one.
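
A minimal sketch of such a check, assuming github.com/klauspost/compress/zstd as the decompressor (function name and wiring are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"

	"github.com/klauspost/compress/zstd"
)

// verifyUncompressedDigest streams a zstd-compressed blob through the
// decoder and compares the sha256 of the *uncompressed* bytes (the
// DiffID in the image config) against the expected value.
func verifyUncompressedDigest(compressed io.Reader, wantHex string) error {
	dec, err := zstd.NewReader(compressed)
	if err != nil {
		return err
	}
	defer dec.Close()
	h := sha256.New()
	if _, err := io.Copy(h, dec); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantHex {
		return fmt.Errorf("uncompressed digest mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}
```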

That said, after some reflection I think "canonical tar" could be done after this in theory - they're mostly orthogonal problems.

> [In a sense, the uncompressed tar format is reproducible already, due to tar-split.

(looked this up, seems to be 9b04055)

> I don’t know whether that helps because I’m not sure what we need reproducibility/canonical form for. Just for more layer reuse across separately-built images?]

Basically the same rationale as https://reproducible-builds.org/. The downside of tar-split is that the original tar stream becomes canonical and has to be carried around.
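
For reference, a rough sketch of the recording side, assuming tar-split's documented asm/storage API:

```go
package main

import (
	"io"

	"github.com/vbatts/tar-split/tar/asm"
	"github.com/vbatts/tar-split/tar/storage"
)

// recordTarSplit reads a tar stream and writes to metaOut the JSON
// metadata tar-split needs to later rebuild the byte-identical stream
// (headers, ordering, padding) from the unpacked files.
func recordTarSplit(tarIn io.Reader, metaOut io.Writer) error {
	packer := storage.NewJSONPacker(metaOut)
	// A nil FilePutter discards the file payloads; only the stream
	// structure and payload checksums are recorded.
	r, err := asm.NewInputTarStream(tarIn, packer, nil)
	if err != nil {
		return err
	}
	// Drain so every entry flows through the packer.
	_, err = io.Copy(io.Discard, r)
	return err
}
```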

@giuseppe (Member, Author) commented May 7, 2021

I've marked all the APIs in containers/storage and containers/image as experimental, to have the freedom of introducing breaking changes until we are happy with them. I am planning to do more incremental improvements later, as well as possibly supporting stargz/estargz for compatibility with clients and registries that don't support zstd yet.

The reason I am pushing to get this merged is so that we can finally get users to play with it and gather feedback, without expecting them to deal with re-vendoring containers/storage and containers/image into the container tools.

We can still improve the new format and add tar reproducibility later. I think the tar-split blob can be embedded in the zstd stream, in the same way the TOC metadata is added right now.

@giuseppe (Member, Author) commented May 7, 2021

fedora-32 vfs fails to start, but I guess we can probably drop tests on Fedora 32

@Romain-Geissler-1A commented May 7, 2021

Hi,

FYI, I am speaking from a docker/podman user point of view.

If I understand this proposal correctly, it is more or less similar to the eStargz proposal, but using zstd instead, so that metadata unknown to old clients can be injected without impact, whereas eStargz currently results in unexpected files being injected inside the running container. Also, if I understand correctly, this allows the client side to share parts of files found in different and unrelated images, as long as those files share the same hash, and that is really cool! While this proposal seems to be a very good improvement on the "pull" side, is it also an improvement for the "push" scenario, and for storage on the remote registry? I mean, let's say a file's content already exists in some image in a remote registry; if I build a local image using this zstd format that contains a file with the same content, when I run "podman push", will it push the full zstd tar file, or only the parts that the registry doesn't know yet? And on the registry side, will it be able to serve the "big" zstd file as a collection of smaller concatenated zstd tar files, and thus only store split zstd tar files rather than the "big" ones?
If this pull request also eventually improves the "push" scenario, that is super cool!

Let me explain why we at Amadeus might be interested in such a proposal (as users). We have many applications which are C++ based. We use (for now) mainly dynamic linking, and contrary to other companies our teams are very interdependent: we have several layers of C++ middleware teams developing diverse libraries that are eventually reused by application developers. Our applications are quite big and tend to need many dependencies from various teams; compiled .so shared libraries are often 10 MB files, sometimes up to 100+ MB, even after being stripped. We have thousands of these libraries in total, and big applications might require a lot of them. In production images, we install some of these libraries using "RUN dnf install some-amadeus-package", but the majority are installed a simpler way, by just untarring a tar.gz file into a new OCI layer.

To develop internally, we also use Docker images, so these come with GNU C/C++ based toolchains, which are also very heavy (a single toolchain containing both gcc and LLVM tools can easily weigh 1 GB in total). While we do offer one "standard SDK" base image where most tools are present, some teams choose to develop their own image, writing their own Dockerfile. Most of our tools are available as RPM packages, so they naturally write "RUN dnf install some-amadeus-packages && dnf clean all".

The fact that we have many teams highly dependent on each other, and that each team's set of dependencies is different, means that in the end we have many different SDK/production images in our registries, and layer storage diffs are very rarely shared between images, even though we have MANY big files in common across all of them. Anything based on "dnf" is doomed to fail at sharing layers: "dnf install" results in unique timestamps, and not only that (i.e. we can't just fix it by using "podman build --timestamp"), since "dnf" internally relies on a database which contains a different package list for every team, thus looks different from a filesystem point of view, thus leading to a different tar.gz sha256.

So on the Amadeus side, proposals that would allow sharing not only "big" layer filesystem diff tar.gz blobs, but individual file contents (i.e. without taking metadata into account) within these tar.gz (or tar.zstd) blobs, would definitely help us save a lot of disk space and network bandwidth by downloading/pushing just the "new" files. We are concerned with the registry side as well, not only the client side, as we host our own internal JFrog Artifactory registry, and we regularly receive mails asking us to clean up all the docker images we no longer use, as it fills up pretty quickly.

I am curious about this one, and about the mentioned OCIv2 as well ;)

Cheers,
Romain

@ktock (Contributor) commented May 7, 2021

@Romain-Geissler-1A

This is exactly one of the requirements of OCIv2! It might require API changes on the registry side.

But if the registry can understand seekable compression formats (e.g. zstd:chunked, eStargz, stargz) and the TOC format, I think we don't even need to change the image spec: uploads could be performed with PATCH requests against these layers, at chunk-level granularity.

Anyway, as mentioned in #775 (comment), I think we can take care of it in follow-up PRs.

@giuseppe (Member, Author) commented:

@mtrmac @cgwalters are you fine with the current status of this PR? Or anything against moving this forward?

@cgwalters (Contributor) commented:

In rpm-ostree we have rpm-ostree ex <subcommand>, which is a place to ship commands that are marked experimental. Does podman have anything like this? Basically it creates a middle ground in the otherwise vast gulf separating "code in a PR" and "code shipped to vast numbers of people via distribution stable releases".

In particular as soon as this is vendored in podman for example, then it becomes effectively frozen in that we need to support the format ~forever right? Whereas if it requires something like echo enable-experimental-zstd-chunked: true >> ~/.containers/storage.conf that's a different thing.

I only skimmed the design, and barely any of the code.

@mtrmac (Collaborator) commented May 11, 2021

> @mtrmac @cgwalters are you fine with the current status of this PR? Or anything against moving this forward?

  • I am still seriously worried about host file reuse. Why is it known to be secure?
  • Shouldn’t @nalind review this either way? (I’m not all that familiar with the codebase, and I have only quickly skimmed it once.)

@rhatdan (Member) commented May 11, 2021

As far as scanning for matching files in the base OS, @giuseppe has agreed to make this optional and just for /usr. We can turn it off by default, and then for certain workloads users could turn it on.

@giuseppe (Member, Author) commented:

> As far as scanning for matching files in the base OS, @giuseppe has agreed to make this optional and just for /usr. We can turn it off by default, and then for certain workloads users could turn it on.

It is already disabled by default in the latest version of these patches, and it must be enabled in the storage.conf configuration file.

> In particular as soon as this is vendored in podman for example, then it becomes effectively frozen in that we need to support the format ~forever right? Whereas if it requires something like echo enable-experimental-zstd-chunked: true >> ~/.containers/storage.conf that's a different thing.

It won't be enabled by default yet; it must be explicitly configured (I haven't decided yet whether through the storage.conf config file or through some env variable).

@giuseppe (Member, Author) commented:

@mtrmac @nalind @rhatdan so, are we fine to move forward with this PR (also considering the changes needed in c/image)?

@rhatdan (Member) commented May 14, 2021

Since this is experimental, I am going to merge, so we can move forward.

@rhatdan merged commit 4f69205 into containers:master on May 14, 2021
@cgwalters (Contributor) commented:

I randomly came across https://github.com/dragonflyoss/image-service - not sure if you all have seen it. May be worth a "Related projects" section somewhere?
