Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust: Add a composefs-oci crate #286

Closed
wants to merge 2 commits into from
Closed

Conversation

cgwalters
Copy link
Contributor

@cgwalters cgwalters commented May 26, 2024

The high level goal of this crate is to be an opinionated
generic storage layer using composefs, with direct support
for OCI. Note not just OCI containers but also including
OCI artifacts too.

This crate could to be the successor to
the "storage core" of both ostree and containers/storage.


This is an initial sketch! It'd be good to sync on the design/goals; it started to "feel" right but obviously there's a whole lot going on. In particular, when I was going through the current containers/storage composefs code...there's a few things here where I am not sure it has the right architecture.

  • I think relying on zstd:chunked and especially the part about indexing objects by their sha256 is wrong. We should index by the fsverity digest - especially in a time when people want to store potentially multi-GB files in container images, computing a checksum in two different ways is quite suboptimal. Note that using a new C API exposed by this project, if the filesystem doesn't physically support fsverity, we just compute it in userspace.
  • The repo also tracks whether fsverity was supported at init time and can be configured to hard require it. By tracking enablement we can also support a "migrate repo to enabling fsverity" for e.g. XFS when it gains that support.
  • Related to the above on the bootc side we definitely want file sharing between images; this will lead to a need for GC handling (xref Add shared library/tool for managing backing store files #125 )
  • There's a logistical issue in Go vs Rust going on here...we can bridge it of course, but:
  • Managing file paths in userspace makes me nervous; there's lots of subtleties. This design sketch hard requires and uses openat2 for unpacking, ensuring the kernel handles things like canonicalization of // etc. and symlink traversal.

What I want to do next here is teach mkcomposefs to optionally honor a user.cfs.meta xattr. This way we can unpack a tarball to disk, but instead of e.g. physically setting sensitive file metadata like owner uid, gid, suid bits, and security-related xattrs, we can do something much like what ostree's bare-user mode does and store them in an xattr. We can also skip making specials like device nodes and FIFOs "physically" and just make a zero-sized regular file with the xattr.

anyhow::bail!("Requested fsverity, but target does not support it");
}
let meta = RepoMetadata {
version: String::from("0.5"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said 0.1 above.

};
let mut digest = Digest::new();
composefs::fsverity::fsverity_digest_from_fd(tmpfile.as_file().as_fd(), &mut digest)
.context("Computing fsverity digest")?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the filesystem doesn't support fs-verity, this will re-read the file to compute the fs-verity. We already have fs-verity digest support in libcomposefs that supports streaming, so we could have computed this digest while writing to the tmpfile above.

I guess that would be unnecessary in the case where the target fs supports fs-verity, as then we compute it twice, but we can avoid that if self.has_verity().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. I think though a better optimization will be to move the fsverity computation (whether userspace or kernel) into a worker thread pool, so we leave the main thread's job to basically reading from the network and saving data as quickly as possible. This way we avoid intermixing network and CPU bound work on the same thread. (OTOH on modern processors with sha256 acceleration maybe it doesn't matter, I haven't measured)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't know what the cost is of sha256 these days. Sometimes modern CPUs are fast enough that doing computation on memory is basically free compared to the memory bandwidth cost, but I don't know if that is true with sha256. And in this case its even compared to i/O as well. If the file is large, then we may actually have to read it back from disk (rather than cache).

I did some tests with sha256sum on my (pretty fast) machine on a 1GB file. cat to /dev/null takes ~270 msec uncached, and ~65 msec cached (real time). Running sha256sum instead takes 460 msec uncached (although this time varies more), and 440 msec cached.

I guess we can say that sha256 computation is the more costly operation compared to disk i/o. Cached read+sha is 7x time slower than a cached read. And total sha256 time is almost the same for uncached and cached reads, meaning that the disk read times were small (or at least pipelined) compared to the computation time.

Given the above we might want something like a threadpool that does fixed block size sha56 computation across n_cpu threads. Then we could just feed it block by block, getting parallelization even across a single large file.

@alexlarsson
Copy link
Collaborator

What I want to do next here is teach mkcomposefs to optionally honor a user.cfs.meta xattr. This way we can unpack a tarball to disk, but instead of e.g. physically setting sensitive file metadata like owner uid, gid, suid bits, and security-related xattrs, we can do something much like what ostree's bare-user mode does and store them in an xattr. We can also skip making specials like device nodes and FIFOs "physically" and just make a zero-sized regular file with the xattr.

I'm not sure I understand why you want this? During the import, can't you just compute the dump-format file line by line in combination with the object files and pass that to mkcomposefs? Why do you need the file metadata "on-disk"?

This allows sharing them easily, as we have multiple crates.

Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters cgwalters force-pushed the rust-oci branch 3 times, most recently from 3b2f2b2 to 4318f72 Compare August 3, 2024 18:08
The high level goal of this crate is to be an opinionated
generic storage layer using composefs, with direct support
for OCI.  Note not just OCI *containers* but also including
OCI artifacts too.

This crate is intended to be the successor to
the "storage core" of both ostree and containers/storage.

Signed-off-by: Colin Walters <walters@verbum.org>
@cgwalters
Copy link
Contributor Author

For now I think it makes sense actually to keep this repository with just "core" functionality.
I have created https://github.com/cgwalters/composefs-oci as an experimental place to iterate.

@cgwalters cgwalters closed this Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants