Add support for creating a composefs from a directory #36

Merged
merged 5 commits into from
Nov 20, 2024
53 changes: 53 additions & 0 deletions doc/oci.md
@@ -0,0 +1,53 @@
# How to create a composefs from an OCI image
Collaborator

This is a really useful document!


This document is incomplete. It only serves to document some decisions we've
taken about how to resolve ambiguous situations.

# Data precision

We currently create composefs images with the same granularity of metadata that
typically appears in OCI tarballs:
- atime and ctime are not present (these are not physically present in the
  erofs inode structure at all, in either the compact or the extended form)
- mtime is set to the mtime in seconds; the sub-second value is simply
  truncated (ie: we always round down). erofs has an nsec field, but
  sub-second timestamps are not normally present in OCI tarballs: the usual
  tar header only has timestamps in seconds, and extended headers are not
  usually added for this purpose.
- we take great care to faithfully represent hardlinks: even though the
  produced filesystem is read-only and we have data de-duplication via the
  objects store, we make sure that hardlinks result in an actual shared inode
  as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem.

We also apply these precision restrictions when creating images by scanning the
filesystem. For example: even if we get more accurate timestamp information,
we'll truncate it to whole seconds.
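
As a rough illustration of the "always round down" rule, here is a minimal sketch of reducing a scanned timestamp to whole seconds. The helper name `mtime_seconds` is hypothetical and not part of this PR, and the handling of pre-epoch times is an assumption about what "round down" should mean there.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper (not from the PR): reduce a `SystemTime` to whole
/// seconds since the Unix epoch, always rounding down.
fn mtime_seconds(t: SystemTime) -> i64 {
    match t.duration_since(UNIX_EPOCH) {
        // At or after the epoch: dropping the sub-second part rounds down.
        Ok(d) => d.as_secs() as i64,
        // Before the epoch: `e.duration()` is how far `t` lies before it,
        // so any fractional part pushes the result one second further back.
        Err(e) => {
            let d = e.duration();
            let secs = d.as_secs() as i64;
            if d.subsec_nanos() > 0 {
                -secs - 1
            } else {
                -secs
            }
        }
    }
}
```

With this definition a file modified at 5.999999999 seconds gets an mtime of 5, never 6.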

# Merging directories

This is done according to the OCI spec, with an additional clarification: in
case a directory entry is present in multiple layers, we use the tar metadata
from the most-derived layer to determine the attributes (owner, permissions,
mtime) for the directory.
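
The "most-derived layer wins" rule can be sketched as a simple overwrite while walking the layers from base to top. This is only an illustration: `DirAttrs` and `merge_dir_attrs` are made-up names, the per-layer maps stand in for parsed tar headers, and whiteouts and non-directory entries are ignored.

```rust
use std::collections::BTreeMap;

/// Directory attributes as carried by a tar header (illustrative type).
#[derive(Clone, Debug, PartialEq)]
struct DirAttrs {
    uid: u32,
    gid: u32,
    mode: u32,
    mtime: i64,
}

/// `layers` is ordered from the base layer to the most-derived layer; each
/// map holds the directory entries found in that layer's tarball.
fn merge_dir_attrs(layers: &[BTreeMap<String, DirAttrs>]) -> BTreeMap<String, DirAttrs> {
    let mut merged = BTreeMap::new();
    for layer in layers {
        for (path, attrs) in layer {
            // A later (more-derived) layer replaces whatever came before.
            merged.insert(path.clone(), attrs.clone());
        }
    }
    merged
}
```

A directory that appears only in a lower layer keeps that layer's attributes, since nothing overwrites it.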

# The root inode

The root inode (/) is a difficult case because it doesn't always appear in the
layer tarballs. We need to make some arbitrary decisions about the metadata.

Here's what we do:

- if any layer tarball contains an entry for `/` then we'd like to use it.
  The code for this doesn't exist yet, but it seems reasonable as a principle.
  In case the `/` entry were to appear in multiple layers, we'd use the
  most-derived layer in which it is present (as per the logic in the previous
  section).
- otherwise:
  - we assume that the root directory is owned by root:root and has `a+rx`
    permissions (ie: `0555`). This matches the behaviour of podman. Note in
    particular: podman uses `0555`, not `0755`: the root directory is not
    (nominally) writable by the root user.
  - the mtime of the root directory is taken to be equal to that of the most
    recent file in the entire filesystem, that is: the highest numerical value
    of any mtime on any inode. The rationale is that this is usually a very
    good proxy for "when was the (most-derived) container image created".
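
The decision procedure above can be summarised in a small stand-alone sketch. It is not the code from this PR (which uses sentinel values on the existing `Stat` struct): `RootStat` and `resolve_root` are hypothetical names, the explicit-entry branch models the stated principle rather than existing code, and falling back to 0 for an image with no inodes is an assumption of the sketch.

```rust
/// Illustrative root metadata (hypothetical type, not the PR's `Stat`).
#[derive(Clone, Debug, PartialEq)]
struct RootStat {
    mode: u32,
    uid: u32,
    gid: u32,
    mtime: i64,
}

/// `explicit` is the `/` entry from the most-derived layer that has one, if
/// any. `inode_mtimes` holds the mtime in seconds of every inode in the image.
fn resolve_root(explicit: Option<RootStat>, inode_mtimes: &[i64]) -> RootStat {
    explicit.unwrap_or_else(|| RootStat {
        mode: 0o555, // a+rx: nominally not writable, even by root
        uid: 0,
        gid: 0,
        // Highest mtime of any inode; 0 for an empty image is an assumption.
        mtime: inode_mtimes.iter().copied().max().unwrap_or(0),
    })
}
```

When no layer carries a `/` entry, the result is `0555`, root:root, and the newest mtime found anywhere in the tree.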
41 changes: 37 additions & 4 deletions src/image.rs
@@ -155,6 +155,20 @@ impl Directory {
     pub fn remove_all(&mut self) {
         self.entries.clear();
     }
+
+    pub fn newest_file(&self) -> i64 {
+        let mut newest = self.stat.st_mtim_sec;
+        for DirEnt { inode, .. } in &self.entries {
+            let mtime = match inode {
+                Inode::Leaf(ref leaf) => leaf.stat.st_mtim_sec,
+                Inode::Directory(ref dir) => dir.newest_file(),
+            };
+            if mtime > newest {
+                newest = mtime;
+            }
+        }
+        newest
+    }
 }

 pub struct FileSystem {
@@ -172,10 +186,10 @@ impl FileSystem {
         FileSystem {
             root: Directory {
                 stat: Stat {
-                    st_mode: 0o755,
-                    st_uid: 0,
-                    st_gid: 0,
-                    st_mtim_sec: 0,
+                    st_mode: u32::MAX, // assigned later

Collaborator

Why not Option (here and below)

Collaborator Author

That would require adding Option on the Stat structure, which doesn't make sense for any case other than this one. This is ugly, but it works.

+                    st_uid: u32::MAX, // assigned later
+                    st_gid: u32::MAX, // assigned later
+                    st_mtim_sec: -1, // assigned later
                     xattrs: RefCell::new(BTreeMap::new()),
                 },
                 entries: vec![],
@@ -246,6 +260,25 @@ impl FileSystem {
             todo!();
         }
     }
+
+    pub fn done(&mut self) {
+        // We need to look at the root entry and deal with the "assign later" fields
+        let stat = &mut self.root.stat;
+
+        if stat.st_mode == u32::MAX {
+            stat.st_mode = 0o555;
+        }
+        if stat.st_uid == u32::MAX {
+            stat.st_uid = 0;
+        }
+        if stat.st_gid == u32::MAX {
+            stat.st_gid = 0;
+        }
+        if stat.st_mtim_sec == -1 {
+            // write this in full to avoid annoying the borrow checker
+            self.root.stat.st_mtim_sec = self.root.newest_file();
+        }
+    }
 }

 pub fn mkcomposefs(filesystem: FileSystem) -> Result<Vec<u8>> {
2 changes: 2 additions & 0 deletions src/oci/image.rs
@@ -65,6 +65,7 @@ pub fn compose_filesystem(repo: &Repository, layers: &[String]) -> Result<FileSy
     }

     selabel(&mut filesystem, repo)?;
+    filesystem.done();

     Ok(filesystem)
 }
@@ -98,6 +99,7 @@ pub fn create_image(
     }

     selabel(&mut filesystem, repo)?;
+    filesystem.done();

     let image = mkcomposefs(filesystem)?;
     repo.write_image(name, &image)