
Introduce new type PAYLOAD_LINK #1443

Closed
giuseppe wants to merge 3 commits

Conversation

giuseppe
Member

giuseppe commented Feb 2, 2018

It is used to keep track of the payload checksum for files stored in the repository.

The goal is that files having the same payload but different xattrs can take advantage of reflinks where supported.

More details here: https://mail.gnome.org/archives/ostree-list/2018-January/msg00012.html
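
A minimal sketch of the underlying mechanism, assuming a filesystem with reflink support (this is not the PR's code; the file names and the xattr key are invented for illustration, and FICLONE is the ioctl that glnx_regfile_copy_bytes attempts):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/xattr.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE, Linux >= 4.5 */

int
main (void)
{
  /* "payload" stands in for an existing object file, "copy" for a new
   * object that should share its content. */
  int src = open ("payload", O_RDONLY);
  int dst = open ("copy", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (src < 0 || dst < 0)
    return 1;

  /* Share extents instead of duplicating bytes; on filesystems without
   * reflink support this fails and a plain byte copy is needed instead. */
  if (ioctl (dst, FICLONE, src) < 0)
    perror ("FICLONE");

  /* The two files now share the payload but can carry different xattrs
   * (e.g. different SELinux labels); "user.example" is just a demo key. */
  fsetxattr (dst, "user.example", "demo", 4, 0);

  close (src);
  close (dst);
  return 0;
}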

@cgwalters
Member

I wonder if it'd be simpler to add a special xattr that we know about and also filter out.

@giuseppe
Member Author

giuseppe commented Feb 5, 2018

how could we use the xattr to look up the payload checksum?

@cgwalters
Member

how could we use the xattr to look up the payload checksum?

Yeah, that's an issue. We could scan all objects, but that'd get slow unless amortized.

So...hmm. A big picture question here in my mind is the (still unsettled?) degree to which libostree provides low-level APIs for the OCI case, versus doing things at a higher level.

If we provide high level APIs, we're more free to change/optimize the implementation details later.

One other random thought here - how about only doing this for say files over 5MB (or some configurable threshold) instead? Trading off indexing overhead versus space savings.

Another issue here is I think it needs to be opt-in; otherwise we're taking up extra space for "pure libostree" users who aren't doing containers/rpms/whatever on top.

@giuseppe
Member Author

giuseppe commented Feb 6, 2018

Yes, good point. I've added a new commit that implements a minimum-threshold configuration; the payload link will be created only for files whose size is greater than or equal to the threshold.

Are you fine with a default value of 3 MiB?

If you'd prefer the feature to be disabled by default, we can set a much larger default value, which has the same effect.


About OCI, we should probably have a bigger discussion around it. If we dealt with it directly in OSTree, we would end up duplicating a lot of functionality that is already in containers/image. For a third-party project it can be painful to interact with both Golang and C, but having the OCI bits in containers/image has its advantages: the OSTree storage can use, and be used by, all the other tools that already build on containers/image, such as Buildah.

One feature I particularly like is:

containers/image#392

In addition to copying images into the OSTree storage, we can also copy them back to other storage backends, such as the Docker engine or a registry.

The OSTree storage part of containers/image, even if currently used only for system containers, is a direct mapping from OCI to OSTree that can be used for any kind of OCI image. There are no system-container-specific features in the storage part.

giuseppe changed the title from "[RFC] Introduce new type PAYLOAD_LINK" to "Introduce new type PAYLOAD_LINK" on Feb 9, 2018
@giuseppe
Member Author

Any feedback on the design?

@cgwalters
Member

cgwalters commented Feb 13, 2018

Are you fine with a default value of 3 MiB?

But that's still imposing an unnecessary cost on everyone who is doing "pure ostree" embedded devices, etc. We could have it be a repo flag like

$ cat /ostree/repo/config 
[core]
repo_version=1
mode=bare
contentindex=true

or something. And for Project Atomic systems we'd enable that flag.

Or perhaps have an ostree_repo_enable_content_indexing().
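
As a minimal sketch of the config-flag variant (the contentindex key name is taken from the example above; the helper itself is hypothetical), reading the opt-in from the repo config, which is a GKeyFile-style ini file, could look like:

#include <glib.h>

/* Hypothetical helper: the feature stays off unless the repo config
 * explicitly sets contentindex=true in the [core] group. */
static gboolean
content_indexing_enabled (GKeyFile *repo_config)
{
  /* A missing key returns FALSE, so "pure libostree" users pay nothing. */
  return g_key_file_get_boolean (repo_config, "core", "contentindex", NULL);
}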

But the configuration entrypoint aside, my main concern here is we're basically doubling the number of files we create; on this workstation:

$ find /ostree/repo/objects/ -name '*.file' | wc -l
150393
$ find /ostree/repo/objects/ | wc -l
169358

And like I said before, I'm already not totally happy with how many small metadata files we have. In fact it'd be interesting to investigate not hardlinking small objects at all. Is it really worth doing hardlinks for, e.g., files under 1 KiB? Skipping the hardlink would mean less pressure on the filesystem journal from inode updates.

Another concern is that now pruning objects isn't atomic; it looks like you made the commit process ignore "dangling" payload links, but it should probably delete them so they can be recreated.

@cgwalters
Member

Also big picture, we know this is an issue for SELinux systems, but what about SMACK/AppArmor? IIRC SMACK uses xattrs but I don't know how aggressive it is with distinct labels like SELinux is.

TBH I don't think it's worth your time digging into SMACK/AppArmor, but there are definitely ostree users who don't use SELinux, and in that case again we don't need the content indexing, right?

@giuseppe
Member Author

The number of files larger than 3 MiB is a small subset of all the files; this is what I have in my local repository:

$ find -name '*.file' | wc -l
9603
$ find -name '*.payload-link' | wc -l
10

But anyway, it still requires support from the file system to be useful, so I will change the default to infinite for now. Or do you prefer a different option?

Yes, I am not happy about the pruning algorithm either. Do you think we should recalculate everything? I didn't implement it that way as it looks quite expensive (prune would be as expensive as fsck); maybe I can delete them all and recreate only those for files that are still present in the repository after the pruning.

@cgwalters
Member

Oh wow, you're right...that's actually pretty amazing; on my current workstation the ratio of "3MB+" objects is 211/150393, or 0.1%. At a quick glance there are statically linked Golang binaries, but also fonts, qemu for some reason, and the kernel/initramfs.

@cgwalters
Member

And on a stock fedora-atomic:fedora/27/x86_64/atomic-host 27.61 (2018-01-17 15:52:47) with no packages layered, the ratio is 40/26030, also at 0.1%.

@cgwalters
Member

cgwalters commented Feb 13, 2018

OK so there's a higher level issue here too I just thought of: this code only has an effect for commits created locally (as we do when importing OCI). When we're pulling objects directly via libostree (ostree pull, "rpm-ostree jigdo"), the client won't get this data.

So we'd have to either add this to pull (which in archive fetches would double the number of http requests, probably a non-starter), or add it to some lookaside data (messy). We can include it in static deltas easily enough though, and it should be straightforward to teach jigdo how to do it too.

@cgwalters
Member

Though hmm...since we're computing checksums at pull time now anyways usually, we could just do two SHA-256 checksums simultaneously.
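
A minimal sketch of that idea, assuming the object header and payload are available as buffers (GLib's GChecksum makes it easy to feed the same bytes into two digests in one pass; the helper name is invented):

#include <glib.h>

static void
compute_object_and_payload_checksums (const guchar *header, gsize header_len,
                                      const guchar *payload, gsize payload_len,
                                      gchar **out_object_sha256,
                                      gchar **out_payload_sha256)
{
  GChecksum *object_csum = g_checksum_new (G_CHECKSUM_SHA256);
  GChecksum *payload_csum = g_checksum_new (G_CHECKSUM_SHA256);

  /* The full object checksum covers the header plus the payload... */
  g_checksum_update (object_csum, header, header_len);
  g_checksum_update (object_csum, payload, payload_len);
  /* ...while the payload checksum skips the header. */
  g_checksum_update (payload_csum, payload, payload_len);

  *out_object_sha256 = g_strdup (g_checksum_get_string (object_csum));
  *out_payload_sha256 = g_strdup (g_checksum_get_string (payload_csum));

  g_checksum_free (object_csum);
  g_checksum_free (payload_csum);
}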

giuseppe force-pushed the payload-link branch 4 times, most recently from 9105146 to 7b4815d on February 14, 2018
@giuseppe
Member Author

I checked "ostree pull" and this code path also creates the .payload-link. I see this as a local optimization that should not go on the wire. Also, since we allow changing the threshold for what file size must be considered, we would need the client to check whether a received .payload-link must be discarded, as it might point to a file bigger than `core.payload-link-threshold`.

I am still chasing down the failure in the CI, but in the meantime I've made some changes to the PR:

  1. If the symlink is dangling then it is unlinked (see the sketch below).
  2. I've implemented a lookup in the parent repository.
  3. At pull time we always check whether the link must be created, even for object files that are already present in the repository.
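
A rough sketch of what item 1 could look like (directory-fd handling is simplified and the helper name is made up; this is not the PR's actual code):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: if a *.payload-link symlink points at an object
 * that no longer exists, drop the link so it can be recreated later. */
static int
maybe_unlink_dangling_payload_link (int objects_dfd, const char *link_path)
{
  /* faccessat() follows the symlink, so a missing target reports ENOENT. */
  if (faccessat (objects_dfd, link_path, F_OK, 0) == 0)
    return 0;                  /* target still present, keep the link */
  if (errno != ENOENT)
    return -1;                 /* unexpected error */
  return unlinkat (objects_dfd, link_path, 0);
}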

giuseppe force-pushed the payload-link branch 2 times, most recently from 2bf8fea to 6242cc5 on February 15, 2018
@giuseppe
Member Author

tests pass again ✌️

file_input = (GInputStream*)checksum_input;
else
{
checksum_payload_input = ot_checksum_instream_new ((GInputStream*)checksum_input, G_CHECKSUM_SHA256);
Member

It's probably worth a comment here like:

/* The payload checksum-input reads from the full object checksum-input; this
 * means it skips the header.
 */

Member Author

added

@@ -233,6 +260,7 @@ repo_prune_internal (OstreeRepo *self,
g_autoptr(GHashTable) reachable_owned = g_hash_table_ref (options->reachable);
data.reachable = reachable_owned;


Member

Spurious extra newline?

Member Author

dropped in the new version

@cgwalters
Member

cgwalters commented Feb 16, 2018

I checked "ostree pull" and this code path also creates the .payload-link.

Yeah, I had it backwards; it's the "trusted" cases that aren't covered, basically whenever we aren't redoing the SHA-256. So for example ostree pull-local, which is what Anaconda uses in the default case, means that the default install won't have payload links. We'd have to teach the pull-local code to copy them.

giuseppe force-pushed the payload-link branch 4 times, most recently from 1dd6833 to 9995c8e on March 3, 2018
@rh-atomic-bot

☔ The latest upstream changes (presumably 733c049) made this pull request unmergeable. Please resolve the merge conflicts.

@giuseppe
Member Author

giuseppe commented Mar 5, 2018

@cgwalters are you fine with merging this?

@cgwalters
Member

The test is being skipped right now though: https://s3.amazonaws.com/aos-ci/ghprb/ostreedev/ostree/14c00f4e8527d8b81ffc6d06cca854b3fbd187e8.0.1520269933286777277/artifacts/gdtr-results/libostree_test-payload-link.sh.test.txt

I know our current testing setup is very confusing. I'm working on addressing that (well, some things will be more confusing, others less so) in #1462

Anyways so what you want is to add a case to tests/installed, which are basically pure shell scripts that run as root on a FAH VM and not in a container. I can probably take care of doing this if you prefer.

@cgwalters
Member

The "dev flow" I have for the tests/installed stuff is basically to hack in my dev container, then sync my git repo into the vm, and just run the script there.

it was removed with:

commit 8609cb0
Author: Colin Walters <walters@verbum.org>
Date:   Thu Apr 21 15:14:51 2016 -0400

    repo: Simplify internal has_object() lookup code

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

It will be used by successive commits to keep track of the payload
checksum for objects stored in the repository.

The goal is that files having the same payload but different xattrs can
take advantage of reflinks where supported.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Member Author

giuseppe commented Mar 5, 2018

Anyways so what you want is to add a case to tests/installed, which are basically pure shell scripts that run as root on a FAH VM and not in a container. I can probably take care of doing this if you prefer.

I've pushed a new version with the installed test script. I hope it will pass the CI :-)

giuseppe force-pushed the payload-link branch 4 times, most recently from c2ddd8f to e9ecbdc on March 7, 2018
When a new object is added to the repository, create a
$PAYLOAD-SHA256.payload-link symlink file as well.  The target of the
symlink is the checksum of the object that was added to the repository.

Whenever we add a new object file, in addition to looking up whether a
file with the same checksum is already present, we also check whether an
object with the same payload is in the repository.

If a file with the same payload is already present in the repository, we
copy it with `glnx_regfile_copy_bytes`, which internally attempts to
create a reflink (ioctl (..., FICLONE, ..)) to the target file if the
file system supports it.  This makes it possible to have objects that
share the payload but have different inodes and xattrs.

By default the payload-link-threshold value is G_MAXUINT64, which disables
the feature.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
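
To make the naming concrete, here is a rough, hypothetical sketch of creating such a link as a relative symlink inside the objects directory (the two-character fan-out layout mirrors how ostree lays out objects; the helper name and buffer sizes are illustrative only, not the PR's code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
create_payload_link (int objects_dfd, const char *payload_sha256,
                     const char *object_sha256)
{
  char link_name[96];
  char target[96];

  /* e.g. "ab/cdef....payload-link" pointing at "../12/3456....file" */
  snprintf (link_name, sizeof link_name, "%.2s/%s.payload-link",
            payload_sha256, payload_sha256 + 2);
  snprintf (target, sizeof target, "../%.2s/%s.file",
            object_sha256, object_sha256 + 2);

  /* EEXIST would just mean another commit already recorded this payload. */
  if (symlinkat (target, objects_dfd, link_name) < 0)
    return -1;
  return 0;
}
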
@giuseppe
Member Author

giuseppe commented Mar 7, 2018

@cgwalters finally managed to get all the tests happy again :-)

cd ${test_tmpdir}

touch a
if cp --reflink a b; then
Member

I'm debating a bit if we should instead just assert that this works; I mean if we give reflink=1 to XFS it had better support it. But eh. We can tweak that later.

@cgwalters
Member

Thanks for all of your work on this!

@rh-atomic-bot r+ 6994f22

@rh-atomic-bot

⚡ Test exempted: merge already tested.

rh-atomic-bot pushed a commit that referenced this pull request Mar 7, 2018

Closes: #1443
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this pull request Mar 7, 2018

Closes: #1443
Approved by: cgwalters
@alexlarsson
Member

This regressed flatpak, see #1524 for fix
