Using historical build information for encapsulation #4247

Closed
2 of 3 tasks
RishabhSaini opened this issue Jan 4, 2023 · 10 comments · Fixed by #4271

@RishabhSaini
Contributor

RishabhSaini commented Jan 4, 2023

Part of 4012 Solution 0
Here we take external data using historical build information and the external process decides on the chunking.

@RishabhSaini RishabhSaini self-assigned this Jan 4, 2023
@RishabhSaini RishabhSaini added the jira for syncing to jira label Jan 4, 2023
@RishabhSaini
Contributor Author

RishabhSaini commented Jan 10, 2023

I would need to read the packaging structure of previous builds to prevent a major packaging structure change.
This metadata about content of each layer can go in oci.layers annotation in the manifest.

Since the previous builds (latest - 1, latest - 2) don't necessarily have tags attached to them in quay, I could ask the user to provide the versions (latest - 1 or latest - 2). But this alone does not satisfy the need to query the FCOS release browser, because I would still need to use container-img-proxy to open the image and get the manifest. Since c/img-proxy forks skopeo as a process, skopeo needs to be able to understand the fcos-release-browser versions.

This is potentially blocked on fedora-coreos-tracker/issues/#1367.
Until then I could just take the latest build.

Is there a potentially better way of getting the manifests of older builds from skopeo?

@cgwalters
Member

This topic heavily relates to coreos/fedora-coreos-tracker#1367

First though, I think this technology should not be specific to Fedora CoreOS. At least, not at a low level.

So my strawman proposal would look something like this

$ rpm-ostree compose container-encapsulate --previous-build quay.io/fedora/fedora-coreos@sha256:abc --previous-build quay.io/fedora/fedora-coreos@sha256:123 ...

where basically a set of direct previous builds are provided, and we aim to optimize the delta from those.

In Fedora CoreOS we've commonly wrapped rpm-ostree compose functionality inside coreos-assembler - so that would be the place to have something that scraped FCOS/cosa build metadata and generated the container image arguments passed to --previous-build or so.
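
As a rough illustration of the wrapper side, here is a minimal sketch of how a tool such as coreos-assembler might turn a list of previous image references into arguments. The `--previous-build` flag is only the strawman from this comment, not an existing option, and `previous_build_args` is a hypothetical helper name.

```rust
/// Build the argument list for the strawman invocation above.
/// `--previous-build` is a proposed flag, not one rpm-ostree is known to accept.
pub fn previous_build_args(image_refs: &[&str]) -> Vec<String> {
    let mut args = vec![
        "compose".to_string(),
        "container-encapsulate".to_string(),
    ];
    for image_ref in image_refs {
        // One flag per previous build, in the order the caller supplied them.
        args.push("--previous-build".to_string());
        args.push(image_ref.to_string());
    }
    args
}
```

The remaining arguments of the real command (the `...` in the strawman) would be appended by the caller.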

@RishabhSaini
Contributor Author

RishabhSaini commented Jan 20, 2023

Upon encapsulating a new rpm-ostree commit, there are four possible changes that can happen to the underlying packages:

  • The package remains unchanged
  • The package (epoch, version) changes (updates or downgrades)
  • The package gets completely deleted
  • A new package gets added
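
The four cases above can be sketched as a small classification over two package maps. This is only an illustration: it assumes each build is represented as a map from package name to a version string, and the names `PackageChange` and `classify` are hypothetical, not part of rpm-ostree or ostree-rs-ext.

```rust
use std::collections::BTreeMap;

/// How a single package differs between the previous and the current build.
#[derive(Debug, PartialEq, Eq)]
pub enum PackageChange {
    Unchanged,
    /// The version string differs (update or downgrade).
    Changed { from: String, to: String },
    Removed,
    Added,
}

/// Compare two name -> version maps and classify every package seen in either.
pub fn classify(
    previous: &BTreeMap<String, String>,
    current: &BTreeMap<String, String>,
) -> BTreeMap<String, PackageChange> {
    let mut out = BTreeMap::new();
    for (name, old) in previous {
        let change = match current.get(name) {
            None => PackageChange::Removed,
            Some(new) if new == old => PackageChange::Unchanged,
            Some(new) => PackageChange::Changed {
                from: old.clone(),
                to: new.clone(),
            },
        };
        out.insert(name.clone(), change);
    }
    // Anything present only in the current build is a new package.
    for name in current.keys() {
        if !previous.contains_key(name) {
            out.insert(name.clone(), PackageChange::Added);
        }
    }
    out
}
```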

This will influence the structure of the way layers of the image are packed:

  • Layer 1: Contains the original ostree-commit
  • Layer 2 to n-1: Contains all the rpms, initramfs, rpmostree_unpackaged_content
  • Layer n: I am proposing this contains any new packages added to the rpm-ostree commit

This is because the most common changes that happen between builds are:

  • package updates/downgrades
    As expected, a package update should cause podman to download the entire layer
  • new packages getting added
    However, a new package getting added should not cause a layer to be redownloaded + new_pkg. Therefore, new
    packages should not be placed in one of the previous bins and instead be placed in the last bin.

This last bin will be emptied whenever the bodhi-scraper triggers a new packing structure, which happens every major release. This way the most optimal packing structure can be recomputed every major release, and between two major releases the packing structure stays unchanged while new packages are dumped into the last bin.
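
To make the proposed layout concrete, here is a minimal sketch of the "keep the old bins, append new packages to the last bin" idea. It assumes the fixed layout is available as a list of package-name lists (one per layer, excluding the trailing bin) and ignores layer sizes entirely; `repack` is a hypothetical name, not an existing function.

```rust
use std::collections::BTreeSet;

/// Reuse the fixed layer layout; packages it does not know about go to one
/// trailing bin. Removed packages simply drop out of their old bin.
pub fn repack(fixed_bins: &[Vec<String>], current: &BTreeSet<String>) -> Vec<Vec<String>> {
    let mut placed: BTreeSet<String> = BTreeSet::new();
    let mut bins: Vec<Vec<String>> = Vec::with_capacity(fixed_bins.len() + 1);
    for bin in fixed_bins {
        // Keep only packages that still exist in the current build.
        let kept: Vec<String> = bin
            .iter()
            .filter(|pkg| current.contains(*pkg))
            .cloned()
            .collect();
        placed.extend(kept.iter().cloned());
        // Empty bins are kept so that layer positions stay stable.
        bins.push(kept);
    }
    // Everything not covered by the fixed layout lands in the last bin.
    let last: Vec<String> = current.difference(&placed).cloned().collect();
    bins.push(last);
    bins
}
```

Because the fixed bins never gain members, adding a package only changes the trailing layer, which is the property argued for above.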

@RishabhSaini
Contributor Author

To implement the functionality above, I would need to parse nevra of packages in ostree-rs-ext. There exists hy_split_nevra in rpm-ostree. How would I implement this in ostree-rs-ext?

@cgwalters
Member

However, a new package getting added should not cause a layer to be redownloaded + new_pkg. Therefore, new
packages should not be placed in one of the previous bins and instead be placed in the last bin.

Hmmm. Two things:

  • That "last bin" also means that every other layer is a bit bigger - hence it increases the "size amplification" effect and I think we need to weigh against that.
  • Also if the "last bin" is going to become non-empty often (and while adding/removing packages doesn't happen often, it definitely does) that means we also commonly need to consider that case too...and also think about "degenerate" cases when a newly added package might change often (which seems like it may correlate) and there's other new packages in the bin.

@cgwalters
Member

To implement the functionality above, I would need to parse nevra of packages in ostree-rs-ext. There exists hy_split_nevra in rpm-ostree. How would I implement this in ostree-rs-ext?

I think we can split it...how about starting like this?

diff --git a/lib/src/objectsource.rs b/lib/src/objectsource.rs
index 96d87e5..e032d62 100644
--- a/lib/src/objectsource.rs
+++ b/lib/src/objectsource.rs
@@ -39,6 +39,9 @@ pub struct ObjectSourceMeta {
     /// Unique identifier, does not need to be human readable, but can be.
     #[serde(with = "rcstr_serialize")]
     pub identifier: ContentID,
+    /// Unique version identifier, does not need to be human readable, but can be.
+    #[serde(with = "rcstr_serialize")]
+    pub version: Rc<str>,
     /// Identifier for this source (e.g. package name-version, git repo).
     /// Unlike the [`ContentID`], this should be human readable.
     #[serde(with = "rcstr_serialize")]

Basically identifier becomes name, then version becomes the rest (epoch, version, release, architecture)?

Then the rest of the code needs to consider version too.
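
A minimal sketch of the split suggested above, assuming input of the form `name-[epoch:]version-release.arch`. It relies on the RPM rule that version and release cannot contain `-`, so the name ends at the second-to-last dash. `split_name_and_version` is a hypothetical helper, not a port of hy_split_nevra, and it does not validate the individual fields.

```rust
/// Split a NEVRA string into (name, rest), where `rest` is
/// `[epoch:]version-release.arch`.
/// Returns None if the string has fewer than two '-' separators.
pub fn split_name_and_version(nevra: &str) -> Option<(&str, &str)> {
    // The last '-' separates version from release.arch.
    let release_dash = nevra.rfind('-')?;
    // The '-' before that separates name from [epoch:]version.
    let version_dash = nevra[..release_dash].rfind('-')?;
    let name = &nevra[..version_dash];
    let rest = &nevra[version_dash + 1..];
    if name.is_empty() || rest.is_empty() {
        return None;
    }
    Some((name, rest))
}
```

With this split, `name` would feed the identifier and `rest` the new version field from the diff.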

RishabhSaini added a commit to RishabhSaini/rpm-ostree that referenced this issue Feb 9, 2023
This is a prep PR to completing coreos#4247.
It allows one to diff the layers of current encapsulated build to any other build.
@RishabhSaini
Contributor Author

RishabhSaini commented Mar 14, 2023

High Level Design of the data flow:

There are two "sources" to get metadata about package updates from:

  • Bodhi (for FCOS based packages)
  • Errata (for RHCOS based packages)

These sources will be modified to expose either an API or a file (frequencyupdateinfometadata.json) in their repodata that contains the metadata about the updates. This metadata has exactly the same format as the result from the Query API. It lists all updates in the current and pending release states that have a stable update status, with fields reduced to those needed by the post-processing. A time-based cache will be inserted in this layer to reduce latency.

The metadata generated from Bodhi and Errata has the same format, so the same post-processor can consume it and generate an updates_frequency file for each of them. The updates_frequency_fcos.json and updates_frequency_rhcos.json files will then be committed to fcos-config and rhcos-config respectively.

Either rpm-ostree or COSA can then fetch and read this file and use it as a heuristic for encapsulation.

@travier
Member

travier commented Apr 11, 2023

Character 'F' specifies FCOS specific task
Character 'R' specifies RHCOS specific task

From my perspective, this list is hard to read and would benefit from a un-de-duplication to be more explicit about the steps and difference between RHCOS & FCOS. Note that we also have SCOS.

@RishabhSaini
Contributor Author

Character 'F' specifies FCOS specific task
Character 'R' specifies RHCOS specific task

From my perspective, this list is hard to read and would benefit from a un-de-duplication to be more explicit about the steps and difference between RHCOS & FCOS. Note that we also have SCOS.

Ok updated

@RishabhSaini
Contributor Author

RishabhSaini commented Apr 17, 2023

After consultation with the libdnf team, a new approach for getting the list of updates made to the packages was considered. This involves using the RPM tag data API to read the changelog timestamps from the rpm header data. This array of timestamps "record the changes that have happened to the package between different Version or Release builds". Hence they are the build times of each update.

Cons of this approach:

  • Not all builds are necessarily shipped, so build timestamps don't accurately represent the frequency of updates to the packages in an rpm-ostree based OS.

Pros:

  • Frequencies of the builds of packages do in fact match their releases in Bodhi and Errata, with the timestamps being off by only 15-30 days. Hence it might even help rpm-ostree prepare for an incoming shift in the update frequency of packages and pack optimally in advance.
  • This method avoids having to create APIs in Bodhi and Errata to get a comprehensive list of updates, processing them, and consuming them through rpmmds and config files in FCOS, SCOS, and RHCOS respectively.
  • The solution is also future proof, as it is not OS specific and works with any rpm-ostree based OS.
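
The changelog-timestamp heuristic discussed above could be sketched roughly as follows. This assumes the timestamps (for example the values of the RPMTAG_CHANGELOGTIME header tag) have already been read into plain Unix-second values; reading the rpm header itself is out of scope here. The function names and the fixed look-back window are illustrative assumptions, not an agreed design.

```rust
/// Number of changelog entries whose timestamp falls within the last
/// `window_secs` seconds before `now` (all values are Unix seconds).
pub fn updates_in_window(changelog_times: &[u64], now: u64, window_secs: u64) -> usize {
    let start = now.saturating_sub(window_secs);
    changelog_times
        .iter()
        .filter(|&&t| t >= start && t <= now)
        .count()
}

/// Rank packages from most to least frequently updated within the window.
/// Ties are broken by package name so the result is deterministic.
pub fn rank_by_frequency(
    packages: &[(&str, Vec<u64>)],
    now: u64,
    window_secs: u64,
) -> Vec<(String, usize)> {
    let mut ranked: Vec<(String, usize)> = packages
        .iter()
        .map(|(name, times)| (name.to_string(), updates_in_window(times, now, window_secs)))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

A packer could then place the highest-ranked packages in their own layers and group rarely changing ones together, though how the ranking maps to layers is left open in this thread.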

Hence, discarding all of the OS specific efforts in packaging:

Tasks are listed from a top down approach for each intended rpm-ostree based OS

FCOS:

RHCOS:

  • Modify API to give comprehensive list of updates in Errata: TBD
  • Process the updates to make frequencyinfo.json in bodhi-scraper: TBD
  • Place frequencyinfo in rhcos-config: TBD

SCOS: Workflow TBD
