Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.11] Backports for RHCOS CI (--transient option & test entrypoint) #2963

Merged
merged 15 commits into from
Jul 7, 2022

Conversation

cgwalters and others added 6 commits July 6, 2022 14:24
We go to a lot of effort to create various caches of builds, and this
is very useful for local incremental development to avoid re-downloading
RPMs and rewriting data.

It is useless for CI builds that discard everything when they're done.
Add a `cosa init --transient` flag that for now just disables `fsync()`
on `tmp/repo`.

There's more we can do here in the future, for example we need to
propagate this too into `cache/repo-build` and `cache/pkgcache-repo`,
and actually in general we don't need the pkgcache repo at all here.
To make it clearer that things worked.
Followup to the addition of `--transient` for CI flows.  If
we had coreos/rpm-ostree#3719
this would mostly be unnecessary, but today in CI and prod builds,
ostree is invoking `fsync()` inside the supermin VM, and then
qemu in the container is going to `fsync()` all changes down to
the host system (overlay)fs.

For transient flows, switch to `cache=unsafe` so we stop doing
that which should greatly help speed.
So we test it, and so we also gain the speed benefits.
Hopefully will fix the `ci/prow/rhcos` context here.
Use a COSA specifc test entry point to focus on tests relevant for COSA.
Copy link
Member

@miabbott miabbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cgwalters
Copy link
Member

Looks like CI here hit #2946 - let's backport that too?

cgwalters and others added 9 commits July 7, 2022 13:43
This is part of coreos/fedora-coreos-tracker#1151

Our generated disk images are largely just a "shell" around the egg
of an ostree commit.  There is almost nothing that lives
in the disk image that isn't in the commit.

(This is especially true now that a preparatory commit previous to
 this moved the *content* of our static `grub.cfg` into `image.json`)

In the original coreos-assembler design I'd tried to cleanly
separate builds of the ostree from disk image builds, but also
support linking them together (with matching version numbers, etc.)
The separate `image.yaml` was part of this.  This...mostly worked.

This change furthers that separation by having image builds input from
*just the ostree commit*.  Crucially we would no longer need the config
git repository to perform an image build.

And this in turn unlocks truly better separating ostree builds from
disk image builds in the pipeline *and* supporting
downstream tooling generating disk images from custom containers.

One neat thing here is we will finally fix a longstanding issue
where coreos-assembler fails when just the `image.yaml` changes:

Closes: coreos#972
For now, we need to handle this.  But I think we can switch
back to hard requiring it later, which will happen naturally
when we rework the build system to more cleanly separate
ostree container builds and disk image builds.

Closes: coreos#2885
If the cache was nuked, we should re-download everything even if
nothing changed since the last build. Otherwise, it's impossible to do a
`cosa build --force` afterwards.
This inlines back `impl_rpmostree_compose` in the two places where it's
called.

Prep for future patch.
Now that the variable is declared publicly, we can just add it to the
list of rpm-ostree args as part of the compose helper.

This has no functional change in the `cosa build` path. In the
`cosa fetch` path, this will add the switch but rpm-ostree will ignore
it since we're not actually writing a commit.
All the other overlays have the prefix `overlay/`. Do that here too for
consistency.
Instead of having rpm-ostree effectively doing a `pull-local` under the
hood over 9p, change things so we compose in a local repo, export the
commit to an archive and only copy *that* over 9p.

This should greatly help with pipelines hitting ENOMEM due to
transferring many small files over 9p:

openshift/os#594

An initial approach exported to OCI archive instead, but encapsulating
and unencapsulating are more expensive operations. Unfortunately, we
still need to unpack into `tmp/repo` given that many follow-up commands
expect the commit to be available in `tmp/repo`. If we can drop that
assumption (and get rid of `tmp/repo` entirely), we could refactor
things further so that `compose.sh` creates the final chunked OCI
artifact upfront.

In local testing, this adds about 30s to `cosa build`. We still compress
just once, and we get hardlinks pulling out of the tarball.
@travier
Copy link
Member Author

travier commented Jul 7, 2022

@travier
Copy link
Member Author

travier commented Jul 7, 2022

Tagging @jlebon as I'm not sure this is complete

@miabbott
Copy link
Member

miabbott commented Jul 7, 2022

This is starting to look like a big backport.

Maybe do the 9p changes as a separate backport?

@cgwalters
Copy link
Member

We could also force reset the branch to current main...a similar debate in coreos/rpm-ostree#3819

Maybe do the 9p changes as a separate backport?

No objection from me, but it also seems to me we need the 9p fixes.

CI here failed on:

 --- FAIL: ext.config.shared.root-reprovision.raid1 (1073.08s)
        harness.go:103: TIMEOUT[15m0s]: ssh: journalctl -t kola-runext-test.sh
--- FAIL: coreos.boot-mirror (981.98s)
        boot-mirror.go:105: machine "3a25a671-62b5-4fa2-ba44-0f8f0880fdbc" failed to start: ssh journalctl failed: time limit exceeded 

Hmm, though it seems like they succeeded in a retest?

/retest

@travier
Copy link
Member Author

travier commented Jul 7, 2022

I'm leaning toward a force reset too to the first commit that has all the changes that we need.

@miabbott
Copy link
Member

miabbott commented Jul 7, 2022

I'm leaning toward a force reset too to the first commit that has all the changes that we need.

I'm really gun shy about doing a force reset on a branch that should be in code freeze. We'd need to do incredibly thorough testing/analysis to ensure we don't introduce any unwanted changes.

@cgwalters
Copy link
Member

Yeah, let's get this merged now since we have green CI. We can choose to do a force push later too.

@cgwalters cgwalters merged commit 8c6da5b into coreos:rhcos-4.11 Jul 7, 2022
@travier travier deleted the rhcos-4.11-ci-bp branch July 11, 2022 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants