
Move and rename repos, upgrade to Catalyst 4, support SDK on arm64 #2093

Closed · wants to merge 6 commits

Conversation

@chewi (Contributor) commented Jul 5, 2024


Sorry for doing all this in one giant commit, but it was hard to separate it out. In fact, it was so big that it made the GitHub UI unresponsive, so I had to create this PR using the CLI tool!

We had no arm64 SDK, so some cross-compiling or emulation was most likely going to be needed to produce one. Catalyst 4 adds support for building with QEMU, so I looked into upgrading. This turned out to be very much slower than emulating the amd64 SDK on arm64, where an arm64 build could then be mostly run without emulation. We can't stay on Catalyst 3 forever though, so I continued with the upgrade.

Catalyst 4 has totally changed the way repositories are handled. It only works when the name of the directory containing the repository matches the configured name of that repository. This was not the case for us, with the coreos repository residing in the coreos-overlay directory. We wanted to move and rename our repositories anyway, so they are now known as gentoo-subset and flatcar-overlay, and they live under scripts/repos. Using the same name as upstream Gentoo would have been problematic, and just "flatcar" would have looked awkward in documentation.

Please see the commit messages for more detail. We will need some coordination to get new SDKs published once this is merged because the usual process won't work.

How to use

As you might expect, this is a breaking change for building the SDK, but building a new amd64 SDK with an existing amd64 SDK doesn't require much effort.

sudo ln -snf ../../mnt/host/source/src/scripts/repos/flatcar-overlay/profiles/coreos/amd64/sdk /etc/portage/make.profile
sudo tee /etc/portage/repos.conf/coreos.conf <<EOF
[DEFAULT]
main-repo = gentoo-subset

[flatcar-overlay]
location = /mnt/host/source/src/scripts/repos/flatcar-overlay

[gentoo-subset]
location = /mnt/host/source/src/scripts/repos/gentoo-subset
EOF

sudo emerge -av catalyst
sudo ./bootstrap_sdk

Obviously, we haven't published an arm64 SDK yet, but I can provide one if you want to test that.

Testing done

I've built an arm64 SDK from scratch, including the step to turn it into a Docker image, and built another arm64 SDK from that. I've also built a new amd64 SDK, using an existing 7-month-old amd64 SDK as a seed.

  • Changelog entries added in the respective changelog/ directory (user-facing change, bug fix, security fix, update)
  • Inspected CI output for image differences: /boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, etc.

chewi added 6 commits July 5, 2024 22:53
Sorry for doing all this in one giant commit, but it was hard to
separate it out.

We had no arm64 SDK, so some cross-compiling or emulation was most
likely going to be needed to produce one. Catalyst 4 adds support for
building with QEMU, so I looked into upgrading. This turned out to be
very much slower than emulating the amd64 SDK on arm64, where an arm64
build could then be mostly run without emulation. We can't stay on
Catalyst 3 forever though, so I continued with the upgrade.

Despite being slow, I have kept support for building with QEMU using
Catalyst since it requires little code and may be useful to somebody.

Catalyst 4 has totally changed the way repositories are handled. It only
works when the name of the directory containing the repository matches
the configured name of that repository. This was not the case for us,
with the coreos repository residing in the coreos-overlay directory. We
wanted to move and rename our repositories anyway, so they are now known
as gentoo-subset and flatcar-overlay, and they live under scripts/repos.
Using the same name as upstream Gentoo would have been problematic, and
just "flatcar" would have looked awkward in documentation.

Catalyst 4 also ingests the main repository snapshot as a squashfs
rather than a tarball. It features a utility to generate such a
snapshot, but it doesn't fit Flatcar well, particularly because it
expects each ebuild repository to reside at the top level of its own git
repository. It was very easy to call tar2sqfs manually though.

There were several places where we assumed that amd64 was native and
arm64 required emulation via QEMU. The scripts are now more
architecture-agnostic, paving the way for riscv support later.

We no longer set QEMU_LD_PREFIX because it prevents the SDK itself from
being emulated. It also assumes there is only one non-native target,
which won't be the case soon. bubblewrap does a better job of running
binaries under QEMU.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
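The snapshot step described in that commit message could look roughly like this. This is only a sketch: tar2sqfs comes from squashfs-tools-ng, but the repository path and snapshot destination shown here are illustrative, not necessarily what the build scripts use.

# Sketch: turn the main repo into the squashfs snapshot that Catalyst 4 ingests.
git -C /mnt/host/source/src/scripts/repos/gentoo-subset \
    archive --format=tar --prefix=gentoo-subset/ HEAD \
  | tar2sqfs /var/tmp/catalyst/snapshots/gentoo-subset-latest.sqfs

Likewise, the bubblewrap approach to running foreign binaries might look roughly like the following, assuming a statically linked qemu-aarch64 and an illustrative sysroot path:

# Sketch: run an arm64 binary from a sysroot under qemu-user via bubblewrap,
# with no QEMU_LD_PREFIX involved. All paths are illustrative, and
# qemu-aarch64 is assumed to be statically linked.
SYSROOT=/build/arm64-usr
bwrap --bind "${SYSROOT}" / \
      --ro-bind /usr/bin/qemu-aarch64 /usr/bin/qemu-aarch64 \
      --dev /dev --proc /proc \
      /usr/bin/qemu-aarch64 /usr/bin/true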
This hasn't been needed for a while, and it now breaks util-linux,
installing modules under /usr/lib64 when they should be under /usr/lib.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
This is what upstream Gentoo does. They would previously update the
entire seed, but this took a long time. Our seeds are much bigger, so we
kept repo snapshots to build stage1 against these instead. The new
method of only rebuilding packages with changed sub-slots is a good
compromise and removes the need to write stage1 hooks that selectively
catch the repository up.

This also avoids some conflicts by adding the `--ignore-world` option.
Gentoo seeds have nothing in @world. We have much more, but none of that
is needed for stage1.

This continues to exclude cross-*-cros-linux-gnu/* as that is not needed
for stage1. It now also excludes dev-lang/rust, because it is never a
DEPEND, so it would not break other packages in this way. It may fail to
run due to a sub-slot change in one of its own dependencies, but it is
also unlikely to be needed in stage1 and it is not configured to use the
system LLVM. If need be, we could improve the behaviour of Portage's
@changed-subslot to respect `--with-bdeps`.

In my testing, it was unable to handle an SDK from 17 months ago, but
one from 7 months ago did work. In practice, we will always use a much
more recent one, which is far more likely to work.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
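In emerge terms, the seed update described above amounts to something like the sketch below. The exact invocation lives in the stage1 logic; the options here are an approximation, not the literal command:

# Sketch: rebuild only the seed packages whose dependencies changed sub-slot,
# ignoring the seed's large @world set, which stage1 doesn't need.
emerge --verbose --ignore-world @changed-subslot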
From https://wiki.gentoo.org/wiki/Catalyst/Stage_Creation#Build_Stage3:

> It is not necessary to build stage2 in order to build stage3. Gentoo
> release engineering does not build stage2, and you should not need to
> unless you're intentionally building a stage2 as your goal.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
We stopped using profiles with a lib->lib64 symlink a while ago, so
there is no point in checking for this any more. We weren't checking
against the target SDK architecture anyway.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
We currently put an os-release symlink in lib64, but we shouldn't assume
that the architecture will even have a lib64 directory. I doubt this
compatibility symlink was needed anyway. Gentoo doesn't have one, and
applications are supposed to check /etc/os-release. I can find almost no
reference to /usr/lib64/os-release anywhere, let alone in Flatcar.

Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
@chewi requested a review from a team July 5, 2024 22:22
@chewi mentioned this pull request Jul 5, 2024
@ader1990 (Contributor) commented Jul 8, 2024

Hello @chewi, great work on the move and the ARM64 SDK.

I have tried to reproduce the build, but got some issues:

git clone https://github.com/flatcar/scripts -b chewi/repo-mv-catalyst4-arm64-sdk
cd scripts
./run_sdk_container -t
# a bunch of errors when starting the sdk container

# ../../mnt/host/source/src/scripts/repos/flatcar-overlay/profiles/coreos/amd64/sdk does not exist
# running cd .. first, or using the full path instead of a relative one, works:
# /mnt/host/source/src/scripts/repos/flatcar-overlay/profiles/coreos/amd64/sdk
sudo ln -snf ../../mnt/host/source/src/scripts/repos/flatcar-overlay/profiles/coreos/amd64/sdk /etc/portage/make.profile

When running the catalyst emerge step:

!!! Section 'coreos' in repos.conf has location attribute set to nonexistent directory: '/mnt/host/source/src/third_party/coreos-overlay'
!!! Section 'portage-stable' in repos.conf has location attribute set to nonexistent directory: '/mnt/host/source/src/third_party/portage-stable'
Unavailable repository 'portage-stable' referenced by masters entry in '/usr/local/portage/crossdev/metadata/layout.conf'
Unavailable repository 'coreos' referenced by masters entry in '/usr/local/portage/crossdev/metadata/layout.conf'
!!! Unable to parse profile: '/etc/portage/make.profile'
!!! ParseError: Parent 'gentoo-subset:default/linux/amd64/17.1/no-multilib/hardened' not found: '/mnt/host/source/src/scripts/repos/flatcar-overlay/profiles/coreos/amd64/parent'


!!! /etc/portage/make.profile is not a symlink and will probably prevent most merges.
!!! It should point into a profile within /profiles/
!!! (You can safely ignore this message when syncing. It's harmless.)


!!! Your current profile is invalid. If you have just changed your profile
!!! configuration, you should revert back to the previous configuration.
!!! Allowed actions are limited to --help, --info, --search, --sync, and
!!! --version.

What seed are you using? Maybe my recent AMD64 SDK flatcar-sdk-all-4019.0.0-nightly-20240702-2100_os-main-4019.0.0-nightly-20240702-2100 is too new?


@dongsupark added the main label Jul 8, 2024
@jepio (Member) commented Jul 8, 2024

> We will need some coordination to get new SDKs published once this is merged because the usual process won't work.

Why not? This is going to be a deal breaker - there must be a clean transition path from one sdk version to the next. update_chroot is a good spot to perform migrations.
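The kind of update_chroot migration suggested here could be as small as rewriting the section header in place. A sketch, assuming that renaming [coreos] to [coreos-overlay] is the only change needed:

# Sketch: hypothetical repos.conf migration for update_chroot.
conf=/etc/portage/repos.conf/coreos.conf
if grep -q '^\[coreos\]' "${conf}"; then
    sudo sed -i 's/^\[coreos\]$/[coreos-overlay]/' "${conf}"
fi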

This PR is a lot of changes squeezed together - my initial thought is that we need to split this into stages:

  1. repo migration (maybe first sdk lib/lib64 migration)
  2. catalyst 4
  3. arm64

> I've built an arm64 SDK from scratch, including the step to turn it into a Docker image, and built another arm64 SDK from that. I've also built a new amd64 SDK, using an existing 7-month-old amd64 SDK as a seed.

What did you use as seed for "from scratch"? How about finding a way to do it with https://alpha.release.flatcar-linux.net/arm64-usr/current/flatcar_developer_container.bin.bz2 which is based on flatcar production images but has a full toolchain + emerge.

@jepio (Member) left a comment

I'm unable to comment on the gigantic commit: the qemu-user emulation is used when cross-compiling (bootengine, and I'm sure some other packages, run compiled helpers), so QEMU_LD_PREFIX and binfmt still need to stay. amd64-to-arm64 cross-compilation is still going to have to be supported.

A Member commented:

Reading this, it is not obvious why this is no longer needed or how it breaks something. Which "modules" are installed in /usr/lib64? Why hasn't this been needed for a while? Error messages?

@chewi (Author) replied:

I do want the cross-compilation to stay, but I wasn't aware that it relied on QEMU like this. We'd have to find a different way of doing it. I've been brewing up an eclass to help with cases like this, so that's one option.

@chewi (Author) replied:

Regarding the Python modules: The hack is not needed now because the way Gentoo handles Python modules has totally changed since this was written. This was largely driven by PEP 517, which made it much easier for Gentoo to cross-compile Python modules. The util-linux package installs the libmount module. It's only now needed because of Catalyst 4. We previously disabled the python USE flag.
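For reference, the earlier workaround amounts to a one-line USE override along these lines; this is a sketch, and the overlay may have disabled the flag through a different mechanism:

# Sketch: a package.use entry keeping the python binding out of util-linux.
sys-apps/util-linux -python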

@jepio (Member) commented Jul 8, 2024

> 1. repo migration (maybe first sdk lib/lib64 migration)

https://github.com/flatcar/flatcar-dev-util/blob/flatcar-master/emerge-gitclone relies on the repo paths and names - this is used by the devcontainer.

In phase 1 we don't necessarily need to move or rename the repos, right? We just need to align metadata with the existing directory name.
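Aligning the metadata with the existing directory name would be a one-line change per repository, sketched here assuming the standard Portage location for the setting:

# Sketch: coreos-overlay/metadata/layout.conf. repo-name overrides
# profiles/repo_name, making the configured name match the directory
# without moving anything.
repo-name = coreos-overlay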

@ader1990 (Contributor) commented Jul 8, 2024

From my experience, when we were doing this kind of heavy PR, we would usually create multiple PRs that each did only one thing, without any of those PRs being functional per se, and then have one meta-PR that contained all of them, so the full functionality could be tested. Once the independent PRs were reviewed, commented on, and agreed upon in isolation, and the meta-PR was tested, we merged the independent PRs at the same time, in the correct order, and closed the meta-PR. Thus the comments, reviews, and updates were trackable in isolation, and we had the full view in the meta-PR.

@chewi (Author) commented Jul 8, 2024

@ader1990 I just tried my steps in 4020.0.0+nightly-20240703-2100 and it worked fine, but I was previously using a much older version. I'll give you a call later to see what's up.

@chewi (Author) commented Jul 8, 2024

> What did you use as seed for "from scratch"?

I used my cross-boss project to cross-compile @system (effectively stage3) using Flatcar's repos from the amd64 SDK to arm64. I had to hack up some symlinks so that aarch64-cros-linux-gnu became aarch64-unknown-linux-gnu, although I could have just built a whole new toolchain. I could have even done this on plain Gentoo rather than in the SDK. cross-boss usually completes this in one pass, but our USE flags meant there were some conflicts that I had to work through manually. That was simply a case of running (cb-)emerge for some package with some flag disabled and then getting cross-boss to continue from where it left off. I also needed to add git, which isn't part of @system.

With the other changes I've made, using the dev container might just work now, but I haven't looked at it yet. I suspect it has the same "cros" toolchain?

@ader1990 (Contributor) commented Jul 9, 2024

Hello,

I could reproduce a successful AMD64 SDK build and a Flatcar image creation with that SDK.
The workflow is as follows:

# First step, clone this branch and enter the latest sdk container
# Step done on the AMD64 host
git clone https://github.com/flatcar/scripts -b chewi/repo-mv-catalyst4-arm64-sdk
cd scripts
./run_sdk_container -t

# Second step, make sure that the gentoo portage profiles are properly set
# Step done on the SDK container
sudo ln -snf  /mnt/host/source/src/scripts/repos/flatcar-overlay/profiles/coreos/amd64/sdk /etc/portage/make.profile

sudo tee /etc/portage/repos.conf/coreos.conf <<EOF
[DEFAULT]
main-repo = gentoo-subset

[flatcar-overlay]
location = /mnt/host/source/src/scripts/repos/flatcar-overlay

[gentoo-subset]
location = /mnt/host/source/src/scripts/repos/gentoo-subset
EOF

# Third step, emerge catalyst and bootstrap the SDK
# Step done on the SDK container
sudo emerge -av catalyst
sudo ./bootstrap_sdk

# Once the bootstrap is complete, a tar.bz2 artifact should be present here:
# /mnt/host/source/src/build/catalyst/builds/flatcar-sdk/flatcar-sdk-amd64-*tar.bz2

# 4th step, copy the artifact in the scripts folder (cwd) and exit the initial SDK container
sudo cp  /mnt/host/source/src/build/catalyst/builds/flatcar-sdk/flatcar-sdk-amd64-*tar.bz2 .
exit

# 5th step, build the Docker image of that SDK
# Step done on the AMD64 host
./build_sdk_container_image flatcar-sdk-amd64-*.tar.bz2
# After the build completes, you should see 3 new docker images, with suffix amd64/arm64 and all
# docker image ls
# REPOSITORY                          TAG                                            IMAGE ID       CREATED        SIZE
# ghcr.io/flatcar/flatcar-sdk-amd64   4020.0.0-nightly-20240703-2100-6-ga344b7edca   ada50a52899c   12 hours ago   6.05GB
# ...

# 6th step, enter the new SDK container 
# Step done on the AMD64 host
./run_sdk_container -t -C ghcr.io/flatcar/flatcar-sdk-amd64:4020.0.0-nightly-20240703-2100-6-ga344b7edca -n test-amd64-new-sdk -a amd64

# 7th step, build_packages, build_image, image_to_vm, boot the image on qemu to make sure it works
# Step done on the new SDK container
./build_packages
./build_image --image_compression_formats none
./image_to_vm.sh --from=../build/images/amd64-usr/developer-4020.0.0+nightly-20240703-2100-6-ga344b7edca-a1 --board=amd64-usr --image_compression_formats none
cd ../build/images/amd64-usr/developer-4020.0.0+nightly-20240703-2100-6-ga344b7edca-a1/
sudo bash ./flatcar_production_qemu_uefi_secure.sh -nographic

# Make sure that the QEMU VM autologins with user core and `systemctl status --failed` returns an empty response.

@chewi (Author) commented Jul 9, 2024

I've prepared a branch with just the Catalyst 4 upgrade and am testing it out. It still requires two manual adjustments though.

One can be avoided by also taking the change to not use snapshots in stage1. This isn't a very large change.

The other is the [coreos] in /etc/portage/repos.conf/coreos.conf changing to [coreos-overlay]. There is an aliases setting you can put in the repo's layout.conf, and I had hoped that would prevent the need for this adjustment, but Portage doesn't respect it in this context. I have tweaked Portage to make that work though, so I'll submit that upstream, and then maybe we can make this smoother, but we'll have to wait for the Portage update to hit the next SDK before we can proceed further.
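The aliases setting referred to here lives in the repository's metadata/layout.conf. A sketch of the idea, with assumed names:

# Sketch: repos/flatcar-overlay/metadata/layout.conf
repo-name = flatcar-overlay
# Intended to let configs that still say "coreos" resolve to this repo:
aliases = coreos coreos-overlay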

@jepio (Member) commented Jul 9, 2024

> The other is the [coreos] in /etc/portage/repos.conf/coreos.conf changing to [coreos-overlay]. There is an aliases setting you can put in the repo's layout.conf, and I had hoped that would prevent the need for this adjustment, but Portage doesn't respect it in this context. I have tweaked Portage to make that work though, so I'll submit that upstream, and then maybe we can make this smoother, but we'll have to wait for the Portage update to hit the next SDK before we can proceed further.

What's the issue with updating coreos.conf to say [coreos-overlay]?

@chewi (Author) commented Jul 9, 2024

If you mean scripting that up for a clean transition, then yeah, we could do that, but I like to avoid short-term fix-ups and this isn't a bad thing to add to Portage anyway.

@jepio (Member) commented Jul 9, 2024

> If you mean scripting that up for a clean transition, then yeah, we could do that, but I like to avoid short-term fix-ups and this isn't a bad thing to add to Portage anyway.

As far as I can tell, this file is generated while catalyst runs and is (or can be) regenerated when entering the SDK. What am I missing?

@chewi (Author) commented Jul 9, 2024

The SDK's repo config needs to be correct in order to update Catalyst, although that in itself is currently a manual step. That aside, if you don't also apply the two lib64 symlink commits, then bootstrap_sdk dies almost immediately. Even after that, you still get a couple of warnings about it, although they're possibly benign.

@ader1990 (Contributor) commented Jul 10, 2024

How to reproduce the ARM64 build on an ARM64 host

Pre-requisites:

  • Get an ARM64 SDK docker image from @chewi (how it was created will be explained in detail).
  • Install Flatcar on an ARM64 box and add the kubernetes sysext (or docker + containerd).
# load the ARM64 SDK docker image
# Step run on the Flatcar ARM64 host
docker load < magic-image.tar.gz

docker image ls
# Step run on the Flatcar ARM64 host
# ghcr.io/flatcar/flatcar-sdk-all   4006.0.0-nightly-20240619-2100-28-g922276c37f   857f97cb44b6   4 days ago     7.5GB

# enter the SDK container
# Step run on the Flatcar ARM64 host
./run_sdk_container -n arm64sdkv1 -t -C ghcr.io/flatcar/flatcar-sdk-all:4006.0.0-nightly-20240619-2100-28-g922276c37f

# build packages / image / vm image and run vm
# Step run on the SDK ARM64 container
./build_packages --board arm64-usr
./build_image --board arm64-usr
./image_to_vm.sh --from <image dir>
./flatcar_production_qemu_uefi.sh -nographic

@chewi (Author) commented Jul 10, 2024

Ahem, you just pinged somebody else there. 😅

@ader1990 (Contributor) replied:

> Ahem, you just pinged somebody else there. 😅

Copy-paste from notepad did not work so well, sorry.

@chewi (Author) commented Jul 10, 2024

My Portage fix has now been merged. I'm one of the maintainers, so I can cut a release. There's just one other change we'd like to get in first.

If I then create a new PR here to take that release, it will presumably then make it into nightly SDK builds. Is that good enough to ensure a clean transition, or would it also need to hit the release channels first? It would be helpful to understand this process when making other fixes in future.

@jepio (Member) commented Jul 11, 2024

> If I then create a new PR here to take that release, it will presumably then make it into nightly SDK builds. Is that good enough to ensure a clean transition, or would it also need to hit the release channels first? It would be helpful to understand this process when making other fixes in future.

The next release is built with the previous release's SDK as a seed, so that's the scenario that needs to be tested in the CI. The nightly SDK build follows that process, but getting something into the nightly SDK only lets you depend on it for package builds. We have maintainer docs here: https://github.com/flatcar/flatcar-maintainer-private/blob/main/documentation/maintenance/release.md.

As for the "clean transition" - it depends. I'm not sure which changes you want to split out - it'll be better to discuss this once you open the PR.

With my rough understanding of what your intention is: if you enable updating the seed in catalyst and change the (generated) repos.conf definition to one that the seed's portage doesn't understand, then I expect that to fail. That's why I suggested not going down that road and updating the section in the config to match the folder name instead.

@jepio (Member) commented Jul 11, 2024

> How to reproduce the ARM64 build on an ARM64 host
>
> Pre-requisites:
>
> • Get an ARM64 SDK docker image from @chewi (how it was created will be explained in detail).

We will want the steps that produce the initial arm64 seed to be merged into this repo, and to set up a Jenkins job for it.

@chewi (Author) commented Jul 11, 2024

> The next release is built with the previous release's SDK as a seed, so that's the scenario that needs to be tested in the CI.

Okay, but which SDK version actually kicks off the build? Is it the same as the version used for the seed?

@chewi (Author) commented Jul 11, 2024

> Okay, but which SDK version actually kicks off the build? Is it the same as the version used for the seed?

At least for the Jenkins sdk pipeline, it seems to use the same version as the seed.

In any case, I've decided not to wait for the Portage fix to land. I've now created #2115, which automatically fixes up the repo name when the SDK starts and upgrades Catalyst when the SDK build starts.

@chewi (Author) commented Jul 15, 2024

#2115 has been merged and #2121 has been created for the repo move. I'll close this now.

@chewi closed this Jul 15, 2024
@chewi deleted the chewi/repo-mv-catalyst4-arm64-sdk branch July 15, 2024 15:24