cluster installation fails using agent-based installer #1608

Closed
alexk201 opened this issue May 20, 2023 · 43 comments

@alexk201

Describe the bug
I am unable to install OKD on a VM with the agent-based installer. During the installation, I always receive the following error:

INFO Host okd-sno: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
INFO Cluster installation in progress
INFO Host okd-sno: updated status from preparing-successful to installing (Installation is in progress)
INFO Host: okd-sno, reached installation stage Installing: bootstrap
INFO Host: okd-sno, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/okd/scos-content@sha256:4e5abfb4ca9de3c43f6a724f489f34bb3e53ff932e1f0887b9489303e98c88b8 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists"
INFO Cluster has hosts in error
INFO cluster has stopped installing... working to recover installation

Since I can monitor the installation with openshift-install, SSH into the VM, etc., I find it unlikely to be a configuration error, but maybe I'm wrong here. I can confirm that the file "/run/ostree/auth.json" is created moments before the error occurs. The error is also reproducible when running the command manually on the VM:

$ sudo podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/okd/scos-content@sha256:4e5abfb4ca9de3c43f6a724f489f34bb3e53ff932e1f0887b9489303e98c88b8 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot
I0520 19:47:26.710198   18107 start.go:96] Version: machine-config-daemon-4.6.0-202006240615.p0-2008-g70aa0a56-dirty (70aa0a560c0b0a01093f695cb358a8749d30b3d2)
I0520 19:47:26.710219   18107 start.go:109] Calling chroot("/rootfs")
F0520 19:47:26.710595   18107 start.go:137] Failed to initialize single run daemon: error initializing rpm-ostree: Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists

When I manually delete the file and run the command again, it initially appears to work but then fails with a different error:

[...]
I0520 19:48:31.938276   18522 update.go:1484] Preset systemd unit zincati.service
I0520 19:48:31.938316   18522 file_writers.go:223] Writing systemd unit "install-to-disk.service"
F0520 19:48:31.947116   18522 start.go:145] error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.

Version
I tried multiple releases, from 4.12.0-0.okd-2023-03-18-084815 up to 4.13.0-0.okd-scos-2023-05-04-192252 (what does the "scos" stand for, by the way?)

How reproducible
100% reproducible

@alexk201
Author

I don't think it's related to my problem but I get the following when SSHing into the VM:

[systemd]
Failed Units: 1
  selinux.service
[core@okd-sno ~]$ journalctl -b -u selinux.service
May 20 21:17:05 localhost systemd[1]: Starting selinux.service...
May 20 21:17:05 localhost systemd[1152]: selinux.service: Failed to locate executable checkmodule: No such file or directory
May 20 21:17:05 localhost systemd[1152]: selinux.service: Failed at step EXEC spawning checkmodule: No such file or directory
May 20 21:17:05 localhost systemd[1]: selinux.service: Control process exited, code=exited, status=203/EXEC
May 20 21:17:05 localhost systemd[1]: selinux.service: Failed with result 'exit-code'.
May 20 21:17:05 localhost systemd[1]: Failed to start selinux.service.

@vrutkovs
Member

Which FCOS version are you using for the discovery ISO? See the recommended values at https://github.com/openshift/assisted-service/blob/master/deploy/podman/okd-configmap.yml#L30-L31

I don't think it's related to my problem but I get the following when SSHing into the VM:

This is known (might be worth a separate bug): assisted-installer tries to set custom SELinux rules, but it needs the "checkmodule" binary, which is available in RHCOS but not in FCOS/SCOS.

what does the scos stand for btw?!

CentOS Stream CoreOS

@alexk201
Author

TL;DR: I use FCOS Version 37.20221127.3.0

Detail:
I created a new ISO for release 4.12.0-0.okd-2023-03-18-084815, since this is the version listed in the recommended values you provided.

I downloaded the client tooling for that release again:

$ openshift-install version
openshift-install 4.12.0-0.okd-2023-03-18-084815
built from commit 4688870d3a709eea34fe2bb5d1c62dea2cfd7e91
release image quay.io/openshift/okd@sha256:7153ed89133eeaca94b5fda702c5709b9ad199ce4ff9ad1a0f01678d6ecc720f
release architecture amd64

Then I created the ISO using openshift-install agent create image and started the installation.

cat /etc/*release
Fedora release 37 (Thirty Seven)
NAME="Fedora Linux"
VERSION="37.20221127.3.0 (CoreOS)"
ID=fedora
VERSION_ID=37
VERSION_CODENAME=""
PLATFORM_ID="platform:f37"
PRETTY_NAME="Fedora CoreOS 37.20221127.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:37"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=37
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=37
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='37.20221127.3.0'
Fedora release 37 (Thirty Seven)
Fedora release 37 (Thirty Seven)

This version does not match the 'machine-os' component version from https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-03-18-084815 (37.20230218.3). I thought the openshift-installer would download the FCOS version matching the OKD release or am I missing something?

@alexk201
Author

I tried overriding the base ISO by manually downloading FCOS 37.20221225.3.0 and placing it in /root/.cache/agent/image_cache/coreos-x86_64.iso, but the installer does not accept it and overwrites it:

$ mkdir -p /root/.cache/agent/image_cache/
$ wget -qq https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/37.20221225.3.0/x86_64/fedora-coreos-37.20221225.3.0-live.x86_64.iso -O /root/.cache/agent/image_cache/coreos-x86_64.iso

$ openshift-install agent create image
[...]
msg=The file was found in cache: /root/.cache/agent/image_cache/coreos-x86_64.iso
level=info msg=Verifying cached file
level=debug msg=extracting /coreos/coreos-x86_64.iso.sha256 to /tmp/cache849938469, oc image extract --path /coreos/coreos-x86_64.iso.sha256:/tmp/cache849938469 --confirm --icsp-file=/tmp/icsp-file4064866473 quay.io/openshift/okd-content@sha256:feebbf5fdeaebbe67347255ee1aeef93e2e1da0e6da966deaaa095589cf0373d
level=debug msg=Cached file /root/.cache/agent/image_cache/coreos-x86_64.iso is not most recent
level=debug msg=extracting /coreos/coreos-x86_64.iso to /root/.cache/agent/image_cache, oc image extract --path /coreos/coreos-x86_64.iso:/root/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file4243370380 quay.io/openshift/okd-content@sha256:feebbf5fdeaebbe67347255ee1aeef93e2e1da0e6da966deaaa095589cf0373d
level=info msg=Base ISO obtained from release and cached at /root/.cache/agent/image_cache/coreos-x86_64.iso

Is there a way to override the ISO image? Why does the installer even want to use an older image?

@vrutkovs
Member

This version does not match the 'machine-os' component version from 4.12.0-0.okd-2023-03-18-084815 (release) (37.20230218.3). I thought the openshift-installer would download the FCOS version matching the OKD release or am I missing something?

These may differ. The installer has a hardcoded list of images to use, so it may not match what machine-os-content is based on. Once the machine boots, rpm-ostree updates from the installer ISO to the machine-os content - but in this case that operation is stopped by a bug in the installer-provided ISO.

There is a way to override the installer's initial ISO (via the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE environment variable).

This is already fixed in master (openshift/installer#6902 makes the installer use the latest F37 image with the auth bug fixed), but it has not reached 4.13 or 4.12 yet, sorry. So for now it's recommended to override the initial FCOS.

@alexk201
Author

alexk201 commented May 22, 2023

It seems like openshift-install does not pick up the mentioned environment variable (on my setup).
I tried exporting a couple of different values for OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE, but they are all ignored; for example:
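One of the attempts looked roughly like this (a sketch; the exact values varied, here using the FCOS 37.20221225.3.0 live ISO URL from above):

$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/37.20221225.3.0/x86_64/fedora-coreos-37.20221225.3.0-live.x86_64.iso
$ openshift-install agent create image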

The installer still uses the image extracted from the OKD release:

level=debug msg=Fetching image from OCP release (oc adm release info --image-for=machine-os-images --insecure=true --icsp-file=/tmp/icsp-file304177815 quay.io/openshift/okd@sha256:2b3d90157565bb1e227c1cd182154b498c4cf76360d8a57cc5d6d5a4a63794cb)
level=debug msg=extracting /coreos/coreos-x86_64.iso to /root/.cache/agent/image_cache, oc image extract --path /coreos/coreos-x86_64.iso:/root/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file685273286 quay.io/openshift/okd-content@sha256:feebbf5fdeaebbe67347255ee1aeef93e2e1da0e6da966deaaa095589cf0373d
level=info msg=Base ISO obtained from release and cached at /root/.cache/agent/image_cache/coreos-x86_64.iso
level=debug msg=Extracted base ISO image /root/.cache/agent/image_cache/coreos-x86_64.iso from release payload

@alexk201
Author

alexk201 commented May 22, 2023

Is the agent-based installer going to be the preferred installation method for disconnected environments, or could the on-premise assisted installer (as described in your blog post at https://vrutkovs.eu/posts/okd-disconnected-assisted/) be an alternative? The company I'm working at wants to migrate to Kubernetes with OpenShift as the preferred platform, but with strict network limitations (basically fully disconnected).

@vrutkovs
Member

If you're planning to install more clusters and manage them afterwards, the on-premise assisted installer would be preferred. The agent installer is more of a one-shot install for "cluster 0", which can then run apps / host the Assisted Service for other clusters, and so on.

@vrutkovs
Member

@andfasano do you remember whether OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE can be applied, or whether there is any other way to override the ISO in machine-os-images for OKD?

@andfasano

Note: at the moment ABI integration with OKD is not working. I've been able to prepare a successful working PoC in openshift/installer#7112 using an SCOS image for the connected environment, but for the disconnected environment we'll need to support such an image in machine-os-images (cc @sherine-k).
As for the FCOS image, it didn't work in my latest tests.
It looks like OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is not honored by ABI; I'll open a separate PR for that.

By the way, just for clarity, ABI is not just for one-shot installs; it aims to bring the ease of use of the Assisted Installer experience, especially to disconnected environments, while offering full automation and simplicity of use. For more info see https://cloud.redhat.com/blog/meet-the-new-agent-based-openshift-installer-1

@alexk201
Author

alexk201 commented May 22, 2023

Maybe I misunderstood something, but shouldn't it be possible to fix/update openshift-install with a later version of fcos.json to resolve my issue?

I actually tried this and simply replaced the contents of https://github.com/openshift/installer/blob/release-4.12/data/data/coreos/fcos.json with the latest version https://github.com/openshift/installer/blob/master/data/data/coreos/fcos.json. I can build and run the installer, but it then selects a release from rhcos.json. The generated ISO is bootable but creates a different error regarding a missing cluster-id.

What I tried:

git clone https://github.com/openshift/installer.git
cd installer
git checkout 4688870d3a # I used the exact same commit from the latest 4.12 okd release just to be sure...
rm data/data/coreos/fcos.json
wget https://raw.githubusercontent.com/openshift/installer/master/data/data/coreos/fcos.json -O data/data/coreos/fcos.json
bash hack/build.sh

I guess I built the binary for OpenShift, not OKD. There seems to be an isOKD switch, but I haven't figured out how to use it...

@vrutkovs
Member

env TAGS=okd hack/build.sh should do it (see here)
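Putting it together with the steps above, the full OKD build is then roughly (a sketch; same commit and fcos.json swap as in your snippet):

git clone https://github.com/openshift/installer.git
cd installer
git checkout 4688870d3a
# replace the pinned FCOS metadata with the current one from master
wget https://raw.githubusercontent.com/openshift/installer/master/data/data/coreos/fcos.json -O data/data/coreos/fcos.json
# build the OKD flavour of the installer
env TAGS=okd hack/build.sh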

@alexk201
Author

Exporting the TAGS environment variable solved the issue, and I am now able to build a modified version of the openshift-install CLI with the updated FCOS image hashes:

level=debug msg=The file was found in cache: /root/.cache/agent/image_cache/coreos-x86_64.iso
level=info msg=Verifying cached file
level=debug msg=Found matching hash in installer metadata
level=info msg=Using cached Base ISO /root/.cache/agent/image_cache/coreos-x86_64.iso

The assisted installer is now running on the FCOS version I specified; my latest attempt uses the new 4.13 release with Fedora CoreOS 38.20230414.3.0.

But there's still another issue: The generated ISO tries to install OCP, not OKD. Since I don't have credentials set up for OCP, I get the following error:

pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\"

I already tried overriding OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE and specifying a custom cluster-image-set.yaml. This results in a different error:

Error: No OS image for Openshift version 4.13.0-0.okd-2023-05-22-052007 and architecture x86_64: The requested OS image for version (4.13.0-0.okd-2023-05-22-052007) and CPU architecture (x86_64) isn't specified in OS images list" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterInfraEnvInternal.func1"

Maybe you've got another hint for me?!
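For context, a cluster-image-set.yaml of this kind is roughly the following (a sketch assuming the hive.openshift.io/v1 ClusterImageSet schema, with an example name; not necessarily the exact file I used):

cat > cluster-image-set.yaml <<'EOF'
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: okd-4.13.0-0.okd-2023-05-22-052007
spec:
  releaseImage: quay.io/openshift/okd:4.13.0-0.okd-2023-05-22-052007
EOF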

@vrutkovs
Member

Try running the prebuilt installer with OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift/okd:4.13.0-0.okd-2023-05-22-052007 env var?
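In command form that would be something like (illustrative):

$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift/okd:4.13.0-0.okd-2023-05-22-052007 openshift-install agent create image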

@alexk201
Author

The prebuilt installer still uses FCOS 37.20221127.3.0 (https://github.com/openshift/installer/blob/release-4.13/data/data/coreos/fcos.json)

@alexk201
Author

I tried it anyway and got the same error that I started with: auth.json: file exists

@andfasano

andfasano commented May 23, 2023

As for the FCOS image, it didn't work in my latest tests.

I don't think that part has been fixed yet. SCOS, on the other hand, works fine.

@alexk201
Author

Now using quay.io/okd/scos-content@sha256:116b7b210b1c1fd43fb9974e32c0c4923f29a3b581f444f8e33452cd9ad26ea4, with the same result:

INFO Host okd-sno: New image status quay.io/okd/scos-content@sha256:116b7b210b1c1fd43fb9974e32c0c4923f29a3b581f444f8e33452cd9ad26ea4. result: success. time: 2.88 seconds; size: 416.16 Megabytes; download rate: 151.66 MBps
INFO Host okd-sno: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
INFO Cluster installation in progress
INFO Host okd-sno: updated status from preparing-successful to installing (Installation is in progress)
INFO Host: okd-sno, reached installation stage Installing: bootstrap
INFO Host: okd-sno, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/okd/scos-content@sha256:4e5abfb4ca9de3c43f6a724f489f34bb3e53ff932e1f0887b9489303e98c88b8 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists"
INFO Cluster has hosts in error
INFO cluster has stopped installing... working to recover installation

@andfasano

Did you use the code from openshift/installer#7112?

@alexk201
Author

Nope, I used the tools from oc adm release extract --tools quay.io/okd/scos-release:4.13.0-0.okd-scos-2023-05-04-192252. Should I check out your branch and build the installer myself?

@andfasano

At the moment it's the only way I've tested that works, if you'd like to try out ABI with OKD. I've also opened this bug, https://issues.redhat.com/browse/OCPBUGS-13955, which will make it unnecessary to rebuild the installer from source.

@alexk201
Author

I tried that, now I'm back to this result:

No OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE:

pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\"

OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE quay.io/okd/scos-release:4.13.0-0.okd-scos-2023-05-04-192252

Error: No OS image for Openshift version 4.13.0-0.okd-scos-2023-05-04-192252 and architecture x86_64: The requested OS image for version (4.13.0-0.okd-scos-2023-05-04-192252) and CPU architecture (x86_64) isn't specified in OS images list

It does use SCOS, though:

level=debug msg=Obtaining RHCOS image file from 'https://okd-scos.s3.amazonaws.com/okd-scos/builds/414.9.202304170609-0/x86_64/scos-414.9.202304170609-0-live.x86_64.iso'
level=debug msg=Unpacking file into "/root/.cache/agent/image_cache/scos-414.9.202304170609-0-live.x86_64.iso"...

@alexk201
Author

I guess I'm just going to wait for this to be fixed in a future release. Thanks for your help and the fast response times, though :)
@vrutkovs @andfasano give these men a raise 🥇

@cgruver

cgruver commented Jun 13, 2023

@alexk201 There is another workaround that I am trying this morning.

So far it seems to be working. I discovered that I'm going to have to add IP reservations to my router, because nmstatectl will not run on a Mac... :-(

Anyway, I digress. The workaround is to extract the ignition config from the generated ISO, then use that ignition config to boot with the correct OS image from the release bundle.

I'll post more if I'm successful, but this might get you started:

Note: Apologies for all of the env vars. The snippets below are extracted from my lab scripts.

Create the ISO:

openshift-install --dir=${WORK_DIR}/okd-install-dir agent create image

Extract the ignition config:

coreos-installer iso ignition show agent.x86_64.iso > agent-install.ign

Extract the PXE boot artifacts:

KERNEL_URL=$(openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.pxe.kernel.location')
INITRD_URL=$(openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.pxe.initramfs.location')
ROOTFS_URL=$(openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.pxe.rootfs.location')

curl -o /usr/local/www/install/fcos/${OKD_RELEASE}/vmlinuz ${KERNEL_URL}
curl -o /usr/local/www/install/fcos/${OKD_RELEASE}/initrd ${INITRD_URL}
curl -o /usr/local/www/install/fcos/${OKD_RELEASE}/rootfs.img ${ROOTFS_URL}

I'm using iPXE, so my iPXE file looks something like:

#!ipxe

kernel http://${INSTALL_HOST_IP}/install/fcos/${OKD_RELEASE}/vmlinuz edd=off net.ifnames=1 ifname=nic0:${mac} ip=${ip_addr}::${DOMAIN_ROUTER}:${DOMAIN_NETMASK}:${hostname}:nic0:none nameserver=${DOMAIN_ROUTER} rd.neednet=1 coreos.inst.install_dev=/dev/sda coreos.inst.ignition_url=http://${INSTALL_HOST_IP}/install/fcos/agent-boot/${CLUSTER_NAME}.${DOMAIN}/agent-install.ign coreos.inst.platform_id=${platform} initrd=initrd initrd=rootfs.img ${CONSOLE_OPT}
initrd http://${INSTALL_HOST_IP}/install/fcos/${OKD_RELEASE}/initrd
initrd http://${INSTALL_HOST_IP}/install/fcos/${OKD_RELEASE}/rootfs.img

boot
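For completeness, the extracted agent-install.ign then just needs to be served at the ignition_url referenced in the kernel line above, e.g. (paths assumed from my lab layout):

mkdir -p /usr/local/www/install/fcos/agent-boot/${CLUSTER_NAME}.${DOMAIN}
cp agent-install.ign /usr/local/www/install/fcos/agent-boot/${CLUSTER_NAME}.${DOMAIN}/agent-install.ign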

@andfasano

FYI, openshift/installer#7211 landed, so it can now be used as a temporary workaround to deploy okd-scos in a (connected) environment. Note that this approach is meant for testing only, but it's at least a way to try it out until the proper integration with OKD is implemented.

I've been able to successfully set up a cluster using an installer extracted from 4.14.0-0.okd-scos-2023-06-14-054844:

$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://okd-scos.s3.amazonaws.com/okd-scos/builds/414.9.202305040609-0/x86_64/scos-414.9.202305040609-0-live.x86_64.iso
$ ./openshift-install agent create image
...
$ oc get clusterversions version                                               
NAME      VERSION                               AVAILABLE   PROGRESSING   SINCE   STATUS                                             
version   4.14.0-0.okd-scos-2023-06-14-054844   True        False         76s     Cluster version is 4.14.0-0.okd-scos-2023-06-14-054844

@cgruver

cgruver commented Jun 21, 2023

Attempting to install with the iPXE method that I mentioned above, I am seeing a different issue where the bootstrap fails to start:

time="2023-06-14T13:07:22Z" level=error msg="Failed to extract ignition to disk" error="failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift/okd-content@sha256:63d26a845541a486d1531b9601b4dc290916590c8e9e86a83228b27fb2c2d373 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput \"... 223] Writing systemd unit \"master-bmh-update.service\"\nF0614 13:07:22.553045    6570 start.go:145] error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.\""
time="2023-06-14T13:07:22Z" level=error msg="Failed to extract ignition to disk, giving up"
***
Omitted INFO logs
***
time="2023-06-14T13:07:41Z" level=error msg="Bootstrap failed failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift/okd-content@sha256:63d26a845541a486d1531b9601b4dc290916590c8e9e86a83228b27fb2c2d373 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput \"... 223] Writing systemd unit \"master-bmh-update.service\"\nF0614 13:07:22.553045    6570 start.go:145] error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.\""

Is the missing Systemd unit relevant?

This was attempting to install 4.13. Same error with FCOS and SCOS.

I'll try the same install method with a 4.14 nightly.

@andfasano

@cgruver FYI, we're currently working on adding support for PXE, so it will hopefully be available soon.

@cgruver

cgruver commented Jun 22, 2023

Need to monitor this one for completion too: openshift/installer#6619

That's the error that I'm hitting.

Thanks @vrutkovs

@alexk201
Author

alexk201 commented Oct 20, 2023

Are there any updates regarding this issue? I tried running the agent-based setup again in a disconnected environment and got different errors this time, which may or may not be a good sign.
I'm currently testing with the latest OKD release (4.13.0-0.okd-2023-09-30-084937) and FCOS 38.20230902.3.0. I'm using @cgruver's approach of extracting the ignition config using
coreos-installer iso ignition show ocp/agent.x86_64.iso > agent-install.ign
and embedding it in a more recent ISO using
coreos-installer iso ignition embed discovery.iso -i agent-install.ign

The installation fails before the first node can reboot. Two errors come up frequently during the process:

  1. error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.
  2. /usr/sbin/vgdisplay failed: 0 No volume groups found.

Thanks in advance

@JM1

JM1 commented Oct 20, 2023

@andfasano

Yeah, trying to land them 🤞. They will be required for supporting the FCOS setup.

@alexk201
Author

Thank you for the status update. These issues are blocking all agent-based installation platforms, right? So it doesn't matter whether I use bare metal or vSphere, for example?

@andfasano

Right

@JM1

JM1 commented Dec 13, 2023

In a great team effort with @andfasano, @vrutkovs and @aleskandro we managed to fix ABI for OKD/FCOS 🥳 Now, we "only" have to backport all those fixes to 4.14 etc.

@titou10titou10

Any chance the PR will pass the tests and be merged soon?

@alexk201
Author

Thanks for the update, keep up the great work!

@JM1

JM1 commented Jan 30, 2024

With the latest release of OKD/FCOS, 4.15.0-0.okd-2024-01-27-070424, ABI finally works 🥳

However, no new releases of OKD/FCOS 4.14 will be published. That said, there is no point in my backport to 4.14, so I closed it 😕

@andfasano

Thanks all and @JM1 @vrutkovs @aleskandro for the effort!

@JM1

JM1 commented Jan 30, 2024

@andfasano You also had a great part in this, so kudos to you too 🥂

@titou10titou10

titou10titou10 commented Jan 30, 2024

Indeed, the agent installer seems to work with the latest stable version of OKD: 4.15.0-0.okd-2024-01-27-070424
Thanks.
When the node boots for the first time, there is still the "problem" with selinux.service not starting, because "checkmodule" cannot be found on FCOS 39.20231101.3.0, which is used by the installer:

$ systemctl status selinux
selinux.service
     Loaded: loaded (/etc/systemd/system/selinux.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Tue 2024-01-30 14:39:23 UTC; 5min ago
        CPU: 3ms

Jan 30 14:39:23 localhost systemd[1]: Starting selinux.service...
Jan 30 14:39:23 localhost (ckmodule)[1358]: selinux.service: Failed to locate executable checkmodule: No such file or directory
Jan 30 14:39:23 localhost (ckmodule)[1358]: selinux.service: Failed at step EXEC spawning checkmodule: No such file or directory
Jan 30 14:39:23 localhost systemd[1]: selinux.service: Control process exited, code=exited, status=203/EXEC
Jan 30 14:39:23 localhost systemd[1]: selinux.service: Failed with result 'exit-code'.
Jan 30 14:39:23 localhost systemd[1]: Failed to start selinux.service.

However, this does not seem to cause any problems, and the installation succeeds...

@vrutkovs
Member

vrutkovs commented Jan 30, 2024

When the node boots for the first time, there is still the "problem" with selinux.service not starting, because "checkmodule" cannot be found on FCOS 39.20231101.3.0, which is used by the installer:

Right, this is common to all assisted flows and needs to be resolved in assisted-installer. The problem is that it would need to use the same RPM that is installed in RHCOS, which is not trivial.

Feel free to create a separate issue if you consider this worth tracking and fixing

@JM1

JM1 commented Jan 30, 2024

Luckily, the SELinux issue only affects the FCOS live ISO on first boot; the final installation to disk is fine. If you knew what hoops we had to jump through to get all live-ISO-based installers (ABI and non-ABI SNO) working for OKD/FCOS, you would not sleep at night... 🙊😉

JM1 added a commit to JM1/ansible-collection-jm1-cloudy that referenced this issue Jan 31, 2024
OKD/FCOS 4.15.0-0.okd-2024-01-27-070424 [0] is the first OKD release
which supports IPI, (non-ABI) SNO and ABI [1].

[0] https://github.com/okd-project/okd/releases/tag/4.15.0-0.okd-2024-01-27-070424
[1] okd-project/okd#1608