cluster installation fails using agent-based installer #1608

Closed
alexk201 opened this issue May 20, 2023 · 43 comments

@alexk201

Describe the bug
I am unable to install OKD on a VM with the agent-based installer. During the installation, I always receive the following error:

INFO Host okd-sno: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
INFO Cluster installation in progress
INFO Host okd-sno: updated status from preparing-successful to installing (Installation is in progress)
INFO Host: okd-sno, reached installation stage Installing: bootstrap
INFO Host: okd-sno, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/okd/scos-content@sha256:4e5abfb4ca9de3c43f6a724f489f34bb3e53ff932e1f0887b9489303e98c88b8 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists"
INFO Cluster has hosts in error
INFO cluster has stopped installing... working to recover installation

Since I can monitor the installation with openshift-install, SSH into the VM, etc., I find it unlikely to be a configuration error, but maybe I'm wrong here. I can confirm that the file "/run/ostree/auth.json" is created moments before the error occurs. The error is also reproducible when running the command manually on the VM:

$ sudo podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/okd/scos-content@sha256:4e5abfb4ca9de3c43f6a724f489f34bb3e53ff932e1f0887b9489303e98c88b8 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot
I0520 19:47:26.710198   18107 start.go:96] Version: machine-config-daemon-4.6.0-202006240615.p0-2008-g70aa0a56-dirty (70aa0a560c0b0a01093f695cb358a8749d30b3d2)
I0520 19:47:26.710219   18107 start.go:109] Calling chroot("/rootfs")
F0520 19:47:26.710595   18107 start.go:137] Failed to initialize single run daemon: error initializing rpm-ostree: Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists

When I manually delete the file and run the command again, it initially appears to work but then fails with a different error:

[...]
I0520 19:48:31.938276   18522 update.go:1484] Preset systemd unit zincati.service
I0520 19:48:31.938316   18522 file_writers.go:223] Writing systemd unit "install-to-disk.service"
F0520 19:48:31.947116   18522 start.go:145] error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.

Version
I tried multiple releases, from 4.12.0-0.okd-2023-03-18-084815 up to 4.13.0-0.okd-scos-2023-05-04-192252 (what does the "scos" stand for, by the way?)

How reproducible
100% reproducible

@alexk201
Author

I don't think it's related to my problem but I get the following when SSHing into the VM:

[systemd]
Failed Units: 1
  selinux.service
[core@okd-sno ~]$ journalctl -b -u selinux.service
May 20 21:17:05 localhost systemd[1]: Starting selinux.service...
May 20 21:17:05 localhost systemd[1152]: selinux.service: Failed to locate executable checkmodule: No such file or directory
May 20 21:17:05 localhost systemd[1152]: selinux.service: Failed at step EXEC spawning checkmodule: No such file or directory
May 20 21:17:05 localhost systemd[1]: selinux.service: Control process exited, code=exited, status=203/EXEC
May 20 21:17:05 localhost systemd[1]: selinux.service: Failed with result 'exit-code'.
May 20 21:17:05 localhost systemd[1]: Failed to start selinux.service.

@vrutkovs
Member

Which FCOS version are you using for the discovery ISO? See the recommended values at https://github.com/openshift/assisted-service/blob/master/deploy/podman/okd-configmap.yml#L30-L31

I don't think it's related to my problem but I get the following when SSHing into the VM:

This is known (might be worth a separate bug): assisted-installer tries to set custom SELinux rules, but it needs the "checkmodule" binary, which is available in RHCOS but not in FCOS/SCOS.

what does the scos stand for btw?!

CentOS Stream CoreOS

@alexk201
Author

TL;DR: I use FCOS Version 37.20221127.3.0

Detail:
I created a new ISO for release 4.12.0-0.okd-2023-03-18-084815, since this is the version listed in the recommended values you provided.

I downloaded the client tooling for that release again:

$ openshift-install version
openshift-install 4.12.0-0.okd-2023-03-18-084815
built from commit 4688870d3a709eea34fe2bb5d1c62dea2cfd7e91
release image quay.io/openshift/okd@sha256:7153ed89133eeaca94b5fda702c5709b9ad199ce4ff9ad1a0f01678d6ecc720f
release architecture amd64

Then I created the ISO using openshift-install agent create image and started the installation.

cat /etc/*release
Fedora release 37 (Thirty Seven)
NAME="Fedora Linux"
VERSION="37.20221127.3.0 (CoreOS)"
ID=fedora
VERSION_ID=37
VERSION_CODENAME=""
PLATFORM_ID="platform:f37"
PRETTY_NAME="Fedora CoreOS 37.20221127.3.0"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:37"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=37
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=37
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='37.20221127.3.0'
Fedora release 37 (Thirty Seven)
Fedora release 37 (Thirty Seven)

This version does not match the 'machine-os' component version from https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-03-18-084815 (37.20230218.3). I thought the openshift-installer would download the FCOS version matching the OKD release or am I missing something?

@alexk201
Author

I tried overriding the base ISO by manually downloading FCOS 37.20221225.3.0 and placing it in /root/.cache/agent/image_cache/coreos-x86_64.iso, but the installer does not accept it and overwrites it:

$ mkdir -p /root/.cache/agent/image_cache/
$ wget -qq https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/37.20221225.3.0/x86_64/fedora-coreos-37.20221225.3.0-live.x86_64.iso -O /root/.cache/agent/image_cache/coreos-x86_64.iso

$ openshift-install agent create image
[...]
msg=The file was found in cache: /root/.cache/agent/image_cache/coreos-x86_64.iso
level=info msg=Verifying cached file
level=debug msg=extracting /coreos/coreos-x86_64.iso.sha256 to /tmp/cache849938469, oc image extract --path /coreos/coreos-x86_64.iso.sha256:/tmp/cache849938469 --confirm --icsp-file=/tmp/icsp-file4064866473 quay.io/openshift/okd-content@sha256:feebbf5fdeaebbe67347255ee1aeef93e2e1da0e6da966deaaa095589cf0373d
level=debug msg=Cached file /root/.cache/agent/image_cache/coreos-x86_64.iso is not most recent
level=debug msg=extracting /coreos/coreos-x86_64.iso to /root/.cache/agent/image_cache, oc image extract --path /coreos/coreos-x86_64.iso:/root/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file4243370380 quay.io/openshift/okd-content@sha256:feebbf5fdeaebbe67347255ee1aeef93e2e1da0e6da966deaaa095589cf0373d
level=info msg=Base ISO obtained from release and cached at /root/.cache/agent/image_cache/coreos-x86_64.iso

Is there a way to override the ISO image? Why does the installer even want to use an older image?

@vrutkovs
Member

This version does not match the 'machine-os' component version from 4.12.0-0.okd-2023-03-18-084815 (release) (37.20230218.3). I thought the openshift-installer would download the FCOS version matching the OKD release or am I missing something?

These may differ. The installer has a hardcoded list of images to use, so it may not match what machine-os-content is based on. Once the machine boots, rpm-ostree updates from the installer ISO to the machine-os content - but in this case that operation is stopped by a bug in the installer-provided ISO.

There is a way to override the installer's initial ISO (via the OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE environment variable).

This is already fixed in master (openshift/installer#6902 makes the installer use the latest F37 image with the auth bug fixed), but it has not reached 4.13 or 4.12 yet, sorry. So for now it's recommended to override the initial FCOS.

@alexk201
Author

alexk201 commented May 22, 2023

It seems like openshift-install does not pick up the mentioned environment variable (on my setup).
I tried exporting a couple of different values for OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE, but they are all ignored; for example:
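One of the attempts looked roughly like this (a sketch; the exact values varied, here using the FCOS 37.20221225.3.0 live ISO URL from above):

$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/37.20221225.3.0/x86_64/fedora-coreos-37.20221225.3.0-live.x86_64.iso
$ openshift-install agent create image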

The installer still uses the image extracted from the OKD release:

level=debug msg=Fetching image from OCP release (oc adm release info --image-for=machine-os-images --insecure=true --icsp-file=/tmp/icsp-file304177815 quay.io/openshift/okd@sha256:2b3d90157565bb1e227c1cd182154b498c4cf76360d8a57cc5d6d5a4a63794cb)
level=debug msg=extracting /coreos/coreos-x86_64.iso to /root/.cache/agent/image_cache, oc image extract --path /coreos/coreos-x86_64.iso:/root/.cache/agent/image_cache --confirm --icsp-file=/tmp/icsp-file685273286 quay.io/openshift/okd-content@sha256:feebbf5fdeaebbe67347255ee1aeef93e2e1da0e6da966deaaa095589cf0373d
level=info msg=Base ISO obtained from release and cached at /root/.cache/agent/image_cache/coreos-x86_64.iso
level=debug msg=Extracted base ISO image /root/.cache/agent/image_cache/coreos-x86_64.iso from release payload

@alexk201
Author

alexk201 commented May 22, 2023

Is the agent-based installer going to be the preferred installation method for disconnected environments, or could the on-premise assisted installer (as described in your blog post at https://vrutkovs.eu/posts/okd-disconnected-assisted/) be an alternative? The company I'm working at wants to migrate to Kubernetes with OpenShift as the preferred platform, but with strict network limitations (basically fully disconnected).

@vrutkovs
Member

If you're planning to install more clusters and manage them afterwards, the on-premise assisted installer would be preferred. The agent installer is more of a one-shot install for "cluster 0", which can then run apps / host the Assisted Service for other clusters, and so on.

@vrutkovs
Member

@andfasano do you remember whether OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE can be applied, or whether there is any other way to override the ISO in machine-os-images for OKD?

@andfasano

Note: at the moment ABI integration with OKD is not working. I've been able to prepare a successful working PoC in openshift/installer#7112 using an SCOS image for the connected environment, but for the disconnected environment we'll need to support such an image in machine-os-images (cc @sherine-k).
As for the FCOS image, it didn't work in my latest tests.
It looks like OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is not honored by ABI; I'll open a separate PR for that.

By the way, just for clarity, ABI is not just for one-shot installs; it aims to bring the ease of use of the Assisted Installer experience, especially to disconnected environments, while offering full automation and simplicity of use. For more info see https://cloud.redhat.com/blog/meet-the-new-agent-based-openshift-installer-1

@alexk201
Author

alexk201 commented May 22, 2023

Maybe I misunderstood something, but shouldn't it be possible to fix/update openshift-install with a later version of fcos.json to resolve my issue?

I actually tried this and simply replaced the contents of https://github.com/openshift/installer/blob/release-4.12/data/data/coreos/fcos.json with the latest version https://github.com/openshift/installer/blob/master/data/data/coreos/fcos.json. I can build and run the installer, but it then selects a release from rhcos.json. The generated ISO is bootable but creates a different error regarding a missing cluster-id.

What I tried:

git clone https://github.com/openshift/installer.git
cd installer
git checkout 4688870d3a # I used the exact same commit from the latest 4.12 okd release just to be sure...
rm data/data/coreos/fcos.json
wget https://raw.githubusercontent.com/openshift/installer/master/data/data/coreos/fcos.json -O data/data/coreos/fcos.json
bash hack/build.sh

I guess I built the binary for OpenShift, not OKD. There seems to be an isOKD switch, but I haven't figured out how to use it...

@vrutkovs
Member

env TAGS=okd hack/build.sh should do it (see here)
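Putting it together with the steps above, the full OKD build is then roughly (a sketch; same commit and fcos.json swap as in your snippet):

git clone https://github.com/openshift/installer.git
cd installer
git checkout 4688870d3a
# replace the pinned FCOS metadata with the current one from master
wget https://raw.githubusercontent.com/openshift/installer/master/data/data/coreos/fcos.json -O data/data/coreos/fcos.json
# build the OKD flavour of the installer
env TAGS=okd hack/build.sh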

@alexk201
Author

Exporting the TAGS environment variable solved the issue, and I am now able to build a modified version of the openshift-install CLI with the updated FCOS image hashes:

level=debug msg=The file was found in cache: /root/.cache/agent/image_cache/coreos-x86_64.iso
level=info msg=Verifying cached file
level=debug msg=Found matching hash in installer metadata
level=info msg=Using cached Base ISO /root/.cache/agent/image_cache/coreos-x86_64.iso

The assisted installer is now running on the FCOS version I specified; my latest attempt uses the new 4.13 release with Fedora CoreOS 38.20230414.3.0.

But there's still another issue: The generated ISO tries to install OCP, not OKD. Since I don't have credentials set up for OCP, I get the following error:

pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\"

I already tried overriding OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE and specifying a custom cluster-image-set.yaml. This results in a different error:

Error: No OS image for Openshift version 4.13.0-0.okd-2023-05-22-052007 and architecture x86_64: The requested OS image for version (4.13.0-0.okd-2023-05-22-052007) and CPU architecture (x86_64) isn't specified in OS images list" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).RegisterInfraEnvInternal.func1"

Maybe you've got another hint for me?!
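For context, a cluster-image-set.yaml of this kind is roughly the following (a sketch assuming the hive.openshift.io/v1 ClusterImageSet schema, with an example name; not necessarily the exact file I used):

cat > cluster-image-set.yaml <<'EOF'
apiVersion: hive.openshift.io/v1
kind: ClusterImageSet
metadata:
  name: okd-4.13.0-0.okd-2023-05-22-052007
spec:
  releaseImage: quay.io/openshift/okd:4.13.0-0.okd-2023-05-22-052007
EOF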

@vrutkovs
Member

Try running the prebuilt installer with OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift/okd:4.13.0-0.okd-2023-05-22-052007 env var?
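In command form that would be something like (illustrative):

$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift/okd:4.13.0-0.okd-2023-05-22-052007 openshift-install agent create image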

@alexk201
Author

The prebuilt installer still uses FCOS 37.20221127.3.0 (https://github.com/openshift/installer/blob/release-4.13/data/data/coreos/fcos.json)

@alexk201
Author

I tried it anyway and got the same error that I started with: auth.json: file exists

@andfasano

andfasano commented May 23, 2023

As for the FCOS image, it didn't work in my latest tests.

I don't think that part has been fixed yet. SCOS, on the other hand, works fine.

@alexk201
Author

Now using quay.io/okd/scos-content@sha256:116b7b210b1c1fd43fb9974e32c0c4923f29a3b581f444f8e33452cd9ad26ea4, with the same result:

INFO Host okd-sno: New image status quay.io/okd/scos-content@sha256:116b7b210b1c1fd43fb9974e32c0c4923f29a3b581f444f8e33452cd9ad26ea4. result: success. time: 2.88 seconds; size: 416.16 Megabytes; download rate: 151.66 MBps
INFO Host okd-sno: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
INFO Cluster installation in progress
INFO Host okd-sno: updated status from preparing-successful to installing (Installation is in progress)
INFO Host: okd-sno, reached installation stage Installing: bootstrap
INFO Host: okd-sno, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/okd/scos-content@sha256:4e5abfb4ca9de3c43f6a724f489f34bb3e53ff932e1f0887b9489303e98c88b8 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput "Error while ensuring access to kublet config.json pull secrets: symlink /var/lib/kubelet/config.json /run/ostree/auth.json: file exists"
INFO Cluster has hosts in error
INFO cluster has stopped installing... working to recover installation

@andfasano

Did you use the code from openshift/installer#7112?

@alexk201
Author

Nope, I used the tools from oc adm release extract --tools quay.io/okd/scos-release:4.13.0-0.okd-scos-2023-05-04-192252. Should I check out your branch and build the installer myself?

@andfasano

At the moment it's the only way I've tested that works, if you'd like to try out ABI with OKD. I've also opened this bug, https://issues.redhat.com/browse/OCPBUGS-13955, which will make it unnecessary to rebuild the installer from source.

@alexk201
Author

I tried that, now I'm back to this result:

No OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE:

pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\"

OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE quay.io/okd/scos-release:4.13.0-0.okd-scos-2023-05-04-192252

Error: No OS image for Openshift version 4.13.0-0.okd-scos-2023-05-04-192252 and architecture x86_64: The requested OS image for version (4.13.0-0.okd-scos-2023-05-04-192252) and CPU architecture (x86_64) isn't specified in OS images list

It does use SCOS, though:

level=debug msg=Obtaining RHCOS image file from 'https://okd-scos.s3.amazonaws.com/okd-scos/builds/414.9.202304170609-0/x86_64/scos-414.9.202304170609-0-live.x86_64.iso'
level=debug msg=Unpacking file into "/root/.cache/agent/image_cache/scos-414.9.202304170609-0-live.x86_64.iso"...

@alexk201
Author

I guess I'm just going to wait for this to be fixed in a future release. Thanks for your help and the fast response times, though :)
@vrutkovs @andfasano give these men a raise 🥇

@cgruver

cgruver commented Jun 13, 2023

@alexk201 There is another workaround that I am trying this morning.

So far it seems to be working. I discovered that I'm going to have to add IP reservations to my router, because nmstatectl will not run on a Mac... :-(

Anyway, I digress. The workaround is to extract the ignition config from the generated ISO, then use that ignition config to boot with the correct OS image from the release bundle.

I'll post more if I'm successful, but this might get you started:

Note: Apologies for all of the env vars. The snippets below are extracted from my lab scripts.

Create the ISO:

openshift-install --dir=${WORK_DIR}/okd-install-dir agent create image

Extract the ignition config:

coreos-installer iso ignition show agent.x86_64.iso > agent-install.ign

Extract the PXE boot artifacts:

KERNEL_URL=$(openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.pxe.kernel.location')
INITRD_URL=$(openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.pxe.initramfs.location')
ROOTFS_URL=$(openshift-install coreos print-stream-json | jq -r '.architectures.x86_64.artifacts.metal.formats.pxe.rootfs.location')

curl -o /usr/local/www/install/fcos/${OKD_RELEASE}/vmlinuz ${KERNEL_URL}
curl -o /usr/local/www/install/fcos/${OKD_RELEASE}/initrd ${INITRD_URL}
curl -o /usr/local/www/install/fcos/${OKD_RELEASE}/rootfs.img ${ROOTFS_URL}

I'm using iPXE, so my iPXE file looks something like:

#!ipxe

kernel http://${INSTALL_HOST_IP}/install/fcos/${OKD_RELEASE}/vmlinuz edd=off net.ifnames=1 ifname=nic0:${mac} ip=${ip_addr}::${DOMAIN_ROUTER}:${DOMAIN_NETMASK}:${hostname}:nic0:none nameserver=${DOMAIN_ROUTER} rd.neednet=1 coreos.inst.install_dev=/dev/sda coreos.inst.ignition_url=http://${INSTALL_HOST_IP}/install/fcos/agent-boot/${CLUSTER_NAME}.${DOMAIN}/agent-install.ign coreos.inst.platform_id=${platform} initrd=initrd initrd=rootfs.img ${CONSOLE_OPT}
initrd http://${INSTALL_HOST_IP}/install/fcos/${OKD_RELEASE}/initrd
initrd http://${INSTALL_HOST_IP}/install/fcos/${OKD_RELEASE}/rootfs.img

boot
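For completeness, the extracted agent-install.ign then just needs to be served at the ignition_url referenced in the kernel line above, e.g. (paths assumed from my lab layout):

mkdir -p /usr/local/www/install/fcos/agent-boot/${CLUSTER_NAME}.${DOMAIN}
cp agent-install.ign /usr/local/www/install/fcos/agent-boot/${CLUSTER_NAME}.${DOMAIN}/agent-install.ign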

@andfasano

FYI, openshift/installer#7211 landed, so it can now be used as a temporary workaround to deploy okd-scos in a (connected) environment. Note that this approach is meant for testing only, but it's at least a way to try it out until the proper integration with OKD is implemented.

I've been able to successfully set up a cluster using an installer extracted from 4.14.0-0.okd-scos-2023-06-14-054844:

$ export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://okd-scos.s3.amazonaws.com/okd-scos/builds/414.9.202305040609-0/x86_64/scos-414.9.202305040609-0-live.x86_64.iso
$ ./openshift-install agent create image
...
$ oc get clusterversions version                                               
NAME      VERSION                               AVAILABLE   PROGRESSING   SINCE   STATUS                                             
version   4.14.0-0.okd-scos-2023-06-14-054844   True        False         76s     Cluster version is 4.14.0-0.okd-scos-2023-06-14-054844

@cgruver

cgruver commented Jun 21, 2023

Attempting to install with the iPXE method that I mentioned above, I am seeing a different issue where the bootstrap fails to start:

time="2023-06-14T13:07:22Z" level=error msg="Failed to extract ignition to disk" error="failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift/okd-content@sha256:63d26a845541a486d1531b9601b4dc290916590c8e9e86a83228b27fb2c2d373 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput \"... 223] Writing systemd unit \"master-bmh-update.service\"\nF0614 13:07:22.553045    6570 start.go:145] error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.\""
time="2023-06-14T13:07:22Z" level=error msg="Failed to extract ignition to disk, giving up"
***
Omitted INFO logs
***
time="2023-06-14T13:07:41Z" level=error msg="Bootstrap failed failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --net host --pid=host --volume /:/rootfs:rw --volume /usr/bin/rpm-ostree:/usr/bin/rpm-ostree --privileged --entrypoint /usr/bin/machine-config-daemon quay.io/openshift/okd-content@sha256:63d26a845541a486d1531b9601b4dc290916590c8e9e86a83228b27fb2c2d373 start --node-name localhost --root-mount /rootfs --once-from /opt/install-dir/bootstrap.ign --skip-reboot], Error exit status 255, LastOutput \"... 223] Writing systemd unit \"master-bmh-update.service\"\nF0614 13:07:22.553045    6570 start.go:145] error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.\""

Is the missing Systemd unit relevant?

This was attempting to install 4.13. Same error with FCOS and SCOS.

I'll try the same install method with a 4.14 nightly.

@andfasano

@cgruver FYI, we're currently working on adding support for PXE, so it will hopefully be available soon.

@cgruver

cgruver commented Jun 22, 2023

Need to monitor this one for completion too: openshift/installer#6619

That's the error that I'm hitting.

Thanks @vrutkovs

@alexk201
Author

alexk201 commented Oct 20, 2023

Are there any updates regarding this issue? I tried running the agent-based setup again in a disconnected environment and got different errors this time, which may or may not be a good sign.
I'm currently testing with the latest OKD release (4.13.0-0.okd-2023-09-30-084937) and FCOS 38.20230902.3.0. I'm using @cgruver's approach of extracting the ignition config using
coreos-installer iso ignition show ocp/agent.x86_64.iso > agent-install.ign
and embedding it in a more recent ISO using
coreos-installer iso ignition embed discovery.iso -i agent-install.ign

The installation fails before the first node can reboot. Two errors come up frequently during the process:

  1. error enabling units: Failed to enable unit: Unit file systemd-journal-gatewayd.socket does not exist.
  2. /usr/sbin/vgdisplay failed: 0 No volume groups found.

Thanks in advance

@JM1

JM1 commented Oct 20, 2023

@andfasano

Yeah, trying to land them 🤞. They will be required for supporting the FCOS setup.

@alexk201
Author

Thank you for the status update. These issues are blocking all agent-based installation platforms, right? So it doesn't matter whether I use bare metal or vSphere, for example?

@andfasano

Right

@JM1

JM1 commented Dec 13, 2023

In a great team effort with @andfasano, @vrutkovs and @aleskandro we managed to fix ABI for OKD/FCOS 🥳 Now, we "only" have to backport all those fixes to 4.14 etc.

@titou10titou10

Any chance the PR will pass the tests and be merged soon?

@alexk201
Author

Thanks for the update, keep up the great work!

@JM1

JM1 commented Jan 30, 2024

With the latest release of OKD/FCOS, 4.15.0-0.okd-2024-01-27-070424, ABI finally works 🥳

However, no new releases of OKD/FCOS 4.14 will be published. That said, there is no point in my backport to 4.14, so I closed it 😕

@andfasano

Thanks all and @JM1 @vrutkovs @aleskandro for the effort!

@JM1

JM1 commented Jan 30, 2024

@andfasano You also had a great part in this, so kudos to you too 🥂

@titou10titou10

titou10titou10 commented Jan 30, 2024

Indeed, the agent installer seems to work with the latest stable version of OKD: 4.15.0-0.okd-2024-01-27-070424
Thanks.
When the node boots for the first time, there is still the "problem" with selinux.service not starting, because "checkmodule" cannot be found on FCOS 39.20231101.3.0, which is used by the installer:

$ systemctl status selinux
selinux.service
     Loaded: loaded (/etc/systemd/system/selinux.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Tue 2024-01-30 14:39:23 UTC; 5min ago
        CPU: 3ms

Jan 30 14:39:23 localhost systemd[1]: Starting selinux.service...
Jan 30 14:39:23 localhost (ckmodule)[1358]: selinux.service: Failed to locate executable checkmodule: No such file or directory
Jan 30 14:39:23 localhost (ckmodule)[1358]: selinux.service: Failed at step EXEC spawning checkmodule: No such file or directory
Jan 30 14:39:23 localhost systemd[1]: selinux.service: Control process exited, code=exited, status=203/EXEC
Jan 30 14:39:23 localhost systemd[1]: selinux.service: Failed with result 'exit-code'.
Jan 30 14:39:23 localhost systemd[1]: Failed to start selinux.service.

However, this does not seem to cause any problems, and the installation succeeds...

@vrutkovs
Member

vrutkovs commented Jan 30, 2024

When the node boots for the first time, there is still the "problem" with selinux.service not starting, because "checkmodule" cannot be found on FCOS 39.20231101.3.0, which is used by the installer:

Right, this is common to all assisted flows and needs to be resolved in assisted-installer. The problem is that it would need to use the same RPM that is installed in RHCOS, which is not trivial.

Feel free to create a separate issue if you consider this worth tracking and fixing

@JM1

JM1 commented Jan 30, 2024

Luckily, the SELinux issue only affects the FCOS live ISO on first boot; the final installation to disk is fine. If you knew what hoops we had to jump through to get all live-ISO-based installers (ABI and non-ABI SNO) working for OKD/FCOS, you would not sleep at night... 🙊😉

JM1 added a commit to JM1/ansible-collection-jm1-cloudy that referenced this issue Jan 31, 2024
OKD/FCOS 4.15.0-0.okd-2024-01-27-070424 [0] is the first OKD release
which supports IPI, (non-ABI) SNO and ABI [1].

[0] https://github.com/okd-project/okd/releases/tag/4.15.0-0.okd-2024-01-27-070424
[1] okd-project/okd#1608