Flatcar doesn't boot on OpenStack #15385
I found part of the problem!

```yaml
additionalUserData:
  - name: ps_cloud_init.txt
    type: text/cloud-config
    content: |
      REDACTED
```

Without `additionalUserData` set, the instances boot but don't join the cluster.
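For context, `additionalUserData` is configured per instance group. A minimal sketch of where the field sits in a kOps InstanceGroup manifest — the group name and cloud-config body here are illustrative, not taken from the reporter's cluster:

```yaml
# Illustrative InstanceGroup fragment; names and content are hypothetical.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes-nova
spec:
  role: Node
  additionalUserData:
    - name: ps_cloud_init.txt
      type: text/cloud-config
      content: |
        #cloud-config
        # (redacted in the original report)
```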
Did you try that with a fresh cluster?
I tried it with a fresh cluster.
I'm not quite sure what causes the error, as the nodes seem fine:

```
master-az1-1-er3ovv ~ # systemctl --all --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed

master-az1-1-er3ovv ~ # ctr -n k8s.io c ls
CONTAINER                                                           IMAGE                                                                                                                RUNTIME
07246fe3adda81a248699108803e116b5260b4c7c391679d4e343967f0e25831    registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db                       io.containerd.runc.v2
12b4eddacab946e874dec0675e80c3e7cd81a52755ada184d4d9b7f9d6bf8330    registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db                       io.containerd.runc.v2
12e60b40ad10e604b863ae684c41271af426982a03aa97786cd3dafce0b6a6a4    registry.k8s.io/kube-controller-manager@sha256:23a76a71f2b39189680def6edc30787e40a2fe66e29a7272a56b426d9b116229    io.containerd.runc.v2
4688c12cdfcf9366fc8523409115494823a24b4f1ba0ccdb026d1230cef67e27    registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db                       io.containerd.runc.v2
54a21defe868f2f86bb588b5adb69d673879b06fd33f906b2fa6b558e6a38477    registry.k8s.io/etcdadm/etcd-manager@sha256:5ffb3f7cade4ae1d8c952251abb0c8bdfa8d4d9acb2c364e763328bd6f3d06aa        io.containerd.runc.v2
643563a3d3e4ab40fa49b632c144d918e9cad9d94e4bcd5d47e285923060024a    registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db                       io.containerd.runc.v2
678db0d6c86b5b694707dca9d0300d8d2107be82abb4fa36604e5c7799c139dd    registry.k8s.io/kube-controller-manager@sha256:23a76a71f2b39189680def6edc30787e40a2fe66e29a7272a56b426d9b116229    io.containerd.runc.v2
83da13e648f1d3b52dadfccb6f05c9cc9d7d28849aefd8797e0b70630daed1ca    registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db                       io.containerd.runc.v2
8bf86f696e1f9cc556100df803fb425217c0216af702d03722b46be078a11b40    registry.k8s.io/kube-apiserver@sha256:c8518e64657ff2b04501099d4d8d9dd402237df86a12f7cc09bf72c080fd9608             io.containerd.runc.v2
8e41f4eaa58fce83da9d6cd8a421efef04df9176d98f9e8f85bc48623fbefccd    registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db                       io.containerd.runc.v2
972810dc74091a0cb8bca9518e5cd401c5e2ba2595780e43cb3a9d9e78dc8fcd    registry.k8s.io/etcdadm/etcd-manager@sha256:5ffb3f7cade4ae1d8c952251abb0c8bdfa8d4d9acb2c364e763328bd6f3d06aa        io.containerd.runc.v2
af2af2a34bf1a442213495428cb00b35047512f115dec94dad92e776f8a75e06    registry.k8s.io/kube-proxy@sha256:42fe09174a5eb6b8bace3036fe253ed7f06be31d9106211dcc4a09f9fa99c79a                  io.containerd.runc.v2
c8feaf253772950062b921e4f59369aae6d988940b79fa32da14dc9977681bb0    registry.k8s.io/kops/kube-apiserver-healthcheck@sha256:547c6bf1edc798e64596aa712a5cfd5145df0f380e464437a9313c1f1ae29756    io.containerd.runc.v2
c9dfe8396146b76247b262085a7a701ac5ece72847fb72984d2778cb1d24b28d    registry.k8s.io/kube-scheduler@sha256:19712fa46b8277aafd416b75a3a3d90e133f44b8a4dae08e425279085dc29f7e             io.containerd.runc.v2
f6f69768c5571fe745d63c7ba0022ed91b010594363e3fb3d1a037ae358e02c5    registry.k8s.io/kube-apiserver@sha256:c8518e64657ff2b04501099d4d8d9dd402237df86a12f7cc09bf72c080fd9608             io.containerd.runc.v2
```

Kubelet constantly logs the following error:
Please let me know if you need more logs or info.
This means that your control plane is up and running.
I looked into it and can't find the issue :(
Could you try creating the cluster with `--dns=none`?
@zetaab Any idea on what may be wrong here?
no idea, I have not used flatcar (we are using ubuntu). I can try it tomorrow |
Creating the cluster with --dns=none doesn't seem to fix the issue. |
The goal is to understand why the failure happens. You are the only person with access to the logs.
The issue seems to stem from the fact that Flatcar uses the FQDN of the node as its hostname. The node registers itself using the short name, and then tries to authenticate itself against the control plane using the FQDN. This leads to errors in the kubelet logs.
I accessed the node and ran the following:

```shell
hostnamectl set-hostname "nodes-nova-x04eyu"
systemctl restart systemd-networkd
systemctl restart kubelet
```

After which the node joined:

```
root@openstack-antelope:~# kubectl get nodes
NAME                                  STATUS   ROLES           AGE     VERSION
control-plane-nova-rruwba.novalocal   Ready    control-plane   6m35s   v1.26.3
nodes-nova-x04eyu                     Ready    node            67s     v1.26.3
```

This seems to be an older issue that manifested on AWS as well: flatcar/Flatcar#707
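The manual workaround can be generalized: derive the short name by stripping everything after the first dot of the FQDN. A minimal sketch, using the example node name from this thread (on a real node you would start from `$(hostname)` and need root):

```shell
# Derive the short host name from the FQDN that Flatcar sets.
# The FQDN below is illustrative, taken from this thread's example node.
fqdn="nodes-nova-x04eyu.novalocal"
short="${fqdn%%.*}"    # strip the domain part after the first dot
echo "$short"          # → nodes-nova-x04eyu

# On the node itself, the fix from this thread would then be:
#   hostnamectl set-hostname "$short"
#   systemctl restart systemd-networkd kubelet
```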
As a side note, `kops validate cluster` continues to fail with:

```
root@openstack-antelope:~# kops validate cluster
Using cluster from kubectl context: my-cluster.k8s.local

Validating cluster my-cluster.k8s.local

INSTANCE GROUPS
NAME                 ROLE           MACHINETYPE   MIN   MAX   SUBNETS
control-plane-nova   ControlPlane   m1.medium     1     1     nova
nodes-nova           Node           m1.medium     1     1     nova

NODE STATUS
NAME                ROLE   READY
nodes-nova-x04eyu   node   True

VALIDATION ERRORS
KIND      NAME                                   MESSAGE
Machine   b776c5b9-85e2-423b-afb8-79c5b61883ef   machine "b776c5b9-85e2-423b-afb8-79c5b61883ef" has not yet joined cluster

Validation Failed
Error: validation failed: cluster not yet healthy
```

Even though the control plane node is up and running.
Hi folks, a short update. A fix for this issue has been merged in Flatcar and is now available in the nightly builds of the next alpha release. If you want to test it out, you can download it here: https://bincache.flatcar-linux.net/images/amd64/3602.0.0/flatcar_production_openstack_image.img.bz2 Keep in mind this is not a stable release. Thanks!
Thanks for the update @gabriel-samfira. Any thoughts / info about additional userdata for cloud-init not working?
Flatcar is normally configured using Ignition during first boot. To maintain compatibility with cloud-init based environments, it also has its own agent, coreos-cloudinit, which implements a subset of what cloud-init offers. The additional userdata feature in kOps uses the MIME multipart feature of cloud-init, which allows it to add multiple files inside userdata. This particular feature of cloud-init is not implemented in coreos-cloudinit. There are two options to get this working: either we implement multipart in coreos-cloudinit, or kOps switches to Ignition for Flatcar. What do you think would be the best path forward?
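For reference, the cloud-init multipart layout in question can be sketched as follows. The boundary string, filename, and cloud-config body are illustrative, not what kOps actually emits:

```shell
# Write an example of MIME multipart user data as cloud-init accepts it
# (and as coreos-cloudinit, at the time of this thread, did not parse).
cat > userdata.mime <<'EOF'
Content-Type: multipart/mixed; boundary="MIMEBOUNDARY"
MIME-Version: 1.0

--MIMEBOUNDARY
Content-Type: text/cloud-config
Content-Disposition: attachment; filename="ps_cloud_init.txt"

#cloud-config
hostname: example-node
--MIMEBOUNDARY--
EOF

grep -c 'text/cloud-config' userdata.mime    # → 1
```

Each `--MIMEBOUNDARY` section carries its own Content-Type, which is how one userdata blob can hold several files of different formats.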
So far the approach followed in similar efforts like CAPI support was to use Ignition (Fedora CoreOS and other Ignition users will also benefit from that).
At the moment, kOps doesn't have a way to know much about the distro image that is used before booting. It may be possible, but would require updating the implementation of all supported cloud providers. As things stand I see 3 possibilities:
Any thoughts about 2 & 3?
I think we can have both 2 & 3. The short term solution would be to have MIME multipart support in coreos-cloudinit. I will open a separate issue for adding Ignition support. The immediate issue reported here should be fixed (sans the `additionalUserData` feature) in the next alpha release.
A PR was created to add multipart support to coreos-cloudinit.
Thanks @gabriel-samfira. I appreciate the update.
I tested the newest Flatcar alpha image and kOps bootstrapped the cluster successfully. 👍
Multipart MIME support has been merged in the main branch of Flatcar. This will probably be part of the next alpha release. This means you'll be able to use `additionalUserData`.
Excellent. Thanks a lot @gabriel-samfira!
I encountered a similar issue with Flatcar using kOps 1.27. The static hostname assigned to the hosts has the `.novalocal` suffix appended, so you get errors like this on the worker nodes:
After manually changing the hostname, the node connects to the cluster without issue.
After fix:
This issue is fixed with the beta Flatcar release.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
From my side this can be closed. The current flatcar stable (3760.2.0) release works. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/kind bug

1. What kops version are you running? The command `kops version` will display this information.

Client version: 1.26.3

2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:11Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.9", GitCommit:"a1a87a0a2bcd605820920c6b0e618a8ab7d117d4", GitTreeState:"clean", BuildDate:"2023-04-12T12:08:36Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

OpenStack

4. What commands did you run? What is the simplest way to reproduce this issue?

-> Timeout

5. What happened after the commands executed?

Validation of the cluster never succeeds, as systemd bootup of the instances fails. A look at the console of the instances reveals that Flatcar's ignition-fetch.service fails to start:

6. What did you expect to happen?

Flatcar boots up normally.

7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

8. Anything else do we need to know?

I compared the user data generated by kOps and other tools (Gardener) and they appear to be using a completely different format.

kOps:

Gardener: