
cluster.yaml shows node that does not exist and is not listed with kubectl get nodes #4311

Open
goran-insby opened this issue Nov 19, 2023 · 9 comments

Comments

@goran-insby

goran-insby commented Nov 19, 2023

Summary

In cluster.yaml I can see the IP address of a node that was removed some time ago. The node listed in cluster.yaml is not shown by kubectl get nodes (which is correct).

The problem is that Calico constantly tries to contact this server, which I think leads to dqlite going crazy and spiking over 100%.

The situation looks like a "lost quorum", but the node is listed only in cluster.yaml and does not appear in kubectl get nodes, which is strange.

What Should Happen Instead?

The expected behaviour is that cluster.yaml is aligned with the output of kubectl get nodes.

Reproduction Steps

This is the situation on one cluster. I tried stopping all nodes, manually removing the IP address from cluster.yaml and starting them again, but unfortunately the entry came back in cluster.yaml, so I assume this is not the place to make the change.
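
Roughly what I tried, in commands (default snap paths; the manual edit did not stick):

microk8s stop                                                            # on every node
sudo vi /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml   # delete the stale entry by hand
microk8s start                                                           # on every node; the entry is back afterwards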

@ktsakalozos
Member

Hi @goran-insby, thank you for reporting this. Could you please describe in more detail how the cluster reached that state? What are the steps to follow to end up with a node that has been removed but is still in cluster.yaml?

Could you please share a microk8s inspect tarball, or at least the cluster.yaml? I am interested to know what the role of the offending node was.
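
For reference, the tarball can be generated with the first command below on one of the nodes; if sharing the full report is not possible, the cluster.yaml alone (second command) would already help:

microk8s inspect
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml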

@goran-insby
Author

goran-insby commented Nov 20, 2023

Hi @ktsakalozos, the role assigned in cluster.yaml is 2, which would mean spare, I think.
The node was part of the cluster, and was then removed from the cluster with "microk8s leave".
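
For reference, as far as I understand the docs, the full removal flow is supposed to be the leave followed by a remove-node on a surviving node (the node IP below is just a placeholder):

microk8s leave                    # on the departing node
microk8s remove-node <node-ip>    # afterwards, on one of the remaining nodes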

@zoc

zoc commented Apr 26, 2024

Hi @goran-insby,

Were you able to find a solution for this?

I'm in a similar situation: a node removed from the cluster but still appearing in cluster.yaml and, as a consequence, in the "datastore standby nodes" list when running microk8s status.

I have, however, not seen any weird behaviour from dqlite itself or Calico, but I would like to remove this entry to avoid potential future trouble. To answer @ktsakalozos's question, in my case the dead node has its role set to 1 in cluster.yaml:

- ID: 3297041220608546238
  Address: '[fd00:dead:babe:70::150]:19001'
  Role: 0
- ID: 6927563970833881373
  Address: '[fd00:dead:babe:70::151]:19001'
  Role: 0
- ID: 4732791891626147056
  Address: '[fd00:dead:babe:70::152]:19001'
  Role: 0
- ID: 5181569029997883717
  Address: '[fd00:dead:babe:70::211]:19001'
  Role: 1
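
For what it's worth, my understanding of the dqlite role values (which seems consistent with the list above and with the dead node showing up under "datastore standby nodes"):

Role: 0   # voter
Role: 1   # stand-by (the "datastore standby nodes" in microk8s status)
Role: 2   # spare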

I tried to force-remove the node, but with no luck, even though the last line of the output could be interpreted as the node being removed from dqlite:

root@k8s-1:~# microk8s remove-node '[fd00:dead:babe:70::211]' --force
Error from server (NotFound): nodes "[fd00:dead:babe:70::211]" not found
Node [fd00:dead:babe:70::211] does not exist in Kubernetes.
Attempting to remove [fd00:dead:babe:70::211] from dqlite.
Removing node entry found in dqlite.

Any help appreciated :)
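
The only other thing I can think of trying is the dqlite client that ships with the snap, to see what the datastore itself reports (default snap paths; I have not tried this yet, and I have not verified that the shell actually has a .remove command, so treat it as a guess):

sudo /snap/microk8s/current/bin/dqlite \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s
# then, inside the dqlite shell:
#   .cluster                 - list the members dqlite itself knows about
#   .remove <address>:19001  - (assumption) drop the stale member by address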

@goran-insby
Author

Hi @zoc, unfortunately no luck; a couple of months later I reinstalled the whole cluster. This extra entry in the configuration didn't seem to have a big effect on the cluster.

@d33psky

d33psky commented Jan 16, 2025

Same issue here. I can add something for @zoc that worked for me.

Problem state:

sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- ID: 3297041220608546238
  Address: 10.0.0.33:19001
  Role: 0
- ID: 9700371103767736432
  Address: 10.0.0.35:19001
  Role: 2

The following command allowed me to clean that up:

microk8s remove-node '10.0.0.35' --force
Error from server (NotFound): nodes "10.0.0.35" not found
Node 10.0.0.35 does not exist in Kubernetes.
Attempting to remove 10.0.0.35 from dqlite.
Removing node entry found in dqlite.

After which cluster.yaml showed that the node had been removed there too:

sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- ID: 3297041220608546238
  Address: 10.0.0.33:19001
  Role: 0
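
If you want a quick sanity check from the microk8s side as well, the removed address should also be gone from the datastore node lists reported by status:

microk8s status | grep -A 4 'datastore'    # master/standby lists should no longer mention 10.0.0.35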

@zoc

zoc commented Jan 16, 2025

Thanks @d33psky, but unless I'm blind I cannot see any difference from what I already mentioned and tried.

I tried again today after updating to the latest release, with no more luck.

@d33psky

d33psky commented Jan 16, 2025

Was your cluster.yaml updated?

Meanwhile, I've reproduced the add-node problem that caused this broken state:

  • on node 2: snap remove microk8s
  • on node 1: microk8s remove-node '10.0.0.35' --force # to fix cluster.yaml
  • on node 2: snap install microk8s --classic --channel=1.30 # that's the version this dev cluster needs to match prod
  • on node 1: microk8s.kubectl get node # lists only node 1 (itself), as expected
  • on node 2: microk8s.kubectl get node # lists only node 2 (itself), as expected
  • on node 1: microk8s add-node
  • on node 2: microk8s join 10.0.0.33:25000/3d3e7bbd3b828ce9c3ab19fcbea40b18/84ba49271d10

The last command produced

Contacting cluster at 10.0.0.33
Waiting for this node to finish joining the cluster. .. .. .. ..
Successfully joined the cluster.

And then we have the problem state back:

  • on node 1: kubectl get node # only shows node 1 (it should now list 2 nodes)
  • on node 1: cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml # shows 2 nodes, with a new ID for the second node; this part looks right
  • on node 2: microk8s.kubectl get node # also only shows node 1, not itself; this too should list 2 nodes

Repeating the last command occasionally shows

The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?

and a second later reports (only) node 1.
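
In case it helps with debugging, these are the logs I would watch on node 2 while the join is happening (service names as installed by the snap; adjust if yours differ):

sudo journalctl -f -u snap.microk8s.daemon-kubelite     # API server / kubelet side, for the refused 16443 connections
sudo journalctl -f -u snap.microk8s.daemon-k8s-dqlite   # datastore side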

@d33psky

d33psky commented Jan 16, 2025

Hey @ktsakalozos, is there anything you want me to test on this cluster? I probably have a few hours tomorrow to do so before I have to wipe both servers in an attempt to create a working cluster.
The hardware is two Xeon servers with 32 GB of RAM and SSDs. Node 1 has a bunch of pods running, which I suspect creates some delay that add-node cannot handle.

@zoc

zoc commented Jan 16, 2025

Was your cluster.yaml updated?

No, it wasn't.
