
cluster.yaml shows node that does not exist and is not listed with kubectl get nodes #4311

Open
goran-insby opened this issue Nov 19, 2023 · 9 comments

Comments

@goran-insby

goran-insby commented Nov 19, 2023

Summary

In cluster.yaml I can see the IP address of a node that was removed some time ago. The node listed in cluster.yaml is not shown by kubectl get nodes (which is correct).

The problem is that Calico constantly tries to contact this server, which I think leads to dqlite going crazy and spiking over 100%.

The situation looks like a "lost quorum", but the node is listed only in cluster.yaml and does not appear in kubectl get nodes, which is strange.

What Should Happen Instead?

The expected behaviour is that cluster.yaml is aligned with the output of kubectl get nodes.

Reproduction Steps

This is the situation on one cluster. I tried stopping all nodes, manually removing the IP address from cluster.yaml and starting them again, but unfortunately the entry came back in cluster.yaml, so I assume this is not the place to make the change.
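
Roughly what I tried, in commands (default snap paths; the manual edit did not stick):

microk8s stop                                                            # on every node
sudo vi /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml   # delete the stale entry by hand
microk8s start                                                           # on every node; the entry is back afterwards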

@ktsakalozos
Member

Hi @goran-insby, thank you for reporting this. Could you please describe in more detail how the cluster reached that state? What are the steps to follow to end up with a node that has been removed but is still in cluster.yaml?

Could you please share a microk8s inspect tarball, or at least the cluster.yaml? I am interested to know what the role of the offending node was.
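
For reference, the tarball can be generated with the first command below on one of the nodes; if sharing the full report is not possible, the cluster.yaml alone (second command) would already help:

microk8s inspect
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml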

@goran-insby
Author

goran-insby commented Nov 20, 2023

Hi @ktsakalozos, the role assigned in cluster.yaml is 2, which would mean spare, I think.
The node was part of the cluster, and was then removed from the cluster with "microk8s leave".
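
For reference, as far as I understand the docs, the full removal flow is supposed to be the leave followed by a remove-node on a surviving node (the node IP below is just a placeholder):

microk8s leave                    # on the departing node
microk8s remove-node <node-ip>    # afterwards, on one of the remaining nodes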

@zoc

zoc commented Apr 26, 2024

Hi @goran-insby,

Were you able to find a solution for this?

I'm in a similar situation: a node removed from the cluster but still appearing in cluster.yaml and, as a consequence, in the "datastore standby nodes" list when running microk8s status.

I have, however, not seen any weird behaviour from dqlite itself or Calico, but I would like to remove this entry to avoid potential future trouble. To answer @ktsakalozos's question, in my case the dead node has its role set to 1 in cluster.yaml:

- ID: 3297041220608546238
  Address: '[fd00:dead:babe:70::150]:19001'
  Role: 0
- ID: 6927563970833881373
  Address: '[fd00:dead:babe:70::151]:19001'
  Role: 0
- ID: 4732791891626147056
  Address: '[fd00:dead:babe:70::152]:19001'
  Role: 0
- ID: 5181569029997883717
  Address: '[fd00:dead:babe:70::211]:19001'
  Role: 1
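
For what it's worth, my understanding of the dqlite role values (which seems consistent with the list above and with the dead node showing up under "datastore standby nodes"):

Role: 0   # voter
Role: 1   # stand-by (the "datastore standby nodes" in microk8s status)
Role: 2   # spare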

I tried to force-remove the node, but with no luck, even though the last line of the output could be interpreted as the node being removed from dqlite:

root@k8s-1:~# microk8s remove-node '[fd00:dead:babe:70::211]' --force
Error from server (NotFound): nodes "[fd00:dead:babe:70::211]" not found
Node [fd00:dead:babe:70::211] does not exist in Kubernetes.
Attempting to remove [fd00:dead:babe:70::211] from dqlite.
Removing node entry found in dqlite.

Any help appreciated :)
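
The only other thing I can think of trying is the dqlite client that ships with the snap, to see what the datastore itself reports (default snap paths; I have not tried this yet, and I have not verified that the shell actually has a .remove command, so treat it as a guess):

sudo /snap/microk8s/current/bin/dqlite \
  -s file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  -c /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  -k /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  k8s
# then, inside the dqlite shell:
#   .cluster                 - list the members dqlite itself knows about
#   .remove <address>:19001  - (assumption) drop the stale member by address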

@goran-insby
Author

Hi @zoc, unfortunately no luck; a couple of months later I reinstalled the whole cluster. This extra entry in the configuration didn't seem to have a big effect on the cluster.

@d33psky

d33psky commented Jan 16, 2025

Same issue here. I can add something for @zoc that worked for me.

Problem state:

sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- ID: 3297041220608546238
  Address: 10.0.0.33:19001
  Role: 0
- ID: 9700371103767736432
  Address: 10.0.0.35:19001
  Role: 2

The following command allowed me to clean that up:

microk8s remove-node '10.0.0.35' --force
Error from server (NotFound): nodes "10.0.0.35" not found
Node 10.0.0.35 does not exist in Kubernetes.
Attempting to remove 10.0.0.35 from dqlite.
Removing node entry found in dqlite.

After which cluster.yaml showed that the node had been removed there too:

sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml
- ID: 3297041220608546238
  Address: 10.0.0.33:19001
  Role: 0
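
If you want a quick sanity check from the microk8s side as well, the removed address should also be gone from the datastore node lists reported by status:

microk8s status | grep -A 4 'datastore'    # master/standby lists should no longer mention 10.0.0.35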

@zoc

zoc commented Jan 16, 2025

Thanks @d33psky, but unless I'm blind I cannot see any difference from what I already mentioned and tried.

I tried again today after updating to the latest release, with no more luck.

@d33psky

d33psky commented Jan 16, 2025

Was your cluster.yaml updated?

Meanwhile, I've reproduced the add-node problem that caused this broken state:

  • on node 2: snap remove microk8s
  • on node 1: microk8s remove-node '10.0.0.35' --force # to fix cluster.yaml
  • on node 2: snap install microk8s --classic --channel=1.30 # that's the version this dev cluster needs to match prod
  • on node 1: microk8s.kubectl get node # lists only node 1 (itself), as expected
  • on node 2: microk8s.kubectl get node # lists only node 2 (itself), as expected
  • on node 1: microk8s add-node
  • on node 2: microk8s join 10.0.0.33:25000/3d3e7bbd3b828ce9c3ab19fcbea40b18/84ba49271d10

The last command produced

Contacting cluster at 10.0.0.33
Waiting for this node to finish joining the cluster. .. .. .. ..
Successfully joined the cluster.

And then we have the problem state back:

  • on node 1: kubectl get node # only shows node 1 (it should now list 2 nodes)
  • on node 1: cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml # shows 2 nodes, with a new ID for the second node; this part looks right
  • on node 2: microk8s.kubectl get node # also only shows node 1, not itself; this too should list 2 nodes

Repeating the last command occasionally shows

The connection to the server 127.0.0.1:16443 was refused - did you specify the right host or port?

and a second later reports (only) node 1.
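
In case it helps with debugging, these are the logs I would watch on node 2 while the join is happening (service names as installed by the snap; adjust if yours differ):

sudo journalctl -f -u snap.microk8s.daemon-kubelite     # API server / kubelet side, for the refused 16443 connections
sudo journalctl -f -u snap.microk8s.daemon-k8s-dqlite   # datastore side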

@d33psky

d33psky commented Jan 16, 2025

Hey @ktsakalozos, is there anything you want me to test on this cluster? I probably have a few hours tomorrow to do so before I have to wipe both servers in an attempt to create a working cluster.
The hardware is two Xeon servers with 32 GB of RAM and SSDs. Node 1 has a bunch of pods running, which I suspect creates some delay that add-node cannot handle.

@zoc

zoc commented Jan 16, 2025

Was your cluster.yaml updated?

No, it wasn't.
