Cluster is not running after enabling gpu addon #4258

Brav-o · 2023-10-18T17:55:29Z

Summary

I have a 4-node cluster that had been working for a few weeks. I tried to enable the GPU addon with microk8s enable gpu, and now I cannot start any node in the cluster; every attempt reports "microk8s is not running, try microk8s start".
I tried stopping and starting all nodes, but they just don't restart, and I can't run any kubectl commands (dial tcp 192.168.10.181:16443: connect: connection refused - error from a previous attempt: unexpected EOF).

What Should Happen Instead?

All nodes should start without problem

Introspection Report

$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Building the report tarball

inspection-report-20231018_175146.tar.gz

Are you interested in contributing with a fix?

I cannot

berkayoz · 2023-10-19T11:45:32Z

Hey @Brav-o,

Did this condition start right after enabling the gpu addon? Or did it happen a while after enabling the addon?
Based on this I'll try to reproduce and get back to you.

Many thanks.

Brav-o · 2023-10-19T17:36:47Z

Hi,
It happened right after enabling it.

berkayoz · 2023-10-22T13:01:51Z

Hey @Brav-o,
I was not able to reproduce the issue you described.

However, the inspection reports show that the storage backend is reporting some problems. To recover, could you try the instructions at https://microk8s.io/docs/restore-quorum, selecting one of your leader nodes (in your case, for example, 192.168.10.180)? You can then rejoin your other nodes, and hopefully this will help recover your cluster.
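
A quick way to see which nodes have their API server listening again, before and after the recovery, is to probe port 16443 on each one. The following is only a minimal sketch, assuming Python 3 is available on a machine that can reach the nodes. The function names api_port_open and probe_nodes are made up for illustration and are not part of MicroK8s. A successful TCP connection only shows that something is listening on the port, not that the cluster is healthy.

```python
import socket


def api_port_open(host: str, port: int = 16443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def probe_nodes(hosts, port: int = 16443, timeout: float = 2.0) -> dict:
    """Map each host (hypothetical list supplied by the caller) to its port status."""
    return {host: api_port_open(host, port, timeout) for host in hosts}
```

For example, probe_nodes(["192.168.10.180", "192.168.10.181"]) returns a dict mapping each address to True or False; a node that still shows "connection refused" maps to False.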

Brav-o · 2023-10-23T09:12:34Z

Hi,
I managed to recover the volumes I was using with Longhorn, and I just recreated the whole cluster.
I didn't know about this page for restoring a cluster, but I'll try it.
I still don't know why the GPU addon crashed the cluster, because I retried it with a backup of the VM and it worked without problems.
Thank you for your help :)

Brav-o closed this as completed Oct 23, 2023
