Cluster is not running after enabling gpu addon #4258

Closed
Brav-o opened this issue Oct 18, 2023 · 4 comments

Comments

@Brav-o

Brav-o commented Oct 18, 2023

Summary

I have a 4-node cluster that had been working for a few weeks. I tried to enable the GPU addon with "microk8s enable gpu", and now I cannot start any node in the cluster: "microk8s is not running, try microk8s start".
I tried stopping and starting all nodes, but they just don't restart, and I can't run any kubectl commands (dial tcp 192.168.10.181:16443: connect: connection refused - error from a previous attempt: unexpected EOF).
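
The "connection refused" part of that error means nothing was accepting TCP connections on the API server port at that moment. A minimal sketch for checking this from another machine, assuming Python 3 is available; the helper name `api_port_open` is hypothetical, and the default port is the one shown in the error message:

```python
import socket


def api_port_open(host: str, port: int = 16443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `api_port_open("192.168.10.181")` against the node from the error message would return False while the API server is down. A True result only shows that something is listening on the port, not that the cluster is healthy.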

What Should Happen Instead?

All nodes should start without problem

Introspection Report

$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Building the report tarball

inspection-report-20231018_175146.tar.gz
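
When comparing reports from several nodes, the service lines above can be pulled out of the inspect output programmatically. This is a rough sketch, not part of MicroK8s: the function names are hypothetical, and the line format is assumed from the output shown in this report, including the assumption that any other state is printed after the word "is".

```python
import re

# Matches lines such as "  Service snap.microk8s.daemon-kubelite is running".
# Assumption: other states follow the same "Service <name> is <state>" shape.
SERVICE_LINE = re.compile(r"Service\s+(\S+)\s+is\s+(.+?)\s*$")


def service_states(inspect_output: str) -> dict[str, str]:
    """Map each service named in the inspect output to its reported state."""
    states = {}
    for line in inspect_output.splitlines():
        match = SERVICE_LINE.search(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


def not_running(inspect_output: str) -> list[str]:
    """Return services whose reported state is anything other than 'running'."""
    return [name for name, state in service_states(inspect_output).items()
            if state != "running"]
```

In this report all five services are listed as running even though the API server refused connections, so an empty result from `not_running` is only a first filter and does not rule out a problem in the storage backend.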

Are you interested in contributing with a fix?

I cannot

@berkayoz
Member

Hey @Brav-o,

Did this condition start right after enabling the gpu addon? Or did it happen a while after enabling the addon?
Based on this I'll try to reproduce and get back to you.

Many thanks.

@Brav-o
Author

Brav-o commented Oct 19, 2023

Hi,
It happened right after enabling it.

@berkayoz
Member

Hey @Brav-o,
I was not able to reproduce the issue you described.

However, the inspection reports show that the storage backend is reporting some problems. To recover, could you try the instructions at https://microk8s.io/docs/restore-quorum, selecting one of your leader nodes (in your case, for example, 192.168.10.180)? You can then rejoin your other nodes, which should hopefully help recover your cluster.

@Brav-o
Author

Brav-o commented Oct 23, 2023

Hi,
I managed to recover the volumes I was using with Longhorn, and I just recreated the whole cluster.
I didn't know about this page for restoring a cluster, but I'll try it.
I still don't know why the GPU add-on crashed the cluster: I retried it with a backup of the VM and it worked without problems.
Thank you for your help :)

Brav-o closed this as completed on Oct 23, 2023