dqlite errors "failed to list /registry/events/<name>", cluster responding slow #4374
Comments
has anybody had a similar problem, or an idea about the cause of the issue?
Hi @alex-s-team, this smells similar to #3227. The root cause could be the same in both cases: a slow node. We are reviewing PR canonical/k8s-dqlite#83. Would you be interested in giving it a go and providing some feedback? Try it with:
thanks for the update. i cannot try the patch on the production cluster, but i will try it on a test cluster. after reading #3227 it definitely smells like the same issue. for the time being i mitigated the problem by draining and tainting the control plane nodes (should have done that anyhow) and moving the storage of the control plane nodes to dedicated nvmes.
Hi @ktsakalozos, I want to test this fix. In which release is this available? I tried using latest/edge but it seems broken for my installation (it gives an error on the microk8s join command).
Switching from
this also seems to be relevant for:
Hi, is there any update on this? I have the same problem with 1.32/stable on a single node cluster on Ubuntu 24.04.1 LTS. My journal is flooded with the error:
I even deleted the whole homeassistant namespace without result. Is there any way to fix this?
Summary
the issue surfaced in several microk8s clusters. the frequency of the errors in the system log increases as the response time of the cluster gets slower. in my test scenario the performance of the cluster decreased to the point where a node was marked "not ready".
i have tried a number of things to mitigate the error, but up to now, with little success:
maybe i am mistaken, but i have the impression that the dqlite database is somehow corrupted and that the error / skew increases over time.
in the process i learned a lot about k8s backends and how the cluster state is stored, but i was unable to fix this. once the cluster was completely unresponsive i restarted the cluster as a single node on different hardware, exported everything and imported the complete cluster into a clean install (which was the solution given by some of the posts with similar microk8s / dqlite issues).
Reproduction Steps
i tried to reproduce the symptoms with a test cluster i created for that purpose. i was able to reproduce the issue by adding a very slow control plane node and by performing an unclean shutdown on a control plane node in a different installation.
the result is:
microk8s.daemon-k8s-dqlite[663]: time="2024-01-18T14:39:04+01:00" level=error msg="failed to list /registry/events// for revision 83525265"
spamming syslog
in the end the cluster is in an unusable state.
the frequency increases with the number of object changes that happen in the cluster (new deployments, new namespaces, deletion of pods...).
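for anyone who wants to check whether the error rate really tracks cluster activity, here is a minimal sketch that counts these errors per minute. it assumes the log lines have exactly the shape of the one quoted above; the function name `count_list_errors_per_minute` and the regular expression are made up for illustration and are not part of microk8s or dqlite.

```python
import re
from collections import Counter

# Assumed line shape, taken from the single log line quoted in this issue:
#   time="<ISO timestamp>" level=error msg="failed to list <key> for revision <n>"
LINE_RE = re.compile(
    r'time="(?P<ts>[^"]+)"\s+level=error\s+'
    r'msg="failed to list (?P<key>\S+) for revision (?P<rev>\d+)"'
)


def count_list_errors_per_minute(lines):
    """Count 'failed to list' errors, grouped by minute of their timestamp."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not one of the dqlite list errors
        # The first 16 characters of the ISO timestamp give "YYYY-MM-DDTHH:MM".
        counts[match.group("ts")[:16]] += 1
    return counts


sample = [
    'microk8s.daemon-k8s-dqlite[663]: time="2024-01-18T14:39:04+01:00" '
    'level=error msg="failed to list /registry/events// for revision 83525265"',
    "some unrelated log line",
]
for minute, count in sorted(count_list_errors_per_minute(sample).items()):
    print(minute, count)
```

to use it on a real system, replace `sample` with the lines of a syslog or journal export; comparing the per-minute counts against the times of deployments or namespace changes should show whether the two move together.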
FIX?
is it possible to fix this? at that point, i have a backup of the complete cluster state, and i am willing to try out suggestions. there does not seem to be a good error recovery / restore procedure for dqlite atm (dump the entire database and import the database into a clean one for example).
thanks for the hard work. any feedback is highly appreciated.