dqlite errors "failed to list /registry/events/<name>", cluster responding slow #4374

Open
rlx-unlimited opened this issue Jan 18, 2024 · 7 comments


@rlx-unlimited

rlx-unlimited commented Jan 18, 2024

Summary

the issue surfaced in several microk8s clusters. the errors show up more frequently in the system log as the response time of the cluster gets slower. in my test scenario the performance of the cluster degraded to the point where a node was marked "not ready".

i have tried a number of things to mitigate the error, but so far with little success.

maybe i am mistaken, but i have the impression that the dqlite database is somehow corrupted and that the error / skew increases over time.

in the process i learned a lot about k8s backends and how the cluster state is stored, but i was unable to fix this. once the cluster was completely unresponsive, i restarted it as a single node on different hardware, exported everything and imported the complete cluster into a clean install (which was the solution given in some of the posts about similar issues with microk8s / dqlite).

Reproduction Steps

i tried to reproduce the symptoms with a test cluster i created for that purpose. i was able to reproduce the issue by adding a very slow control plane node and by performing an unclean shutdown on a control plane node in a different installation.

the result is:
microk8s.daemon-k8s-dqlite[663]: time="2024-01-18T14:39:04+01:00" level=error msg="failed to list /registry/events// for revision 83525265"
spamming syslog

in the end the cluster is in an unusable state.

the frequency increases with the number of object changes, new deployments, new namespaces, deletion of pods... that happen in the cluster.
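to put a number on "the frequency increases", here is a minimal sketch that buckets the error lines per minute. it assumes the log format stays exactly as in the line quoted above; the function name `errors_per_minute` and the sample list are only illustrative and not part of microk8s or dqlite.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the time field of a k8s-dqlite "failed to list" error line,
# assuming the format of the line quoted in this issue.
ERROR_RE = re.compile(
    r'time="(?P<time>[^"]+)"\s+level=error\s+msg="failed to list \S+ for revision \d+"'
)


def errors_per_minute(lines):
    """Count 'failed to list' errors per minute (timestamp truncated to the minute)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue  # not one of the errors we are interested in
        stamp = match.group("time")
        # Older Python versions do not accept a trailing "Z" in fromisoformat().
        if stamp.endswith("Z"):
            stamp = stamp[:-1] + "+00:00"
        moment = datetime.fromisoformat(stamp)
        counts[moment.strftime("%Y-%m-%dT%H:%M")] += 1
    return counts


# Sample input: the line quoted above plus an unrelated line that is ignored.
SAMPLE_LINES = [
    'microk8s.daemon-k8s-dqlite[663]: time="2024-01-18T14:39:04+01:00" '
    'level=error msg="failed to list /registry/events// for revision 83525265"',
    "some unrelated syslog line",
]

for minute, count in sorted(errors_per_minute(SAMPLE_LINES).items()):
    print(minute, count)
```

feeding it the syslog lines collected while creating deployments or deleting pods should show whether the error rate really follows the number of object changes.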

FIX?

is it possible to fix this? at that point, i have a backup of the complete cluster state, and i am willing to try out suggestions. there does not seem to be a good error recovery / restore procedure for dqlite atm (dump the entire database and import the database into a clean one for example).

thanks for the hard work. any feedback is highly appreciated.

@rlx-unlimited
Author

does anybody have a similar problem or an idea about the cause of the issue?

@ktsakalozos
Member

Hi @alex-s-team this smells similar to #3227. The root cause could be the same in both cases: a slow node. We are reviewing this PR: canonical/k8s-dqlite#83. Would you be interested in giving it a go and providing some feedback? Try it with: sudo snap install microk8s --classic --channel=latest/edge/dqlite-list

@rlx-unlimited
Author

thanks for the update. i cannot try the patch on the production cluster, but i will try on a test cluster. after reading #3227 it definitely smells like the same issue. for the time being i mitigated the problem by draining and tainting the control plane nodes (should have done that anyhow) and moving the storage of the control plane nodes to dedicated nvmes.

@dfry

dfry commented Mar 1, 2024

Hi @ktsakalozos, I want to test this fix. In which release is it available? I tried using latest/edge, but it seems broken for my installation (it gives an error on the microk8s join command).

@itsyoshio

itsyoshio commented Mar 4, 2024

> Hi @alex-s-team this smells similar to #3227. The root cause could be the same in both cases: a slow node. We are reviewing this PR: canonical/k8s-dqlite#83. Would you be interested in giving it a go and providing some feedback? Try it with: sudo snap install microk8s --classic --channel=latest/edge/dqlite-list

Switching from 1.28.3 to latest/edge/dqlite-list fixed my (mostly) unresponsive cluster completely.

sudo snap refresh microk8s --channel=latest/edge/dqlite-list

this also seems to be relevant for:

canonical/k8s-dqlite#82
#4307

@alyssaruth

I believe we've hit this problem too - we've certainly observed the same dqlite logging:

Mar 27 16:41:10 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:10+01:00" level=error msg="failed to list /registry/metallb.io/l2advertisements/ for revision 833981"
Mar 27 16:41:10 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:10+01:00" level=error msg="failed to list /registry/statefulsets/ for revision 834818"
Mar 27 16:41:11 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:11+01:00" level=error msg="failed to list /registry/crd.projectcalico.org/bgppeers/ for revision 834055"
Mar 27 16:41:12 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:12+01:00" level=error msg="failed to list /registry/csinodes/ for revision 833911"
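To see which registry keys are affected and over which revision range, a small sketch like the following can summarize these lines. It assumes the message format stays exactly as logged above; `summarize_by_key` is a name made up for this example, not something shipped with microk8s.

```python
import re
from collections import defaultdict

# Pulls the registry key and revision out of a "failed to list" message,
# assuming the format of the log lines quoted in this comment.
MSG_RE = re.compile(r'msg="failed to list (?P<key>\S+) for revision (?P<rev>\d+)"')


def summarize_by_key(lines):
    """Map each registry key to (error count, lowest revision, highest revision)."""
    revisions = defaultdict(list)
    for line in lines:
        match = MSG_RE.search(line)
        if match:
            revisions[match.group("key")].append(int(match.group("rev")))
    return {key: (len(revs), min(revs), max(revs)) for key, revs in revisions.items()}


# The four log lines quoted above, used as sample input.
SAMPLE = [
    'Mar 27 16:41:10 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:10+01:00" level=error msg="failed to list /registry/metallb.io/l2advertisements/ for revision 833981"',
    'Mar 27 16:41:10 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:10+01:00" level=error msg="failed to list /registry/statefulsets/ for revision 834818"',
    'Mar 27 16:41:11 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:11+01:00" level=error msg="failed to list /registry/crd.projectcalico.org/bgppeers/ for revision 834055"',
    'Mar 27 16:41:12 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:12+01:00" level=error msg="failed to list /registry/csinodes/ for revision 833911"',
]

for key, (count, lowest, highest) in sorted(summarize_by_key(SAMPLE).items()):
    print(f"{key}: {count} error(s), revisions {lowest}..{highest}")
```

In our case the errors are spread across unrelated resource types rather than concentrated on one key, which a summary like this makes easy to confirm on a longer log.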

Our cluster becomes unresponsive and we see maxed out swap usage and full memory+I/O pressure:
[image]

The dqlite PR was merged back in February, and PRs were merged to backport it into 1.27, 1.28 etc around then too. However, I believe based on dates we still don't have the fix - our cluster is running 1.27.11, and according to snap this is from 19th February:

  1.27/stable:           v1.27.11 2024-02-20 (6542) 178MB classic
  1.27/candidate:        v1.27.11 2024-02-19 (6542) 178MB classic
  1.27/beta:             v1.27.11 2024-02-19 (6542) 178MB classic
  1.27/edge:             v1.27.12 2024-03-15 (6680) 178MB classic

Looks like there was a 1.27.12 cut this month - am I right in thinking that --channel=1.27/edge should now be sufficient for pulling in this change? Is there any easy way for me to verify the installed version of k8s-dqlite directly? I've found the binary, but sadly it doesn't support a --version flag... and there aren't patch notes for minor versions of microk8s either 😭
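For the date comparison, here is a rough sketch that parses the channel listing above and reports which channels were released after a given cutoff. The cutoff used here (end of February 2024) is only an assumption standing in for the backport merge date, which I don't know exactly, and `channels_released_after` is a name invented for this example. A later release date only suggests the fix could be included; it doesn't prove it.

```python
import re
from datetime import date

# Matches lines of the channel listing quoted above, e.g.
# "  1.27/edge:   v1.27.12 2024-03-15 (6680) 178MB classic"
CHANNEL_RE = re.compile(
    r"^\s*(?P<channel>\S+):\s+(?P<version>v\S+)\s+"
    r"(?P<released>\d{4}-\d{2}-\d{2})\s+\((?P<revision>\d+)\)"
)


def channels_released_after(listing, cutoff):
    """Return (channel, version, release date) for channels released after cutoff."""
    result = []
    for line in listing.splitlines():
        match = CHANNEL_RE.match(line)
        if match is None:
            continue  # skip anything that is not a channel line
        released = date.fromisoformat(match.group("released"))
        if released > cutoff:
            result.append((match.group("channel"), match.group("version"), released))
    return result


# The listing quoted above.
LISTING = """\
  1.27/stable:           v1.27.11 2024-02-20 (6542) 178MB classic
  1.27/candidate:        v1.27.11 2024-02-19 (6542) 178MB classic
  1.27/beta:             v1.27.11 2024-02-19 (6542) 178MB classic
  1.27/edge:             v1.27.12 2024-03-15 (6680) 178MB classic
"""

# Hypothetical cutoff: assumes the backport landed no later than the end of February.
ASSUMED_CUTOFF = date(2024, 2, 29)

for channel, version, released in channels_released_after(LISTING, ASSUMED_CUTOFF):
    print(channel, version, released.isoformat())
```

With that assumed cutoff, only 1.27/edge qualifies, which matches my reading of the dates.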

@jblawatt

jblawatt commented Jan 18, 2025

Hi, is there any update on this? I have the same problem with 1.32/stable on a single node cluster on Ubuntu 24.04.1 LTS. My journal is flooded with the error:

Jan 18 22:30:46 ... microk8s.daemon-k8s-dqlite[738050]: time="2025-01-18T22:30:46Z" level=error msg="failed to list /registry/events/homeassistant/ for revision 18090259"

I even deleted the whole homeassistant namespace without result. Is there any way to fix this?
