dqlite errors "failed to list /registry/events/<name>", cluster responding slow #4374

rlx-unlimited · 2024-01-18T14:02:52Z

Summary

the issue surfaced in several microk8s clusters. the errors appear in the system log more frequently as the response time of the cluster gets slower. in my test scenario the performance of the cluster decreased to the point where a node was marked "not ready".

i have tried a number of things to mitigate the error, but up to now, with little success:

removing the nodes in question (leave/remove), resetting microk8s on the node and rejoining
removing all the nodes but one (leave/remove)
following the restore-quorum procedure: https://microk8s.io/docs/restore-quorum

maybe i am mistaken, but i have the impression that the dqlite database is somehow corrupted and that the error / skew increases over time.

in the process i learned a lot about k8s backends and how the cluster state is stored, but i was unable to fix this. once the cluster was completely unresponsive i restarted the cluster as a single node on different hardware, exported everything and imported the complete cluster into a clean install (which was the solution given in some of the posts about similar issues with microk8s / dqlite).

Reproduction Steps

i tried to reproduce the symptoms with a test cluster i created for that purpose. i was able to reproduce the issue by adding a very slow control plane node and by performing an unclean shutdown on a control plane node in a different installation.

the result is this line spamming syslog:
microk8s.daemon-k8s-dqlite[663]: time="2024-01-18T14:39:04+01:00" level=error msg="failed to list /registry/events// for revision 83525265"

at the end the cluster is in an unusable state.

the frequency increases with the number of object changes that happen in the cluster (new deployments, new namespaces, deletion of pods...).
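One way to put a number on that frequency is to count the error lines per minute. The following is a minimal sketch, not part of the original report: the function name `count_list_errors_per_minute` is hypothetical, and it assumes every line carries a `time="YYYY-MM-DDTHH:MM:SS..."` field like the log line quoted above.

```shell
# Count k8s-dqlite "failed to list" errors per minute.
# Reads log lines on stdin, prints "<YYYY-MM-DDTHH:MM> <count>" per minute.
# Assumption: each line has a time="..." field in the format shown in the
# log line quoted above.
count_list_errors_per_minute() {
  grep 'msg="failed to list ' \
    | sed -n 's/.*time="\([0-9-]*T[0-9][0-9]:[0-9][0-9]\).*/\1/p' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

It could be fed from the journal, for example with `journalctl -u snap.microk8s.daemon-k8s-dqlite --no-pager | count_list_errors_per_minute`. The unit name here is an assumption based on the snap service naming and may differ on your system.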

FIX?

is it possible to fix this? at this point i have a backup of the complete cluster state, and i am willing to try out suggestions. there does not seem to be a good error recovery / restore procedure for dqlite atm (for example, dumping the entire database and importing it into a clean one).

thanks for the hard work. any feedback is highly appreciated.

rlx-unlimited · 2024-02-09T09:49:42Z

does anybody have a similar problem or an idea about the cause of the issue?

ktsakalozos · 2024-02-09T11:07:17Z

Hi @alex-s-team this smells similar to #3227. The root cause could be the same in both cases: a slow node. We are reviewing PR canonical/k8s-dqlite#83. Would you be interested in giving it a go and providing some feedback? Try it with: sudo snap install microk8s --classic --channel=latest/edge/dqlite-list

rlx-unlimited · 2024-02-12T11:57:11Z

thanks for the update. i cannot try the patch on the production cluster, but i will try on a test cluster. after reading #3227 it definitely smells like the same issue. for the time being i mitigated the problem by draining and tainting the control plane nodes (should have done that anyhow) and moving the storage of the control plane nodes to dedicated nvmes.
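For readers who want to try the same mitigation, here is a hedged sketch of what draining and tainting a control plane node could look like. The comment above does not say which taint was used, so the key `node-role.kubernetes.io/control-plane` and the `NoSchedule` effect are assumptions, and the helper name is hypothetical. The function only prints the commands so they can be reviewed before anything is run.

```shell
# Print (do not run) the commands to drain a node and taint it so that
# regular workloads are no longer scheduled there.
# Assumption: the taint key and effect are illustrative, not taken from the
# thread. $1 is the node name.
print_isolate_control_plane_cmds() {
  printf 'microk8s kubectl drain %s --ignore-daemonsets\n' "$1"
  printf 'microk8s kubectl taint nodes %s node-role.kubernetes.io/control-plane:NoSchedule\n' "$1"
}
```

Calling `print_isolate_control_plane_cmds k8s01` prints the two commands for a node named `k8s01` (a placeholder). Note that a drained node stays cordoned until `kubectl uncordon` is run on it.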

dfry · 2024-03-01T08:51:10Z

Hi @ktsakalozos, I want to test this fix. In which release is it available? I tried using latest/edge but it seems broken for my installation (it gives an error on the microk8s join command).

itsyoshio · 2024-03-04T19:29:29Z

> Hi @alex-s-team this smells similar to #3227. The root cause could be the same in both cases: a slow node. We are reviewing PR canonical/k8s-dqlite#83. Would you be interested in giving it a go and providing some feedback? Try it with: sudo snap install microk8s --classic --channel=latest/edge/dqlite-list

Switching from 1.28.3 to latest/edge/dqlite-list fixed my (mostly) unresponsive cluster completely.

sudo snap refresh microk8s --channel=latest/edge/dqlite-list

this also seems to be relevant for:

canonical/k8s-dqlite#82
#4307

alyssaruth · 2024-03-28T15:36:55Z

I believe we've hit this problem too - we've certainly observed the same dqlite logging:

Mar 27 16:41:10 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:10+01:00" level=error msg="failed to list /registry/metallb.io/l2advertisements/ for revision 833981"
Mar 27 16:41:10 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:10+01:00" level=error msg="failed to list /registry/statefulsets/ for revision 834818"
Mar 27 16:41:11 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:11+01:00" level=error msg="failed to list /registry/crd.projectcalico.org/bgppeers/ for revision 834055"
Mar 27 16:41:12 k8s09 microk8s.daemon-k8s-dqlite[36205]: time="2024-03-27T16:41:12+01:00" level=error msg="failed to list /registry/csinodes/ for revision 833911"
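Since the errors above hit several different registry paths, it can help to see which paths fail most often. This is a minimal sketch under the assumption that every error line has the `msg="failed to list <path> for revision <n>"` shape shown above; the function name `count_list_errors_by_path` is hypothetical.

```shell
# Group "failed to list" errors by registry path.
# Reads log lines on stdin, prints "<count> <path>", most frequent first.
# Assumption: messages have the shape
#   msg="failed to list <path> for revision <n>"
count_list_errors_by_path() {
  sed -n 's/.*msg="failed to list \([^ ]*\) for revision.*/\1/p' \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2,2 \
    | awk '{ print $1, $2 }'
}
```

If a single path dominates, that points at one noisy resource type; an even spread, as in the excerpt above, suggests the listing problem is not tied to a particular resource.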

Our cluster becomes unresponsive and we see maxed out swap usage and full memory+I/O pressure.

The dqlite PR was merged back in February, and PRs were merged to backport it into 1.27, 1.28 etc around then too. However, I believe based on dates we still don't have the fix - our cluster is running 1.27.11, and according to snap this is from 19th February:

  1.27/stable:           v1.27.11 2024-02-20 (6542) 178MB classic
  1.27/candidate:        v1.27.11 2024-02-19 (6542) 178MB classic
  1.27/beta:             v1.27.11 2024-02-19 (6542) 178MB classic
  1.27/edge:             v1.27.12 2024-03-15 (6680) 178MB classic

Looks like there was a 1.27.12 cut this month - am I right in thinking that --channel=1.27/edge should now be sufficient for pulling in this change? Is there any easy way for me to verify the installed version of k8s-dqlite directly? I've found the binary, but sadly it doesn't support a --version flag... and there aren't patch notes for minor versions of microk8s either 😭
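As a rough aid for the date comparison described above, the channel listing from `snap info microk8s` can be filtered by release date. This sketch only shows which channels were built on or after a given date. It does not prove that a particular k8s-dqlite change is included in them. The function name `channels_released_since` is hypothetical, and the cutoff date is something you supply, since the exact merge date is not stated in this thread.

```shell
# Print channels whose release date is on or after a cutoff date.
# Reads `snap info` channel lines on stdin; $1 is the cutoff (YYYY-MM-DD).
# Assumption: channel lines look like the listing above, e.g.
#   1.27/edge:   v1.27.12 2024-03-15 (6680) 178MB classic
channels_released_since() {
  awk -v cutoff="$1" '
    $1 ~ /:$/ && $3 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ \
      && ($3 "") >= (cutoff "") {
      sub(/:$/, "", $1)   # drop the trailing colon from the channel name
      print $1, $2, $3
    }'
}
```

For example, `snap info microk8s | channels_released_since 2024-03-01` would list only the channels published since the start of March 2024. ISO dates compare correctly as plain strings, which is why no date parsing is needed.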

jblawatt · 2025-01-18T22:35:48Z

Hi, is there any update on this? I have the same problem with 1.32/stable on a single node cluster on Ubuntu 24.04.1 LTS. My journal is flooded with the error:

Jan 18 22:30:46 ... microk8s.daemon-k8s-dqlite[738050]: time="2025-01-18T22:30:46Z" level=error msg="failed to list /registry/events/homeassistant/ for revision 18090259"

I even deleted the whole homeassistant namespace without result. Is there any way to fix this?
