Replies: 5 comments
-
I don't have enough information about your environment to determine what your different clusters might have in common. The fact that all the environments are doing it at the same time makes it pretty clear that there is a common environmental cause. What else is going on in your environment when the datastore latency increases? Backups? Storage maintenance? Periods of high load on some other system that shares infrastructure with your cluster? Higher than usual demand placed upon the Kubernetes datastore by similarly configured workloads?
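A practical first step is to pull the K3s journal from every affected node for the same time window and compare the timestamps of slow-transaction messages against backup and maintenance schedules. A minimal sketch, assuming K3s runs as the k3s systemd unit; the time window and the grep pattern are placeholders to adjust:

```sh
#!/bin/sh
# Collect K3s log lines that mention slow or failed datastore operations
# during a suspected incident window, so the timing can be compared across
# clusters and against backup/maintenance schedules.
# SINCE/UNTIL and the grep pattern are assumptions; adjust to the incident.
SINCE="2021-10-01 02:00:00"
UNTIL="2021-10-01 04:00:00"

journalctl -u k3s --since "$SINCE" --until "$UNTIL" \
  | grep -iE "slow|database|error" \
  > "k3s-datastore-incident-$(hostname).log"
```

Running the same window on all four clusters and diffing the resulting files makes it easier to see whether the spikes really start at the same moment everywhere.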
-
Thanks for your help. There aren't any similarly configured workloads that we know of (and no unusual load in the clusters before the increase in transaction times begins). We replaced the Galera cluster for the development environment with a single MariaDB node. Are there any known issues with K3s in combination with MariaDB Galera (2 or 3 nodes)?
Dev:
Playground:
-
In general, any multi-master SQL backend that has to replicate writes across multiple nodes adds overhead and can increase datastore transaction times.
That doesn't explain why performance drops on both clusters at the same time, though. There are no scheduled operations built into K3s, so I continue to suspect that it's something common to or shared between your environments.
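If Galera replication overhead is a suspect, the wsrep status counters on each database node show whether flow-control pauses or growing replication queues line up with the latency spikes. A sketch, assuming root access to MariaDB on the node; the credentials are placeholders:

```sh
#!/bin/sh
# Inspect Galera replication health on a MariaDB node. A non-zero
# wsrep_flow_control_paused value or growing send/receive queues would point
# at replication stalls in the datastore rather than at K3s itself.
mysql -u root -p -e "
  SHOW GLOBAL STATUS WHERE Variable_name IN (
    'wsrep_cluster_size',
    'wsrep_flow_control_paused',
    'wsrep_local_recv_queue_avg',
    'wsrep_local_send_queue_avg',
    'wsrep_local_cert_failures'
  );"
```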
-
Since we switched to single database nodes, our clusters haven't experienced any crashes.
This does not seem to affect the clusters in any way; all nodes work just fine.
-
I'm going to convert this to a discussion - it seems like there are some remaining questions but no bug.
-
Environmental Info:
K3s Version:
v1.21.5+k3s2
Node(s) CPU architecture, OS, and Version:
KVM:
OS type and version: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-65-generic x86_64)
CPU per node: 8
Memory per node: 8 GB
Server:
OS type and version: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-65-generic x86_64)
CPU per node: 32
Memory per node: 252 GB
Disk type: LVM, XFS, local disks
Network bandwidth and latency between the nodes: 2x 1GBit LACP, 0.07ms ping average
Underlying Infrastructure: Baremetal
Cluster Configuration:
Playground environment:
(Play-Rancher) Rancher: 2 masters (KVM)
(Play-App) Cluster imported in Rancher: 2 masters + 1 worker (KVM)
Galera as a config store.
Dev environment:
(Dev-Rancher) Rancher: 2 masters (KVM)
(Dev-App) Cluster imported in Rancher: 2 masters + 1 worker (Server)
Galera as a config store (different instance on different hardware, only similarity with playground is Fibre Channel).
The clusters use multiple Docker registries, which are shared between the environments.
Describe the bug:
From time to time, we see increases in the reported transaction time. This causes all four clusters to crash simultaneously.
We don't see any significant packet loss in the network and the clusters use databases on different hardware.
The connection limit of the databases is not reached. Transaction times are low most of the time (for weeks or months at a stretch), so we don't think this is a problem with the hardware we use.
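For reference, the connection-limit headroom can be checked directly on each MariaDB node; a sketch with placeholder credentials:

```sh
#!/bin/sh
# Compare current and peak client connections against the configured limit
# to confirm the databases are not running out of connection slots.
mysql -u root -p -e "
  SHOW GLOBAL STATUS WHERE Variable_name IN
    ('Threads_connected', 'Max_used_connections');
  SHOW GLOBAL VARIABLES LIKE 'max_connections';"
```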
Here are logs for all clusters:
(Dev-Rancher) KVM:
(Dev-App) Server:
(Play-App) KVM:
(Play-Rancher) KVM:
Steps To Reproduce:
We used the installation script and configured the data store like this:
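For illustration only, a typical invocation that points K3s at an external MySQL-compatible (MariaDB/Galera) datastore looks roughly like this; the hostname, credentials, and database name below are placeholders, not the values used in these clusters:

```sh
#!/bin/sh
# Illustrative sketch: install K3s as a server backed by an external
# MySQL-compatible datastore via --datastore-endpoint.
# Endpoint, credentials, and database name are placeholders.
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="mysql://k3s:password@tcp(galera.example.internal:3306)/kubernetes"
```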
Expected behavior:
The clusters should not crash, and K3s should log more information that can be used to track down the issue.
Actual behavior:
Some external dependency, or some aspect of the configuration the clusters have in common, causes the clusters to crash.