The etcd(v3.5.2) process often causes the server's iowait to become high #13879
Comments
I'm having trouble understanding the full setup. Is the etcd dataDir on a separate disk? I don't see how etcd could cause kube-scheduler or kube-controller to read X MB/s from disk...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
@dotbalo have you found a workaround for this?
@MohamedShaj It seems that this is not a problem with etcd but a problem with disk performance. We replaced the disks with the highest-performance SSDs and increased the storage space, then ran all the processes on the high-performance disks. Since then, iowait has rarely been high.
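For reference, the etcd project has published guidance on validating disk latency with fio before and after this kind of disk swap; a minimal sketch, assuming fio is installed and the test directory (a placeholder) sits on the same filesystem as /var/lib/etcd/wal:

# benchmark fdatasync latency with a WAL-like write pattern (~2.3 KB sequential writes)
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=wal-test

If the reported fdatasync percentiles are in the multi-millisecond range, the disk itself is the bottleneck.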
So even after changing to a high-I/O disk, you still see these logs sometimes, right? I was facing this log issue in a k3s cluster. I have a 3-node cluster where the master and worker run on the same node. In my case, I learned that the etcd data dir should be on a separate disk with good I/O, backed by better-performing CPU and memory. After all that tuning, the logs were drastically reduced; in a 4-hour window I now see only 2 or 3 "readIndex took a long time" logs. So I just want to understand: these logs will keep appearing now and then for one reason or another, but they won't make the node unstable, right?
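When those warnings do show up, etcd's disk-latency histograms are a quick way to tell whether storage is still the bottleneck; a sketch, assuming the plain-HTTP client URL on 127.0.0.1:2379 from the config further down is reachable:

# commonly cited guidance: p99 under roughly 10ms for WAL fsync, 25ms for backend commit
curl -s http://127.0.0.1:2379/metrics | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'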
What happened?
I deployed a k8s cluster with three master nodes and 15 worker nodes. The cluster runs about 367 nginx pods, just for testing. Three etcd nodes form a cluster, with etcd and the apiserver deployed on the same nodes. Recently we often find that iowait on a master node becomes high, and it returns to normal after restarting etcd. If I don't restart etcd, iowait never comes down.
I tried to find the root cause of the problem with iostat, top, iotop, and lsof, but none of them revealed it, and the problem appears irregularly.
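One way to attribute the reads to a specific process, not used in the original report, is per-process I/O accounting from the sysstat and iotop tools; a sketch:

# per-process disk I/O at one-second intervals; watch the kB_rd/s column
pidstat -d 1
# -o only active tasks, -P aggregate by process, -a accumulated totals
iotop -oPa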
What did you expect to happen?
I hope to find the root cause of the rise in iowait and fix it.
How can we reproduce it (as minimally and precisely as possible)?
Information about my deployed environment is as follows:
iowait becomes high from time to time, not often. But whenever it appears, the etcd process must be restarted for iowait to come back down.
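The restart-and-observe step looks roughly like this (a sketch; the unit name etcd matches the systemd unit shown below):

systemctl restart etcd
# confirm %iowait and sda's r/s drop back to normal
iostat -x sda 1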
Anything else we need to know?
Here are some records I checked:
iostat -x 1
It can be seen that the r/s of sda is saturated, even though the etcd data directory is /var/lib/etcd, which is on a separate SSD.
iotop --only
Although kube-apiserver and the controller are also high at this time, they return to normal after restarting etcd.
top
lsof, where 1338 is the PID of the main etcd process
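The per-PID form of that check would be along these lines (a sketch; 1338 is the etcd PID noted above):

# list every file the etcd process holds open, including WAL segments and the db file
lsof -p 1338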
Check iotop again after restarting etcd
Etcd version (please run commands below)
etcdctl is not installed
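Without etcdctl, the server version can still be read from the binary or from the plain-HTTP client URL in the config below (a sketch):

/usr/local/bin/etcd --version
curl -s http://127.0.0.1:2379/version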
Etcd configuration (command line flags or environment variables)
cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Service
Documentation=https://coreos.com/etcd/docs/latest/
After=network.target
[Service]
Type=notify
ExecStart=/usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
Restart=on-failure
RestartSec=10
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Alias=etcd3.service
cat /etc/etcd/etcd.config.yml
name: 'xxx-k8s-master02-168-0-22'
data-dir: /var/lib/etcd
wal-dir: /var/lib/etcd/wal
snapshot-count: 5000
heartbeat-interval: 100
election-timeout: 1000
quota-backend-bytes: 0
listen-peer-urls: 'https://192.168.0.22:2380'
listen-client-urls: 'https://192.168.0.22:2379,http://127.0.0.1:2379'
max-snapshots: 3
max-wals: 5
cors:
initial-advertise-peer-urls: 'https://192.168.0.22:2380'
advertise-client-urls: 'https://192.168.0.22:2379'
discovery:
discovery-fallback: 'proxy'
discovery-proxy:
discovery-srv:
initial-cluster: 'xxx-k8s-master01-168-0-21=https://10.97.43.21:2380,xxx-k8s-master02-168-0-22=https://192.168.0.22:2380,servicecloud-k8s-master03-168-0-23=https://10.97.43.23:2380'
initial-cluster-token: 'etcd-k8s-cluster'
initial-cluster-state: 'existing'
strict-reconfig-check: false
enable-v2: true
enable-pprof: true
proxy: 'off'
proxy-failure-wait: 5000
proxy-refresh-interval: 30000
proxy-dial-timeout: 1000
proxy-write-timeout: 5000
proxy-read-timeout: 0
client-transport-security:
cert-file: '/etc/kubernetes/pki/etcd/etcd.pem'
key-file: '/etc/kubernetes/pki/etcd/etcd-key.pem'
client-cert-auth: true
trusted-ca-file: '/etc/kubernetes/pki/etcd/etcd-ca.pem'
auto-tls: true
peer-transport-security:
cert-file: '/etc/kubernetes/pki/etcd/etcd.pem'
key-file: '/etc/kubernetes/pki/etcd/etcd-key.pem'
peer-client-cert-auth: true
trusted-ca-file: '/etc/kubernetes/pki/etcd/etcd-ca.pem'
auto-tls: true
debug: false
log-package-levels:
log-outputs: [default]
force-new-cluster: false
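As an aside, once etcdctl is available, per-member health and DB size can be checked by reusing the cert-file, key-file, and trusted-ca-file paths from this config (a sketch):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://192.168.0.22:2379 \
  --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  endpoint status -w table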
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output