3-node etcd cluster with two nodes at high CPU usage #11012

Closed · phosae opened this issue Aug 9, 2019 · 14 comments

@phosae commented Aug 9, 2019

System: AWS EC2, Linux 4.14.128-112.105 x86_64
etcd version: 3.3.11

Startup configuration:

etcd --name m1 --initial-advertise-peer-urls http://etcd1:2380 \
--listen-peer-urls http://etcd1:2380 \
--listen-client-urls http://etcd1:2379 \
--advertise-client-urls http://etcd1:2379 --debug=true --log-output=stdout \
--auth-token=jwt,pub-key=./jwt_RS256.pub,priv-key=./jwt_RS256.rsa,sign-method=RS256 \
--data-dir /data/etcd --initial-cluster-token my-cluster \
--initial-cluster m1=http://etcd1:2380,m2=http://etcd2:2380,m3=http://etcd3:2380 \
--initial-cluster-state new

The m2 and m3 startup commands are similar to m1's.
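
For reference, the pub-key / priv-key files above are an ordinary RSA key pair. One possible way to generate such a pair (illustrative only; the issue does not say how the keys were actually created, the file names are taken from the flags above):

openssl genrsa -out jwt_RS256.rsa 2048                    # private key used to sign JWTs
openssl rsa -in jwt_RS256.rsa -pubout -out jwt_RS256.pub  # matching public key for verification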

# we use the etcd gateway
/usr/local/bin/etcd gateway start --endpoints=etcd1:2379,etcd2:2379,etcd3:2379
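
A quick sanity check that client traffic through the gateway works (sketch only; the gateway's default listen address 127.0.0.1:23790 and the root credentials are assumptions, the thread later uses http://gateway:23790):

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:23790 \
  --user root:<password> endpoint health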

top output:

PID   USER  PR   NI  VIRT   RES   SHR    S  %CPU   %MEM  TIME+      COMMAND
10297 root  20   0   11.2g  94708 17812  R  394.0  0.6   963:57.14  etcd

PID   USER  PR   NI  VIRT   RES    SHR   S  %CPU   %MEM  TIME+    COMMAND
7787  root  20   0   11.2g  91764  17760 S  393.4  0.6   1019:19  etcd

PID   USER   PR  NI   VIRT   RES    SHR   S   %CPU  %MEM  TIME+   COMMAND
12748 root   20   0   11.1g  55472  17724 S   0.3   0.3   1:31.41 etcd
(pprof) top10
Showing nodes accounting for 65960ms, 87.63% of 75270ms total
Dropped 349 nodes (cum <= 376.35ms)
Showing top 10 nodes out of 72
      flat  flat%   sum%        cum   cum%
   18990ms 25.23% 25.23%    18990ms 25.23%  math/big.mulAddVWW
   14630ms 19.44% 44.67%    14630ms 19.44%  math/big.addMulVVW
   12920ms 17.16% 61.83%    12920ms 17.16%  math/big.subVV
   12220ms 16.23% 78.07%    47160ms 62.65%  math/big.nat.divLarge
    2470ms  3.28% 81.35%    17210ms 22.86%  math/big.basicSqr
    1150ms  1.53% 82.87%     1980ms  2.63%  runtime.scanobject
    1100ms  1.46% 84.34%     1100ms  1.46%  math/big.shlVU
     890ms  1.18% 85.52%      890ms  1.18%  runtime.memclrNoHeapPointers
     820ms  1.09% 86.61%      820ms  1.09%  math/big.greaterThan (inline)
     770ms  1.02% 87.63%      770ms  1.02%  math/big.shrVU
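
The hot functions are all math/big big-integer primitives, which in a Go server almost always means RSA arithmetic; with --auth-token=jwt,...,sign-method=RS256 each Authenticate call signs an RS256 (RSA) JWT and each authenticated request verifies one, so the auth path is the prime suspect. For reference, a symbolized CPU profile like the one above can be captured roughly as follows (sketch only; it assumes the member exposes the Go /debug/pprof endpoints, e.g. by starting etcd with --enable-pprof, and reuses the etcd1 endpoint from the configuration above):

go tool pprof -seconds 30 http://etcd1:2379/debug/pprof/profile
# then, at the interactive prompt:
#   (pprof) top10
#   (pprof) peek math/big    # shows the callers, e.g. crypto/rsa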

This happened the day after I changed the root password through the gateway (though this may not be the direct cause):
echo "newpass" | etcdctl user passwd root --user="root:root" --interactive=false --endpoints="http://gateway:23790"

@hexfusion (Contributor)

logs of the etcd instances would be helpful here.

@phosae (Author) commented Aug 10, 2019

> logs of the etcd instances would be helpful here.

Around 2019-08-09 10:50 the cluster became unstable and the leader changed frequently, as shown in the log details below.
We restarted the cluster (including the gateway) around 11:50 and later cut off all the clients; the logs went back to normal, but CPU and virtual memory usage stayed unbelievably high.

We enabled debug logging on etcd1:
----------------------------------lots of these warning logs-------------------------------------
2019-08-09 10:57:50.500232 [pkg_logger.go:114] W | etcdserver: request "header:<ID:17006274213493527777 > authenticate:<name:root simple_token:>" with result "size:796" took too long (331.139139ms) to execute
2019-08-09 10:57:51.131115 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 71.742296ms)
2019-08-09 10:57:51.131153 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.131165 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 71.800355ms)
2019-08-09 10:57:51.131173 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.644504 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 123.516079ms)
2019-08-09 10:57:51.644542 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.644554 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 123.578241ms)
2019-08-09 10:57:51.644562 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.758637 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:19754
2019-08-09 10:57:52.253125 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2__
uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6
PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbf
Fkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcY
G2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
2019-08-09 10:57:52.253169 [pkg_logger.go:147] D | auth: authorized root, token is eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpN
OLIaJmrc4l3Y2__uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
2019-08-09 10:57:52.253239 [pkg_logger.go:114] W | etcdserver: request "header:<ID:17006274213493527779 > authenticate:<name:root simple_token:>" with result "size:796" took too long (369.255968ms) to execute
...
2019-08-09 10:59:04.482563 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.105602ms)
2019-08-09 10:59:04.482592 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:59:04.482602 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.154347ms)
...
2019-08-09 11:47:33.391078 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.391130 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.511899 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.511955 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.592380 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.592432 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.713162 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.713206 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
----------restarted the etcd cluster around 11:50; the log looks quite normal, but CPU and virtual memory usage is still high-------------
2019-08-09 11:50:50.185269 [pkg_logger.go:147] D | rafthttp: failed to dial e5a61e8de658ec02 on stream MsgApp v2 (dial tcp etcd2:2380: connect: connection refused)
2019-08-09 11:50:50.185407 [pkg_logger.go:147] D | rafthttp: failed to dial e5a61e8de658ec02 on stream Message (dial tcp etcd3:2380: connect: connection refused)
2019-08-09 11:50:50.268598 [pkg_logger.go:134] I | raft: d9bfe1930e475fc is starting a new election at term 4366
2019-08-09 11:50:50.268620 [pkg_logger.go:134] I | raft: d9bfe1930e475fc became candidate at term 4367

2019-08-09 11:54:06.674422 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:21.674427 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:36.674458 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:51.674488 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:06.674434 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:21.637016 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2__uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI

--------error log on etcd2: lots of goroutine dumps like this---------
goroutine 50862 [select, 1 minutes]:
runtime.gopark(0x5633e10783f0, 0x0, 0x5633e09fda66, 0x6, 0x18, 0x1)
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/proc.go:291 +0x120 fp=0xc46b51f678 sp=0xc46b51f658 pc=0x5633e00cefb0
runtime.selectgo(0xc46b51f8e0, 0xc46b0e5080)
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/select.go:392 +0xe56 fp=0xc46b51f850 sp=0xc46b51f678 pc=0x5633e00ded06
github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch(0xc4205f3d00, 0x5633e108f560, 0xc46aa79070, 0x5633e107caa0, 0xc4205f3d00)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:151 +0x39b fp=0xc46b51fa08 sp=0xc46b51f850 pc=0x5633e08f543b
github.com/coreos/etcd/etcdserver/etcdserverpb._Watch_Watch_Handler(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x5633e00b2c4a, 0x20)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3481 +0xb4 fp=0xc46b51fa50 sp=0xc46b51fa08 pc=0x5633e04e0054
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x19, 0xc46b40afa0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:71 +0xf7 fp=0xc46b51faa8 sp=0xc46b51fa50 pc=0x5633e08c8c47
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x5633e15dd638, 0xc46aeb42c0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:125 +0x12c fp=0xc46b51fb40 sp=0xc46b51faa8 pc=0x5633e08cc8ac
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0x5633e15de820, 0xc46aeb6300)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:74 +0x9e fp=0xc46b51fb98 sp=0xc46b51fb40 pc=0x5633e08c8bee
github.com/coreos/etcd/etcdserver/api/v3rpc.newStreamInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x0, 0x0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:212 +0xf4 fp=0xc46b51fc28 sp=0xc46b51fb98 pc=0x5633e08f7c04
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0x5633e1075158, 0x5633e04e0164, 0xc460c10bc0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:79 +0x168 fp=0xc46b51fc98 sp=0xc46b51fc28 pc=0x5633e08c8de8
google.golang.org/grpc.(*Server).processStreamingRPC(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0xc4205474a0, 0x5633e15681c0, 0x0, 0x0, 0x0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:957 +0x2cf fp=0xc46b51fdc8 sp=0xc46b51fc98 pc=0x5633e04b315f
google.golang.org/grpc.(*Server).handleStream(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0x0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:1044 +0x12b3 fp=0xc46b51ff80 sp=0xc46b51fdc8 pc=0x5633e04b4cb3
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc421783690, 0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:589 +0xa1 fp=0xc46b51ffb8 sp=0xc46b51ff80 pc=0x5633e04bad71
runtime.goexit()
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc46b51ffc0 sp=0xc46b51ffb8 pc=0x5633e00fcad1
created by google.golang.org/grpc.(*Server).serveStreams.func1
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:587 +
--------------------etcd3 only has warning logs-------------------------
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport

@FengXingYuXin

The log is probably the result, not the cause. You can run perf top -p <pid> to track down the CPU hotspot.
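
Concretely, that suggestion amounts to something like the following (the PID 10297 is just an example value taken from the earlier top output):

sudo perf top -p 10297 --sort comm,dso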

@phosae (Author) commented Aug 20, 2019

Can somebody help with this issue?
The perf top output is here:
$ perf top --sort comm,dso
99.34% etcd etcd
0.24% perf perf
0.10% perf [kernel]
$ perf top
9.33% etcd [.] 0x00000000006315cd
8.77% etcd [.] _start
2.54% etcd [.] 0x0000000000640a81
2.49% etcd [.] 0x000000000064081f
2.40% etcd [.] 0x0000000000640b48
2.19% etcd [.] 0x0000000000640ad1
2.19% etcd [.] 0x0000000000640aba
2.16% etcd [.] 0x0000000000640a97
2.11% etcd [.] 0x0000000000640824
2.09% etcd [.] 0x0000000000640b5e
2.03% etcd [.] 0x0000000000640b7e
1.94% etcd [.] 0x0000000000640aa3
1.94% etcd [.] 0x0000000000640ac5
1.89% etcd [.] 0x0000000000640811
1.88% etcd [.] 0x0000000000640807
1.83% etcd [.] 0x0000000000640a8c
1.80% etcd [.] 0x0000000000640aae
1.45% etcd [.] 0x0000000000640b6b
1.44% etcd [.] 0x0000000000640815
1.41% etcd [.] 0x000000000064081a
1.33% etcd [.] 0x0000000000640a7a
1.24% etcd [.] 0x0000000000640b53
1.15% etcd [.] 0x0000000000640b50
1.02% etcd [.] 0x0000000000640a84
1.01% etcd [.] 0x00000000006407fe
1.01% etcd [.] 0x0000000000640b4c
1.00% etcd [.] 0x00000000006407eb
0.91% etcd [.] 0x0000000000640b45
0.88% etcd [.] 0x0000000000640a9e
0.88% etcd [.] 0x000000000064080c
0.84% etcd [.] 0x0000000000640b57
0.81% [kernel] [k] __softirqentry_text_start
0.74% etcd [.] 0x0000000000640b76
0.72% etcd [.] 0x00000000006409c0
0.70% etcd [.] 0x0000000000640b6f
0.65% etcd [.] 0x00000000006409c4
0.64% etcd [.] 0x00000000006407f9
0.63% etcd [.] 0x0000000000640802
0.60% etcd [.] 0x0000000000640b79
0.60% etcd [.] 0x0000000000640a88
0.59% etcd [.] 0x0000000000640a9a
0.58% [kernel] [k] run_timer_softirq
0.56% etcd [.] 0x0000000000640b66
0.54% etcd [.] 0x0000000000640ab5
0.54% etcd [.] 0x0000000000640acc
0.53% etcd [.] 0x0000000000640b72
0.49% etcd [.] 0x000000000063147c
0.41% etcd [.] 0x0000000000640a7e
0.41% etcd [.] 0x00000000006407e8
0.38% etcd [.] 0x0000000000640ab1
0.35% etcd [.] 0x0000000000640a94
0.34% etcd [.] 0x0000000000640a22
0.32% etcd [.] 0x000000000062fdfd
0.31% etcd [.] 0x00000000006315eb
0.30% etcd [.] 0x0000000000640ac8
0.30% etcd [.] 0x00000000006313cf
0.30% etcd [.] 0x0000000000640b63
0.29% etcd [.] 0x0000000000640a26
0.28% [kernel] [k] __lock_text_start
0.27% etcd [.] 0x00000000006313d4
0.27% etcd [.] 0x0000000000631528
0.27% etcd [.] 0x000000000062fee6
0.26% etcd [.] 0x000000000057b1c7
0.25% etcd [.] 0x0000000000631457
0.24% etcd [.] 0x00000000006407ef
0.19% etcd [.] 0x000000000063140a
0.18% etcd [.] 0x000000000063155b
0.17% etcd [.] 0x0000000000640b41

$ perf top -g
Children Self Shared Object Symbol

  • 34.94% 0.00% etcd [.] 0xffffaa3397c55ad1
  • 34.89% 0.00% etcd [.] 0xffffaa3398013d71
  • 34.89% 0.00% etcd [.] 0xffffaa339800dd1a
  • 34.89% 0.00% etcd [.] 0xffffaa339800ab4a
  • 34.89% 0.00% etcd [.] 0xffffaa3398038029
  • 34.89% 0.00% etcd [.] 0xffffaa33984219ee
  • 34.89% 0.00% etcd [.] 0xffffaa339845092d
  • 34.89% 0.00% etcd [.] 0xffffaa33984217b1
  • 34.89% 0.00% etcd [.] 0xffffaa339845073c
  • 34.89% 0.00% etcd [.] 0xffffaa339842562d
  • 34.89% 0.00% etcd [.] 0xffffaa339842182a
  • 34.89% 0.00% etcd [.] 0xffffaa3398084ba8
  • 34.89% 0.00% etcd [.] 0xffffaa339845312c
  • 34.89% 0.00% etcd [.] 0xffffaa33984483ec
  • 34.89% 0.00% etcd [.] 0xffffaa33983a697c
  • 34.82% 0.00% etcd [.] 0xffffaa33983aa9f5
  • 34.82% 0.00% etcd [.] 0xffffaa33983abff4
  • 34.80% 0.00% etcd [.] 0xffffaa3398130647
  • 34.80% 0.00% etcd [.] 0xffffaa339812da16
  • 34.76% 0.00% etcd [.] 0xffffaa33981260cb
  • 34.76% 0.00% etcd [.] 0xffffaa33981216a5
  • 34.74% 0.00% etcd [.] 0xffffaa339811eb89
  • 32.92% 0.00% etcd [.] 0xffffaa339811f499
  • 32.52% 0.00% etcd [.] 0xffffaa339811ff72
  • 32.06% 0.00% etcd [.] 0xffffaa3397d64538
  • 32.03% 0.00% etcd [.] 0xffffaa3397d673c5
  • 32.02% 0.00% etcd [.] 0xffffaa3397d027d8
  • 22.61% 0.00% etcd [.] 0xffffaa3397d0e457
  • 22.60% 0.04% perf [.] hist_entry_iter__add
  • 22.59% 0.00% etcd [.] 0xffffaa3397d0be6b
  • 12.14% 0.00% libc-2.26.so [.] __libc_start_main
  • 12.14% 0.00% [unknown] [.] 0xd856258d4c544155
  • 12.14% 0.00% perf [.] main
  • 12.14% 0.00% perf [.] 0xffffffffffc9aae1
  • 12.14% 0.00% perf [.] 0xffffffffffc9a7df
  • 12.14% 0.00% perf [.] cmd_top
  • 12.04% 0.00% perf [.] 0xffffffffffc4ce65
  • 7.91% 7.01% etcd [.] _start
  • 7.85% 0.00% etcd [.] 0xffffaa3397d0e34e
  • 7.69% 0.00% etcd [.] 0xffffaa3397d0b1fb
  • 7.47% 0.06% perf [.] callchain_append
  • 7.12% 7.12% etcd [.] 0x00000000006315cd

$ perf top -e cycles:k
    20.00% [kernel] [k] finish_task_switch
    14.74% [kernel] [k] __lock_text_start
    5.85% [kernel] [k] __softirqentry_text_start
    5.82% [kernel] [k] tick_nohz_idle_enter
    4.61% [kernel] [k] do_syscall_64
    3.34% [kernel] [k] get_user_pages_fast
    3.22% [kernel] [k] __sched_text_start
    2.04% [kernel] [k] tick_nohz_idle_exit
    1.53% [kernel] [k] __audit_syscall_entry
    1.32% [kernel] [k] ena_io_poll
    1.01% [kernel] [k] syscall_trace_enter
    0.97% [kernel] [k] futex_wait_setup
    0.89% [kernel] [k] exit_to_usermode_loop
    0.88% [kernel] [k] __netif_receive_skb_core
    0.86% [kernel] [k] schedule
    0.73% [kernel] [k] mark_wake_futex
    0.69% [kernel] [k] __audit_syscall_exit
    0.62% [kernel] [k] futex_wait_queue_me
    0.61% [kernel] [k] do_idle
    0.56% [kernel] [k] __indirect_thunk_start
    0.54% [kernel] [k] memset_erms
    0.51% [kernel] [k] get_futex_key
    0.51% [kernel] [k] __do_page_fault
    0.49% [kernel] [k] validate_xmit_skb
    0.49% [kernel] [k] dm_rh_inc_pending
    0.48% [kernel] [k] mutex_lock
    0.47% [kernel] [k] sys_pselect6
    0.46% [kernel] [k] __vfs_read
    0.46% [kernel] [k] dev_gro_receive
    0.44% [kernel] [k] copy_user_generic_unrolled
    0.43% [kernel] [k] poll_schedule_timeout
    0.42% [kernel] [k] kmem_cache_free
    0.42% [kernel] [k] futex_wait
    0.40% [kernel] [k] quiet_vmstat
    0.39% [kernel] [k] find_first_bit

@jingyih (Contributor) commented Aug 21, 2019

As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.
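
One way to do that check from a single machine (sketch only; the endpoint names come from the startup configuration above, and the root credentials are assumed because auth is enabled):

ETCDCTL_API=3 etcdctl \
  --endpoints=http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
  --user root:<password> \
  endpoint status --write-out=table
# the table lists each member's ID, server version, DB size and leader status,
# so an unreachable or lagging member shows up immediately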

@phosae (Author) commented Aug 26, 2019

> As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.

Every time we restart the cluster, each node is up and running and they are connected to each other. But several days later, the CPU overload makes the nodes fail to receive heartbeats from each other.

The only thing we can do is restart the cluster, but that does not bring the high CPU and virtual memory usage back down.

@weopm commented Jan 8, 2020

This bug was fixed in #10218.

@phosae (Author) commented Jan 8, 2020

> This bug was fixed in #10218.

Thanks, this problem hasn't reappeared since we upgraded our etcd cluster.

phosae closed this as completed on Jan 8, 2020
@liujjcdl

@weopm May I know if etcd 3.3.12 still has this issue? We have a similar high CPU usage issue on etcd 3.3.12. Thanks!

@weopm commented Jan 10, 2020

This bug still exists in etcd 3.3.18. Maybe you can apply a patch based on https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560 or upgrade to etcd 4.4.3.

@liujjcdl

@weopm thanks for the quick reply. To double-confirm, should it be 3.4.3 that has the fix?

@weopm commented Jan 10, 2020

This bug is fixed in etcd 4.4.3, not 3.4.3. You can review and confirm the code yourself at https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560.

@cjquinon

@weopm - We're seeing this issue in production workloads on 3.3.12. I noticed that there was a cherry-pick commit for 3.4, which was merged, and a cherry-pick for 3.3, which was closed and not merged.

Is it possible to get this fix into 3.3 and create a new patch release?

I do see it's in the v3.4.3 tag as well: https://github.com/etcd-io/etcd/blob/v3.4.3/etcdserver/v3_server.go#L607

@cjquinon

PR to cherry-pick into 3.3: #11378
