3-node etcd cluster with two nodes at high CPU usage #11012

Closed · phosae opened this issue Aug 9, 2019 · 14 comments

@phosae commented Aug 9, 2019

System: AWS EC2, Linux 4.14.128-112.105 x86_64
etcd version: 3.3.11

Startup configuration:

etcd --name m1 --initial-advertise-peer-urls http://etcd1:2380 \
--listen-peer-urls http://etcd1:2380 \
--listen-client-urls http://etcd1:2379 \
--advertise-client-urls http://etcd1:2379 --debug=true --log-output=stdout \
--auth-token=jwt,pub-key=./jwt_RS256.pub,priv-key=./jwt_RS256.rsa,sign-method=RS256 \
--data-dir /data/etcd --initial-cluster-token my-cluster \
--initial-cluster m1=http://etcd1:2380,m2=http://etcd2:2380,m3=http://etcd3:2380 \
--initial-cluster-state new

The m2 and m3 startup commands are similar to m1's.
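
For reference, the pub-key / priv-key files above are an ordinary RSA key pair. One possible way to generate such a pair (illustrative only; the issue does not say how the keys were actually created, the file names are taken from the flags above):

openssl genrsa -out jwt_RS256.rsa 2048                    # private key used to sign JWTs
openssl rsa -in jwt_RS256.rsa -pubout -out jwt_RS256.pub  # matching public key for verification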

# we use the etcd gateway
/usr/local/bin/etcd gateway start --endpoints=etcd1:2379,etcd2:2379,etcd3:2379
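
A quick sanity check that client traffic through the gateway works (sketch only; the gateway's default listen address 127.0.0.1:23790 and the root credentials are assumptions, the thread later uses http://gateway:23790):

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:23790 \
  --user root:<password> endpoint health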

top output:

PID   USER  PR   NI  VIRT   RES   SHR    S  %CPU   %MEM  TIME+      COMMAND
10297 root  20   0   11.2g  94708 17812  R  394.0  0.6   963:57.14  etcd

PID   USER  PR   NI  VIRT   RES    SHR   S  %CPU   %MEM  TIME+    COMMAND
7787  root  20   0   11.2g  91764  17760 S  393.4  0.6   1019:19  etcd

PID   USER   PR  NI   VIRT   RES    SHR   S   %CPU  %MEM  TIME+   COMMAND
12748 root   20   0   11.1g  55472  17724 S   0.3   0.3   1:31.41 etcd
(pprof) top10
Showing nodes accounting for 65960ms, 87.63% of 75270ms total
Dropped 349 nodes (cum <= 376.35ms)
Showing top 10 nodes out of 72
      flat  flat%   sum%        cum   cum%
   18990ms 25.23% 25.23%    18990ms 25.23%  math/big.mulAddVWW
   14630ms 19.44% 44.67%    14630ms 19.44%  math/big.addMulVVW
   12920ms 17.16% 61.83%    12920ms 17.16%  math/big.subVV
   12220ms 16.23% 78.07%    47160ms 62.65%  math/big.nat.divLarge
    2470ms  3.28% 81.35%    17210ms 22.86%  math/big.basicSqr
    1150ms  1.53% 82.87%     1980ms  2.63%  runtime.scanobject
    1100ms  1.46% 84.34%     1100ms  1.46%  math/big.shlVU
     890ms  1.18% 85.52%      890ms  1.18%  runtime.memclrNoHeapPointers
     820ms  1.09% 86.61%      820ms  1.09%  math/big.greaterThan (inline)
     770ms  1.02% 87.63%      770ms  1.02%  math/big.shrVU
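
The hot functions are all math/big big-integer primitives, which in a Go server almost always means RSA arithmetic; with --auth-token=jwt,...,sign-method=RS256 each Authenticate call signs an RS256 (RSA) JWT and each authenticated request verifies one, so the auth path is the prime suspect. For reference, a symbolized CPU profile like the one above can be captured roughly as follows (sketch only; it assumes the member exposes the Go /debug/pprof endpoints, e.g. by starting etcd with --enable-pprof, and reuses the etcd1 endpoint from the configuration above):

go tool pprof -seconds 30 http://etcd1:2379/debug/pprof/profile
# then, at the interactive prompt:
#   (pprof) top10
#   (pprof) peek math/big    # shows the callers, e.g. crypto/rsa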

This happened the day after I changed the root password through the gateway (though this may not be the direct cause):
echo "newpass" | etcdctl user passwd root --user="root:root" --interactive=false --endpoints="http://gateway:23790"

@hexfusion (Contributor)

logs of the etcd instances would be helpful here.

@phosae (Author) commented Aug 10, 2019

> logs of the etcd instances would be helpful here.

Around 2019-08-09 10:50 the cluster became unstable and the leader changed frequently, as shown in the log details below.
We restarted the cluster (including the gateway) around 11:50 and later cut off all the clients; the logs went back to normal, but CPU and virtual memory usage stayed unbelievably high.

We enabled debug logging on etcd1:
----------------------------------lots of these warning logs-------------------------------------
2019-08-09 10:57:50.500232 [pkg_logger.go:114] W | etcdserver: request "header:<ID:17006274213493527777 > authenticate:<name:root simple_token:>" with result "size:796" took too long (331.139139ms) to execute
2019-08-09 10:57:51.131115 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 71.742296ms)
2019-08-09 10:57:51.131153 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.131165 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 71.800355ms)
2019-08-09 10:57:51.131173 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.644504 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 123.516079ms)
2019-08-09 10:57:51.644542 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.644554 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 123.578241ms)
2019-08-09 10:57:51.644562 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:57:51.758637 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:19754
2019-08-09 10:57:52.253125 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2__
uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6
PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbf
Fkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcY
G2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
2019-08-09 10:57:52.253169 [pkg_logger.go:147] D | auth: authorized root, token is eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpN
OLIaJmrc4l3Y2__uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI
2019-08-09 10:57:52.253239 [pkg_logger.go:114] W | etcdserver: request "header:<ID:17006274213493527779 > authenticate:<name:root simple_token:>" with result "size:796" took too long (369.255968ms) to execute
...
2019-08-09 10:59:04.482563 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.105602ms)
2019-08-09 10:59:04.482592 [pkg_logger.go:114] W | etcdserver: server is likely overloaded
2019-08-09 10:59:04.482602 [pkg_logger.go:114] W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 45.154347ms)
...
2019-08-09 11:47:33.391078 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.391130 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.511899 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.511955 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.592380 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.592432 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
2019-08-09 11:47:33.713162 [pkg_logger.go:147] D | rafthttp: dropped MsgHeartbeat to e5a61e8de658ec02 since pipeline's sending buffer is full
2019-08-09 11:47:33.713206 [pkg_logger.go:147] D | raft: d9bfe1930e475fc failed to send message to e5a61e8de658ec02 because it is unreachable [next = 7000, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]
----------restarted the etcd cluster around 11:50; the log looks quite normal, but CPU and virtual memory usage is still high-------------
2019-08-09 11:50:50.185269 [pkg_logger.go:147] D | rafthttp: failed to dial e5a61e8de658ec02 on stream MsgApp v2 (dial tcp etcd2:2380: connect: connection refused)
2019-08-09 11:50:50.185407 [pkg_logger.go:147] D | rafthttp: failed to dial e5a61e8de658ec02 on stream Message (dial tcp etcd3:2380: connect: connection refused)
2019-08-09 11:50:50.268598 [pkg_logger.go:134] I | raft: d9bfe1930e475fc is starting a new election at term 4366
2019-08-09 11:50:50.268620 [pkg_logger.go:134] I | raft: d9bfe1930e475fc became candidate at term 4367

2019-08-09 11:54:06.674422 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:21.674427 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:36.674458 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:54:51.674488 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:06.674434 [pkg_logger.go:147] D | etcdserver/api/v2http: [GET] /metrics remote:prometheus:40558
2019-08-09 11:55:21.637016 [pkg_logger.go:147] D | auth: jwt token: eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJyZXZpc2lvbiI6MjQsInVzZXJuYW1lIjoicm9vdCJ9.qpsWIpNOLIaJmrc4l3Y2__uI9S9hAwGwpKo7sEj2iRcx2l6rMZKlP6LeKmlGKx91t-R5rCE56EqhbHD0W9h2KYwxexnLAE3BIDYsOVcisT7TkkGfmC1H-47PY7wq7a3Xq3Aitp4pTubEFLrL1Vph2ABwJQrn3No7pg9-Lid5_WezXeXpR4noFyX_fmcm7ZrhK6PoO_M2yMSWr4LjhgDs8tM7odDskR5R-EJI-P7QbtI0Hp8a549069Xzv70x0njuF_YoB-bgY9dwXvMUX609VSGHX-oyMzRfb6c559LM_iheaR2ShLqA8t6D4KpW8_OWgxBs8JTyKdOveu2NAhuqOQC8-uibGIE3VHWMagyEwXMbbfFkf24V9ksFkLus8MIWSFyu3xvq1mFcgwqxOZCAl1_uUGK8IZFVm6sY9k1Knvzd2xotIRsA4F1brA7tQjtU9v4C6CbVHbRXWXJGtOQaOIC5n7eH9Sobe9R54isClTlLVuZgLnxWV0GPHEc_cEmP5nElC_7GpKRBhWFO9Jp2W7bzcYG2atQfALenxriwXcy2EzLh5NEgPC5WWs9e_ImWvyfh-ehFMj-QZAtngBPbMg9OKNbcr0R46ReR6oIL0n-1-Afh_7AqqGpgeslGjPbl56RRJM4fA-sCEK6aw29pYGCMgZWdmWA4O7wv-R25ppI

--------error log on etcd2: lots of goroutine dumps like this---------
goroutine 50862 [select, 1 minutes]:
runtime.gopark(0x5633e10783f0, 0x0, 0x5633e09fda66, 0x6, 0x18, 0x1)
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/proc.go:291 +0x120 fp=0xc46b51f678 sp=0xc46b51f658 pc=0x5633e00cefb0
runtime.selectgo(0xc46b51f8e0, 0xc46b0e5080)
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/select.go:392 +0xe56 fp=0xc46b51f850 sp=0xc46b51f678 pc=0x5633e00ded06
github.com/coreos/etcd/etcdserver/api/v3rpc.(*watchServer).Watch(0xc4205f3d00, 0x5633e108f560, 0xc46aa79070, 0x5633e107caa0, 0xc4205f3d00)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/watch.go:151 +0x39b fp=0xc46b51fa08 sp=0xc46b51f850 pc=0x5633e08f543b
github.com/coreos/etcd/etcdserver/etcdserverpb._Watch_Watch_Handler(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x5633e00b2c4a, 0x20)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3481 +0xb4 fp=0xc46b51fa50 sp=0xc46b51fa08 pc=0x5633e04e0054
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e180, 0xc46b388540, 0x19, 0xc46b40afa0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:71 +0xf7 fp=0xc46b51faa8 sp=0xc46b51fa50 pc=0x5633e08c8c47
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x5633e15dd638, 0xc46aeb42c0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:125 +0x12c fp=0xc46b51fb40 sp=0xc46b51faa8 pc=0x5633e08cc8ac
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0x5633e15de820, 0xc46aeb6300)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:74 +0x9e fp=0xc46b51fb98 sp=0xc46b51fb40 pc=0x5633e08c8bee
github.com/coreos/etcd/etcdserver/api/v3rpc.newStreamInterceptor.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0xc46b40af50, 0x0, 0x0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/etcdserver/api/v3rpc/interceptor.go:212 +0xf4 fp=0xc46b51fc28 sp=0xc46b51fb98 pc=0x5633e08f7c04
github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1(0x5633e0ed9460, 0xc4205f3d00, 0x5633e108e360, 0xc46b2e6800, 0xc46b388520, 0x5633e1075158, 0x5633e04e0164, 0xc460c10bc0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/src/github.com/coreos/etcd/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:79 +0x168 fp=0xc46b51fc98 sp=0xc46b51fc28 pc=0x5633e08c8de8
google.golang.org/grpc.(*Server).processStreamingRPC(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0xc4205474a0, 0x5633e15681c0, 0x0, 0x0, 0x0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:957 +0x2cf fp=0xc46b51fdc8 sp=0xc46b51fc98 pc=0x5633e04b315f
google.golang.org/grpc.(*Server).handleStream(0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00, 0x0)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:1044 +0x12b3 fp=0xc46b51ff80 sp=0xc46b51fdc8 pc=0x5633e04b4cb3
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc421783690, 0xc4200bb040, 0x5633e108f140, 0xc42140f4a0, 0xc46ae95d00)
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:589 +0xa1 fp=0xc46b51ffb8 sp=0xc46b51ff80 pc=0x5633e04bad71
runtime.goexit()
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc46b51ffc0 sp=0xc46b51ffb8 pc=0x5633e00fcad1
created by google.golang.org/grpc.(*Server).serveStreams.func1
/builddir/build/BUILD/etcd-2cf9e51d2a78003b164c2998886158e60ded1cbb/Godeps/_workspace/src/google.golang.org/grpc/server.go:587 +
--------------------etcd3 only has warning logs-------------------------
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:24:19 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
WARNING: 2019/08/09 11:50:34 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport

@FengXingYuXin

The log is probably the result, not the cause. You can run perf top -p <pid> to track down the CPU hotspot.
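
Concretely, that suggestion amounts to something like the following (the PID 10297 is just an example value taken from the earlier top output):

sudo perf top -p 10297 --sort comm,dso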

@phosae (Author) commented Aug 20, 2019

Can somebody help with this issue?
The perf top output is here:
$ perf top --sort comm,dso
99.34% etcd etcd
0.24% perf perf
0.10% perf [kernel]
$ perf top
9.33% etcd [.] 0x00000000006315cd
8.77% etcd [.] _start
2.54% etcd [.] 0x0000000000640a81
2.49% etcd [.] 0x000000000064081f
2.40% etcd [.] 0x0000000000640b48
2.19% etcd [.] 0x0000000000640ad1
2.19% etcd [.] 0x0000000000640aba
2.16% etcd [.] 0x0000000000640a97
2.11% etcd [.] 0x0000000000640824
2.09% etcd [.] 0x0000000000640b5e
2.03% etcd [.] 0x0000000000640b7e
1.94% etcd [.] 0x0000000000640aa3
1.94% etcd [.] 0x0000000000640ac5
1.89% etcd [.] 0x0000000000640811
1.88% etcd [.] 0x0000000000640807
1.83% etcd [.] 0x0000000000640a8c
1.80% etcd [.] 0x0000000000640aae
1.45% etcd [.] 0x0000000000640b6b
1.44% etcd [.] 0x0000000000640815
1.41% etcd [.] 0x000000000064081a
1.33% etcd [.] 0x0000000000640a7a
1.24% etcd [.] 0x0000000000640b53
1.15% etcd [.] 0x0000000000640b50
1.02% etcd [.] 0x0000000000640a84
1.01% etcd [.] 0x00000000006407fe
1.01% etcd [.] 0x0000000000640b4c
1.00% etcd [.] 0x00000000006407eb
0.91% etcd [.] 0x0000000000640b45
0.88% etcd [.] 0x0000000000640a9e
0.88% etcd [.] 0x000000000064080c
0.84% etcd [.] 0x0000000000640b57
0.81% [kernel] [k] __softirqentry_text_start
0.74% etcd [.] 0x0000000000640b76
0.72% etcd [.] 0x00000000006409c0
0.70% etcd [.] 0x0000000000640b6f
0.65% etcd [.] 0x00000000006409c4
0.64% etcd [.] 0x00000000006407f9
0.63% etcd [.] 0x0000000000640802
0.60% etcd [.] 0x0000000000640b79
0.60% etcd [.] 0x0000000000640a88
0.59% etcd [.] 0x0000000000640a9a
0.58% [kernel] [k] run_timer_softirq
0.56% etcd [.] 0x0000000000640b66
0.54% etcd [.] 0x0000000000640ab5
0.54% etcd [.] 0x0000000000640acc
0.53% etcd [.] 0x0000000000640b72
0.49% etcd [.] 0x000000000063147c
0.41% etcd [.] 0x0000000000640a7e
0.41% etcd [.] 0x00000000006407e8
0.38% etcd [.] 0x0000000000640ab1
0.35% etcd [.] 0x0000000000640a94
0.34% etcd [.] 0x0000000000640a22
0.32% etcd [.] 0x000000000062fdfd
0.31% etcd [.] 0x00000000006315eb
0.30% etcd [.] 0x0000000000640ac8
0.30% etcd [.] 0x00000000006313cf
0.30% etcd [.] 0x0000000000640b63
0.29% etcd [.] 0x0000000000640a26
0.28% [kernel] [k] __lock_text_start
0.27% etcd [.] 0x00000000006313d4
0.27% etcd [.] 0x0000000000631528
0.27% etcd [.] 0x000000000062fee6
0.26% etcd [.] 0x000000000057b1c7
0.25% etcd [.] 0x0000000000631457
0.24% etcd [.] 0x00000000006407ef
0.19% etcd [.] 0x000000000063140a
0.18% etcd [.] 0x000000000063155b
0.17% etcd [.] 0x0000000000640b41

$ perf top -g
Children Self Shared Object Symbol

  • 34.94% 0.00% etcd [.] 0xffffaa3397c55ad1
  • 34.89% 0.00% etcd [.] 0xffffaa3398013d71
  • 34.89% 0.00% etcd [.] 0xffffaa339800dd1a
  • 34.89% 0.00% etcd [.] 0xffffaa339800ab4a
  • 34.89% 0.00% etcd [.] 0xffffaa3398038029
  • 34.89% 0.00% etcd [.] 0xffffaa33984219ee
  • 34.89% 0.00% etcd [.] 0xffffaa339845092d
  • 34.89% 0.00% etcd [.] 0xffffaa33984217b1
  • 34.89% 0.00% etcd [.] 0xffffaa339845073c
  • 34.89% 0.00% etcd [.] 0xffffaa339842562d
  • 34.89% 0.00% etcd [.] 0xffffaa339842182a
  • 34.89% 0.00% etcd [.] 0xffffaa3398084ba8
  • 34.89% 0.00% etcd [.] 0xffffaa339845312c
  • 34.89% 0.00% etcd [.] 0xffffaa33984483ec
  • 34.89% 0.00% etcd [.] 0xffffaa33983a697c
  • 34.82% 0.00% etcd [.] 0xffffaa33983aa9f5
  • 34.82% 0.00% etcd [.] 0xffffaa33983abff4
  • 34.80% 0.00% etcd [.] 0xffffaa3398130647
  • 34.80% 0.00% etcd [.] 0xffffaa339812da16
  • 34.76% 0.00% etcd [.] 0xffffaa33981260cb
  • 34.76% 0.00% etcd [.] 0xffffaa33981216a5
  • 34.74% 0.00% etcd [.] 0xffffaa339811eb89
  • 32.92% 0.00% etcd [.] 0xffffaa339811f499
  • 32.52% 0.00% etcd [.] 0xffffaa339811ff72
  • 32.06% 0.00% etcd [.] 0xffffaa3397d64538
  • 32.03% 0.00% etcd [.] 0xffffaa3397d673c5
  • 32.02% 0.00% etcd [.] 0xffffaa3397d027d8
  • 22.61% 0.00% etcd [.] 0xffffaa3397d0e457
  • 22.60% 0.04% perf [.] hist_entry_iter__add
  • 22.59% 0.00% etcd [.] 0xffffaa3397d0be6b
  • 12.14% 0.00% libc-2.26.so [.] __libc_start_main
  • 12.14% 0.00% [unknown] [.] 0xd856258d4c544155
  • 12.14% 0.00% perf [.] main
  • 12.14% 0.00% perf [.] 0xffffffffffc9aae1
  • 12.14% 0.00% perf [.] 0xffffffffffc9a7df
  • 12.14% 0.00% perf [.] cmd_top
  • 12.04% 0.00% perf [.] 0xffffffffffc4ce65
  • 7.91% 7.01% etcd [.] _start
  • 7.85% 0.00% etcd [.] 0xffffaa3397d0e34e
  • 7.69% 0.00% etcd [.] 0xffffaa3397d0b1fb
  • 7.47% 0.06% perf [.] callchain_append
  • 7.12% 7.12% etcd [.] 0x00000000006315cd

$ perf top -e cycles:k
    20.00% [kernel] [k] finish_task_switch
    14.74% [kernel] [k] __lock_text_start
    5.85% [kernel] [k] __softirqentry_text_start
    5.82% [kernel] [k] tick_nohz_idle_enter
    4.61% [kernel] [k] do_syscall_64
    3.34% [kernel] [k] get_user_pages_fast
    3.22% [kernel] [k] __sched_text_start
    2.04% [kernel] [k] tick_nohz_idle_exit
    1.53% [kernel] [k] __audit_syscall_entry
    1.32% [kernel] [k] ena_io_poll
    1.01% [kernel] [k] syscall_trace_enter
    0.97% [kernel] [k] futex_wait_setup
    0.89% [kernel] [k] exit_to_usermode_loop
    0.88% [kernel] [k] __netif_receive_skb_core
    0.86% [kernel] [k] schedule
    0.73% [kernel] [k] mark_wake_futex
    0.69% [kernel] [k] __audit_syscall_exit
    0.62% [kernel] [k] futex_wait_queue_me
    0.61% [kernel] [k] do_idle
    0.56% [kernel] [k] __indirect_thunk_start
    0.54% [kernel] [k] memset_erms
    0.51% [kernel] [k] get_futex_key
    0.51% [kernel] [k] __do_page_fault
    0.49% [kernel] [k] validate_xmit_skb
    0.49% [kernel] [k] dm_rh_inc_pending
    0.48% [kernel] [k] mutex_lock
    0.47% [kernel] [k] sys_pselect6
    0.46% [kernel] [k] __vfs_read
    0.46% [kernel] [k] dev_gro_receive
    0.44% [kernel] [k] copy_user_generic_unrolled
    0.43% [kernel] [k] poll_schedule_timeout
    0.42% [kernel] [k] kmem_cache_free
    0.42% [kernel] [k] futex_wait
    0.40% [kernel] [k] quiet_vmstat
    0.39% [kernel] [k] find_first_bit

@jingyih (Contributor) commented Aug 21, 2019

As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.
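
One way to do that check from a single machine (sketch only; the endpoint names come from the startup configuration above, and the root credentials are assumed because auth is enabled):

ETCDCTL_API=3 etcdctl \
  --endpoints=http://etcd1:2379,http://etcd2:2379,http://etcd3:2379 \
  --user root:<password> \
  endpoint status --write-out=table
# the table lists each member's ID, server version, DB size and leader status,
# so an unreachable or lagging member shows up immediately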

@phosae (Author) commented Aug 26, 2019

> As a sanity check, make sure all 3 etcd servers are up and running, and they are connected to each other.

Every time we restart the cluster, each node is up and running and they are connected to each other. But several days later, the CPU overload makes the nodes fail to receive heartbeats from each other.

The only thing we can do is restart the cluster, but that does not bring the high CPU and virtual memory usage back down.

@weopm commented Jan 8, 2020

This bug was fixed in #10218.

@phosae (Author) commented Jan 8, 2020

> This bug was fixed in #10218.

Thanks, this problem hasn't reappeared since we upgraded our etcd cluster.

phosae closed this as completed on Jan 8, 2020
@liujjcdl

@weopm May I know if etcd 3.3.12 still has this issue? We have a similar high CPU usage issue on etcd 3.3.12. Thanks!

@weopm commented Jan 10, 2020

This bug still exists in etcd 3.3.18. Maybe you can apply a patch based on https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560 or upgrade to etcd 4.4.3.

@liujjcdl

@weopm thanks for the quick reply. To double-confirm, should it be 3.4.3 that has the fix?

@weopm commented Jan 10, 2020

This bug is fixed in etcd 4.4.3, not 3.4.3. You can review and confirm the code yourself at https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560.

@cjquinon

@weopm - We're seeing this issue in production workloads on 3.3.12. I noticed that there was a cherry-pick commit for 3.4, which was merged, and a cherry-pick for 3.3, which was closed and not merged.

Is it possible to get this fix into 3.3 and create a new patch release?

I do see it's in the v3.4.3 tag as well: https://github.com/etcd-io/etcd/blob/v3.4.3/etcdserver/v3_server.go#L607

@cjquinon

PR to cherry-pick into 3.3: #11378
