Leadership thrashing and heartbeat near election timeout warnings #868

Closed

marc-barry opened this issue Jun 24, 2014 · 32 comments

@marc-barry

We are running etcd in production on a cluster of 5 machines. These are fairly powerful machines which aren't heavily loaded. 4 of the machines are in the same rack and the other is in the same data center. Ping times between the 5 nodes don't exceed 0.3 ms.

We are currently running version v0.4.3 on all nodes.

We have been seeing issues where leadership changes occur many times in an hour. At other times, leadership remains stable for longer, but generally it changes more than once per day.

The following is a log from one of our nodes. A message like the following appears quite often and it is occurring on all nodes: warning: heartbeat near election timeout: 230.012467ms. Note that the log capture below even shows a leadership change event.

2014-06-24_09:30:52.28621 [etcd] Jun 24 09:30:52.286 INFO      | node02: warning: heartbeat near election timeout: 230.153559ms
2014-06-24_09:32:17.67848 [etcd] Jun 24 09:32:17.678 INFO      | node02: snapshot of 10005 events at index 8250681 completed
2014-06-24_10:05:04.32174 [etcd] Jun 24 10:05:04.321 INFO      | node02: warning: heartbeat near election timeout: 214.635672ms
2014-06-24_10:48:05.78297 [etcd] Jun 24 10:48:05.782 INFO      | node02: snapshot of 10005 events at index 8260686 completed
2014-06-24_11:00:37.21393 [etcd] Jun 24 11:00:37.213 INFO      | node02: warning: heartbeat near election timeout: 206.198492ms
2014-06-24_11:37:53.66284 [etcd] Jun 24 11:37:53.662 INFO      | node02: warning: heartbeat near election timeout: 204.851479ms
2014-06-24_11:51:51.72018 [etcd] Jun 24 11:51:51.720 INFO      | node02: warning: heartbeat near election timeout: 212.603747ms
2014-06-24_11:59:48.21260 [etcd] Jun 24 11:59:48.212 INFO      | node02: warning: heartbeat near election timeout: 204.806339ms
2014-06-24_12:03:51.19359 [etcd] Jun 24 12:03:51.193 INFO      | node02: snapshot of 10001 events at index 8270687 completed
2014-06-24_12:05:53.76829 [etcd] Jun 24 12:05:53.768 INFO      | node02: warning: heartbeat near election timeout: 210.730741ms
2014-06-24_12:20:43.93769 [etcd] Jun 24 12:20:43.937 INFO      | node02: warning: heartbeat near election timeout: 230.012467ms
2014-06-24_12:36:20.01890 [etcd] Jun 24 12:36:20.018 INFO      | node02: warning: heartbeat near election timeout: 211.284365ms
2014-06-24_12:41:03.40828 [etcd] Jun 24 12:41:03.408 INFO      | node02: warning: heartbeat near election timeout: 200.771973ms
2014-06-24_13:00:24.20963 [etcd] Jun 24 13:00:24.209 INFO      | node02: warning: heartbeat near election timeout: 202.204165ms
2014-06-24_13:02:02.98420 [etcd] Jun 24 13:02:02.984 INFO      | node02: warning: heartbeat near election timeout: 276.732103ms
2014-06-24_13:03:41.78254 [etcd] Jun 24 13:03:41.782 INFO      | node02: warning: heartbeat near election timeout: 225.107469ms
2014-06-24_13:08:40.76213 [etcd] Jun 24 13:08:40.762 INFO      | node02: warning: heartbeat near election timeout: 204.810718ms
2014-06-24_13:11:08.10957 [etcd] Jun 24 13:11:08.109 INFO      | node02: term #192064 started.
2014-06-24_13:11:08.10962 [etcd] Jun 24 13:11:08.109 INFO      | node02: leader changed from 'node03' to ''.
2014-06-24_13:11:08.41734 [etcd] Jun 24 13:11:08.417 INFO      | node02: term #192065 started.
2014-06-24_13:13:10.48635 [etcd] Jun 24 13:13:10.486 INFO      | node02: warning: heartbeat near election timeout: 217.458409ms
2014-06-24_13:15:52.77356 [etcd] Jun 24 13:15:52.773 INFO      | node02: warning: heartbeat near election timeout: 204.752586ms
2014-06-24_13:19:36.86114 [etcd] Jun 24 13:19:36.861 INFO      | node02: snapshot of 10004 events at index 8280691 completed
2014-06-24_13:39:38.08433 [etcd] Jun 24 13:39:38.084 INFO      | node02: warning: heartbeat near election timeout: 265.827895ms
2014-06-24_13:40:07.59983 [etcd] Jun 24 13:40:07.599 INFO      | node02: warning: heartbeat near election timeout: 230.380641ms
2014-06-24_13:43:16.72872 [etcd] Jun 24 13:43:16.728 INFO      | node02: warning: heartbeat near election timeout: 210.388067ms
2014-06-24_14:10:59.82962 [etcd] Jun 24 14:10:59.829 INFO      | node02: warning: heartbeat near election timeout: 211.872611ms
2014-06-24_14:14:22.13860 [etcd] Jun 24 14:14:22.138 INFO      | node02: warning: heartbeat near election timeout: 220.687067ms
2014-06-24_14:17:45.78991 [etcd] Jun 24 14:17:45.789 INFO      | node02: warning: heartbeat near election timeout: 222.335913ms
2014-06-24_14:29:49.48276 [etcd] Jun 24 14:29:49.482 INFO      | node02: warning: heartbeat near election timeout: 215.462707ms
2014-06-24_14:35:22.70772 [etcd] Jun 24 14:35:22.707 INFO      | node02: snapshot of 10002 events at index 8290693 completed
2014-06-24_15:29:21.44882 [etcd] Jun 24 15:29:21.448 INFO      | node02: warning: heartbeat near election timeout: 232.209112ms
2014-06-24_15:51:08.76241 [etcd] Jun 24 15:51:08.762 INFO      | node02: snapshot of 10002 events at index 8300695 completed
2014-06-24_17:06:54.97829 [etcd] Jun 24 17:06:54.978 INFO      | node02: snapshot of 10002 events at index 8310697 completed

In terms of the heartbeat interval and election timeout, these are the settings we are using on all nodes (a rough timing sketch follows the list):

  • -peer-heartbeat-interval=50
  • -peer-election-timeout=250
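
For a sense of how tight those numbers are, here is a rough sketch in Go. The warning threshold is an assumption on my part (the logged values all being at least 200 ms against a 250 ms timeout suggests it fires at about 80% of the election timeout), so treat this as illustration rather than a reading of the etcd source:

package main

import (
	"fmt"
	"time"
)

func main() {
	heartbeatInterval := 50 * time.Millisecond // -peer-heartbeat-interval=50
	electionTimeout := 250 * time.Millisecond  // -peer-election-timeout=250

	// Assumption: the "heartbeat near election timeout" warning fires once the
	// observed gap between heartbeats exceeds ~80% of the election timeout.
	warnThreshold := electionTimeout * 8 / 10

	observedGap := 230 * time.Millisecond // e.g. the "230.012467ms" entry above
	fmt.Printf("heartbeats that fit before the warning: %d\n", int64(warnThreshold/heartbeatInterval))
	fmt.Printf("slack left before an election starts:   %v\n", electionTimeout-observedGap)
}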

We have performed packet captures between the nodes, and at the network layer they are sending and receiving the heartbeats at the 50 ms interval.

So far etcd has not been stable for us, and this has caused issues for our etcd clients. I am trying to figure out why this is happening and what next steps I can take to troubleshoot the issue and stabilize our etcd production cluster.

@yichengq
Contributor

@marc-barry
Do you have CPU and disk IO usage stats for the machines during this period?
Could you also try to curl http://127.0.0.1:4001/v2/stats/self on each machine? That would tell us how these etcd processes see the cluster.
Thanks for the report!

@marc-barry
Author

@unihorn

Do you have CPU and disk IO usage stats for the machines during this period?

While I was typing this up, the leadership changed many times across different machines, so keep that in mind when reading the results from the self stats.

Each machine has 32 cores with 96 GB of RAM. Here is the CPU utilization as reported by top on the leader (currently node05) and for a follower (node04). The following warning is being reported on node04 at the moment.

2014-06-25_12:03:29.42332 [etcd] Jun 25 12:03:29.423 INFO      | node04: warning: heartbeat near election timeout: 249.81457ms

top CPU utilization for node05 (current leader):

Cpu(s):  0.2%us,  0.1%sy,  0.0%ni, 99.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st

top etcd utilization for node05 (current leader):

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32623 etcd      20   0  731m 200m 4760 S    3  0.2  48:50.38 etcd

top CPU utilization for node04 (current follower --> reporting warning about heartbeat):

Cpu(s):  1.9%us,  0.6%sy,  0.0%ni, 97.4%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st

top etcd utilization for node04 (current follower):

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  615 etcd      20   0  728m 175m 4164 S    4  0.2  50:00.08 etcd

Here is the output of iostat -d -x on node02; while I was typing this out, it was elected as the new leader:

Linux 3.8.0-31-generic (node02)     06/25/2014  _x86_64_    (32 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01    45.27    0.43  136.43    31.17  1007.15    15.17     0.14    1.02    7.81    0.99   0.22   3.03
sdb               0.00     5.37    0.10    3.72     3.10    91.90    49.77     0.01    2.34    3.85    2.30   0.55   0.21
sdc               0.00     5.02    0.61    4.30    34.63   314.61   142.32     0.06   11.23    1.74   12.57   1.42   0.70
sdd               0.00     5.42    0.37    4.61     8.97   323.87   133.74     0.06   12.85    3.55   13.59   1.77   0.88
sde               0.00     6.09    0.65    5.23    33.98   284.73   108.44     0.06   10.24    2.42   11.20   1.44   0.84
sdf               0.00     5.63    0.28    4.27     6.32   265.31   119.38     0.04    9.47    2.79    9.91   1.29   0.59
dm-3              0.00     0.00    0.40  181.67    31.00  1007.05    11.40     0.16    0.86    8.19    0.84   0.17   3.01
dm-4              0.00     0.00    0.02    0.02     0.07     0.10     8.00     0.00   94.36   12.10  150.36   1.36   0.01

Could you also try to curl http://127.0.0.1:4001/v2/stats/self on each machine? That would tell us how these etcd processes see the cluster.

Here are the results of collecting these stats from each node. Note that there were leadership changes while I was collecting the following results.

node01

{
   "name":"node01",
   "state":"follower",
   "startTime":"2014-06-23T17:08:44.530204794Z",
   "leaderInfo":{
      "leader":"node03",
      "uptime":"1m29.278701218s",
      "startTime":"2014-06-25T12:22:41.775389245Z"
   },
   "recvAppendRequestCnt":3009522,
   "recvPkgRate":20.100565290971556,
   "recvBandwidthRate":777.8918767605992,
   "sendAppendRequestCnt":406670
}

node02

{
   "name":"node02",
   "state":"follower",
   "startTime":"2014-06-23T17:07:12.11602086Z",
   "leaderInfo":{
      "leader":"node01",
      "uptime":"46.824809849s",
      "startTime":"2014-06-25T12:26:41.045445911Z"
   },
   "recvAppendRequestCnt":2875177,
   "recvPkgRate":20.100681162164545,
   "recvBandwidthRate":777.896360975768,
   "sendAppendRequestCnt":963892
}

node03

{
   "name":"node03",
   "state":"follower",
   "startTime":"2014-06-23T17:05:52.575371888Z",
   "leaderInfo":{
      "leader":"node01",
      "uptime":"407.642669ms",
      "startTime":"2014-06-25T12:30:12.124416457Z"
   },
   "recvAppendRequestCnt":2693330,
   "recvPkgRate":19.677688628106743,
   "recvBandwidthRate":763.4943187705417,
   "sendAppendRequestCnt":1714171
}

node04

{
   "name":"node04",
   "state":"follower",
   "startTime":"2014-06-23T17:02:29.035340443Z",
   "leaderInfo":{
      "leader":"node05",
      "uptime":"32.372670147s",
      "startTime":"2014-06-25T12:30:42.875519676Z"
   },
   "recvAppendRequestCnt":2037376,
   "recvPkgRate":20.100540445009845,
   "recvBandwidthRate":777.9914179241059,
   "sendAppendRequestCnt":4358807
}

node05

{
   "name":"node05",
   "state":"leader",
   "startTime":"2014-06-23T17:11:27.907702227Z",
   "leaderInfo":{
      "leader":"node05",
      "uptime":"1m25.597956104s",
      "startTime":"2014-06-25T12:30:42.875152498Z"
   },
   "recvAppendRequestCnt":1863075,
   "sendAppendRequestCnt":5016584,
   "sendPkgRate":81.62929076739682,
   "sendBandwidthRate":3299.4559328181795
}

Thanks for the report!

You're welcome. Thanks for the help.

@yichengq
Contributor

The network and disk IO can hardly be the bottleneck here; the utilization of both is low.
The machine stats also look fine.

Could you also try the following?

curl -L http://127.0.0.1:4001/v2/stats/leader
curl http://127.0.0.1:4001/v2/stats/store

@marc-barry
Author

Could you also try the following?

curl -L http://127.0.0.1:4001/v2/stats/leader
curl http://127.0.0.1:4001/v2/stats/store

@unihorn

Sure. Here are the results:

curl -L http://127.0.0.1:4001/v2/stats/leader on node01. Note that a leadership change happened just before I did this.

{
   "leader":"node03",
   "followers":{
      "node01":{
         "latency":{
            "current":0.650869,
            "average":0.9658280053054744,
            "standardDeviation":3.2272089958544545,
            "minimum":0.291784,
            "maximum":398.90756
         },
         "counts":{
            "fail":2,
            "success":502876
         }
      },
      "node02":{
         "latency":{
            "current":0.891281,
            "average":1.059197478397645,
            "standardDeviation":3.599525384820501,
            "minimum":0.294793,
            "maximum":337.68265
         },
         "counts":{
            "fail":6,
            "success":502816
         }
      },
      "node04":{
         "latency":{
            "current":1.084962,
            "average":1.01607957382241,
            "standardDeviation":3.4257407723845104,
            "minimum":0.272535,
            "maximum":337.845014
         },
         "counts":{
            "fail":0,
            "success":502842
         }
      },
      "node05":{
         "latency":{
            "current":0.963331,
            "average":3.223443277597172,
            "standardDeviation":7.651197977242439,
            "minimum":0.319269,
            "maximum":337.696842
         },
         "counts":{
            "fail":0,
            "success":502804
         }
      }
   }
}

curl http://127.0.0.1:4001/v2/stats/store on all nodes:
node01

{
   "getsSuccess":1917817,
   "getsFail":3581221,
   "setsSuccess":905023,
   "setsFail":730,
   "deleteSuccess":4737,
   "deleteFail":4882,
   "updateSuccess":597859,
   "updateFail":339,
   "createSuccess":3450,
   "createFail":43551,
   "compareAndSwapSuccess":0,
   "compareAndSwapFail":0,
   "compareAndDeleteSuccess":1945,
   "compareAndDeleteFail":26,
   "expireCount":15,
   "watchers":0
}

node02

{
   "getsSuccess":1599914,
   "getsFail":2475617,
   "setsSuccess":905023,
   "setsFail":730,
   "deleteSuccess":4737,
   "deleteFail":4882,
   "updateSuccess":597871,
   "updateFail":339,
   "createSuccess":3450,
   "createFail":43551,
   "compareAndSwapSuccess":0,
   "compareAndSwapFail":0,
   "compareAndDeleteSuccess":1945,
   "compareAndDeleteFail":26,
   "expireCount":15,
   "watchers":1
}

node03

{
   "getsSuccess":1907889,
   "getsFail":2981538,
   "setsSuccess":905023,
   "setsFail":730,
   "deleteSuccess":4737,
   "deleteFail":4882,
   "updateSuccess":597898,
   "updateFail":339,
   "createSuccess":3450,
   "createFail":43551,
   "compareAndSwapSuccess":0,
   "compareAndSwapFail":0,
   "compareAndDeleteSuccess":1945,
   "compareAndDeleteFail":26,
   "expireCount":487,
   "watchers":0
}

node04

{
   "getsSuccess":1914088,
   "getsFail":2872526,
   "setsSuccess":905023,
   "setsFail":730,
   "deleteSuccess":4737,
   "deleteFail":4882,
   "updateSuccess":597913,
   "updateFail":339,
   "createSuccess":3450,
   "createFail":43551,
   "compareAndSwapSuccess":0,
   "compareAndSwapFail":0,
   "compareAndDeleteSuccess":1945,
   "compareAndDeleteFail":26,
   "expireCount":487,
   "watchers":2
}

node05

{
   "getsSuccess":1697718,
   "getsFail":2567181,
   "setsSuccess":905023,
   "setsFail":730,
   "deleteSuccess":4737,
   "deleteFail":4882,
   "updateSuccess":597925,
   "updateFail":339,
   "createSuccess":3450,
   "createFail":43551,
   "compareAndSwapSuccess":0,
   "compareAndSwapFail":0,
   "compareAndDeleteSuccess":1945,
   "compareAndDeleteFail":26,
   "expireCount":472,
   "watchers":3
}

I am also pretty sure that our watches have stopped firing. We use go-etcd in some apps and they watch certain keys. The keys are changing, but the watches have stopped working. I am hoping this is just a side effect of the issues we are seeing with etcd.
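
For reference, here is a minimal watch loop against the plain v2 HTTP wait API (the key name is made up; as far as I understand, go-etcd does essentially the same long-poll under the hood). One plausible failure mode during leader thrashing is that the long-poll connection is dropped or returns an error and the watch is never re-issued, which would look exactly like watches "stopping":

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// watchResp holds only the fields we care about from a v2 watch response.
type watchResp struct {
	Action string `json:"action"`
	Node   struct {
		Key           string `json:"key"`
		Value         string `json:"value"`
		ModifiedIndex uint64 `json:"modifiedIndex"`
	} `json:"node"`
}

func main() {
	// "/config/example" is a made-up key used only for illustration.
	base := "http://127.0.0.1:4001/v2/keys/config/example?wait=true"
	var waitIndex uint64

	for {
		url := base
		if waitIndex > 0 {
			url = fmt.Sprintf("%s&waitIndex=%d", base, waitIndex)
		}
		resp, err := http.Get(url) // blocks until the key changes or the connection drops
		if err != nil {
			fmt.Println("watch error, retrying:", err)
			time.Sleep(time.Second)
			continue // the watch must be re-issued, or it silently stops
		}
		var r watchResp
		if err := json.NewDecoder(resp.Body).Decode(&r); err == nil {
			fmt.Printf("%s %s = %q (index %d)\n", r.Action, r.Node.Key, r.Node.Value, r.Node.ModifiedIndex)
			waitIndex = r.Node.ModifiedIndex + 1
		}
		resp.Body.Close()
	}
}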

@burke

burke commented Jun 26, 2014

I don't think marc has mentioned that etcd CPU usage appears to grow steadily. Since he posted this yesterday, our etcd processes on the 5 nodes have grown to 8.0%, 8.4%, 7.7%, 7.4%, and 7.8% CPU usage. Each number is quite stable and doesn't fluctuate much on a second-to-second basis. When we upgraded from 0.4.1 to 0.4.3 I noticed at least one etcd process with over 21% CPU usage.
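
To put numbers on that growth, here is a small Linux-only sketch that samples the etcd process's cumulative CPU time from /proc/<pid>/stat once a minute. The PID and USER_HZ value are placeholders for whatever applies on the boxes:

package main

import (
	"fmt"
	"io/ioutil"
	"strconv"
	"strings"
	"time"
)

// cpuTicks returns utime+stime (in clock ticks) for a PID from /proc/<pid>/stat.
func cpuTicks(pid int) (uint64, error) {
	b, err := ioutil.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	s := string(b)
	// Skip past the ")" that ends the comm field; utime and stime are then
	// the 12th and 13th remaining fields (fields 14 and 15 overall).
	fields := strings.Fields(s[strings.LastIndex(s, ")")+1:])
	utime, _ := strconv.ParseUint(fields[11], 10, 64)
	stime, _ := strconv.ParseUint(fields[12], 10, 64)
	return utime + stime, nil
}

func main() {
	const pid = 32623 // placeholder: the etcd PID from the `top` output above
	const hz = 100.0  // USER_HZ; 100 on typical Linux builds

	prev, _ := cpuTicks(pid)
	for range time.Tick(time.Minute) {
		cur, err := cpuTicks(pid)
		if err != nil {
			return // process went away
		}
		fmt.Printf("%s etcd CPU: %.1f%%\n",
			time.Now().Format("15:04:05"), float64(cur-prev)/hz/60*100)
		prev = cur
	}
}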

@marc-barry
Author

@burke @unihorn

Burke is correct. Check out this CPU usage: all nodes are over 30%.

node01:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM
20119 etcd      20   0  825m 274m 4108 S   32  0.3

node02:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM
31673 etcd      20   0  909m 297m 4144 S   35  0.3

node03:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM
 8284 etcd      20   0  758m 286m 4088 S   33  0.3

node04:

PID USER      PR  NI  VIRT  RES  SHR S %CPU
  615 etcd      20   0  820m 278m 4112 S   33

node05:

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM
32623 etcd      20   0  816m 275m 4796 S   32  0.3

htop on node05 shows one etcd parent process with 7 etcd children. Three of the seven are using 0% CPU. Four of the seven are using 8% each. htop on node02 shows one etcd parent process with 9 etcd children.

Here is a curl -L http://127.0.0.1:4001/v2/stats/leader again on node04.

{
   "leader":"node04",
   "followers":{
      "node01":{
         "latency":{
            "current":1.054622,
            "average":1.0004305487068865,
            "standardDeviation":4.018218436601287,
            "minimum":0.264388,
            "maximum":472.419893
         },
         "counts":{
            "fail":67,
            "success":1794038
         }
      },
      "node02":{
         "latency":{
            "current":0.638996,
            "average":1.0337069312046558,
            "standardDeviation":5.082345637648887,
            "minimum":0.240994,
            "maximum":472.119836
         },
         "counts":{
            "fail":61,
            "success":1793304
         }
      },
      "node03":{
         "latency":{
            "current":0.702603,
            "average":0.9834211504224556,
            "standardDeviation":3.366540155673035,
            "minimum":0.273594,
            "maximum":452.885206
         },
         "counts":{
            "fail":17,
            "success":1794799
         }
      },
      "node05":{
         "latency":{
            "current":0.863074,
            "average":3.3656152790531073,
            "standardDeviation":8.571406992641448,
            "minimum":0.28737,
            "maximum":476.095643
         },
         "counts":{
            "fail":69,
            "success":1793210
         }
      }
   }
}

@marc-barry
Author

@unihorn

We are now seeing a new leader elected more than once per minute; I would actually say it is happening 2-3 times per minute.

@yichengq
Contributor

@marc-barry @burke
Do you know why its CPU usage grows from 3% to 8% to 32%? Does it always grow steadily? Does it happen only with 0.4.3?

The maximum latency is surprisingly high, and this is the root cause of the frequent leader changes. The latency is calculated this way: take a timestamp before and after the "net/http".Client.Post call and treat the difference as the latency.
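
For context, a minimal sketch of that measurement as I understand the description above (not the actual etcd code; the URL and content type below are placeholders). Measured this way, the number presumably includes any time the sending goroutine spends waiting to be scheduled or blocked inside the client, so scheduler or GC stalls on the leader would show up as follower "latency" even when the network itself is fast:

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postLatency posts a payload to a peer and returns how long the round trip
// took: one timestamp before Client.Post, one after, difference reported as
// the latency, mirroring the description above.
func postLatency(c *http.Client, url string, body []byte) (time.Duration, error) {
	start := time.Now()
	resp, err := c.Post(url, "application/octet-stream", bytes.NewReader(body))
	elapsed := time.Since(start)
	if err != nil {
		return elapsed, err
	}
	resp.Body.Close()
	return elapsed, nil
}

func main() {
	c := &http.Client{Timeout: time.Second}
	// Hypothetical peer URL, for illustration only.
	if d, err := postLatency(c, "http://node02:7001/append", nil); err == nil {
		fmt.Println("append latency:", d)
	}
}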

@xiangli-cmu @philips Ideas?

@marc-barry
Author

Do you know why its CPU usage grows from 3% to 8% to 32%? Does it always grow steadily? Does it happen only with 0.4.3?

@unihorn

We really don't know why this is the case. The set of etcd clients remains constant. There are a few watches via go-etcd, and one client polls for changes about 10 times/second. I cannot say for certain whether the CPU utilization grows steadily, but it appears to. We are working on collecting more empirical data about CPU utilization.

The performance issues (including CPU utilization) also happened with 0.4.1. I was hoping an upgrade from 0.4.1 to 0.4.3 would make the problem go away but it didn't.

@burke

burke commented Jun 27, 2014

$ sudo strace -c -f -p <pid>
...<snip>...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.15    5.538681          28    195767     52049 futex
 47.91    5.187963         119     43473           epoll_wait
  0.92    0.099945           2     48010           select
  0.02    0.002020           9       232           fsync
  0.00    0.000138           0       629         6 write
  0.00    0.000080           0       608       301 read
  0.00    0.000038           0       232           lseek
  0.00    0.000000           0        14           close
  0.00    0.000000           0         6         6 rt_sigreturn
  0.00    0.000000           0        89           sched_yield
  0.00    0.000000           0        14           getsockname
  0.00    0.000000           0        14           setsockopt
  0.00    0.000000           0        28           epoll_ctl
  0.00    0.000000           0        28        14 accept4
------ ----------- ----------- --------- --------- ----------------
100.00   10.828865                289144     52376 total

On one hand, with the time split between futex and epoll_wait, you'd think it wouldn't be using much CPU, but then you notice that it's making roughly 18,000 syscalls per second. This suggests to me that we may be in goroutine scheduling hell.

@burke

burke commented Jun 27, 2014

I grabbed a list of running goroutines about a minute apart:

(gdb) info goroutines
  1  waiting runtime.park
* 2  syscall runtime.notetsleepg
  5 runnable runtime.exitsyscall
* 4  syscall runtime.notetsleepg
  19  waiting runtime.park
  17  waiting runtime.park
  8  syscall runtime.goexit
  18  waiting runtime.park
  20  waiting runtime.park
  21  waiting runtime.park
  22  waiting runtime.park
  23  waiting runtime.park
  24  waiting runtime.park
  25  waiting runtime.park
  26  waiting runtime.park
  171221  waiting runtime.park
  349234  waiting runtime.park
  282404  waiting runtime.park
  244638  waiting runtime.park
  348202  waiting runtime.park
  348201  waiting runtime.park
  225383  waiting runtime.park
  253117  waiting runtime.park
  348206  waiting runtime.park
  130541  waiting runtime.park
  348205  waiting runtime.park
  225382  waiting runtime.park
  279428  waiting runtime.park
  239623  waiting runtime.park
  130542  waiting runtime.park
  348428  waiting runtime.park
  287810  waiting runtime.park
  323905  waiting runtime.park
  253131  waiting runtime.park
  323556  waiting runtime.park
  273845  waiting runtime.park
  253307  waiting runtime.park
  340095  waiting runtime.park
  146799  waiting runtime.park
  282403  waiting runtime.park
  146800  waiting runtime.park
  212194  waiting runtime.park
  302501  waiting runtime.park
  242746  waiting runtime.park
  279429  waiting runtime.park
  348530  waiting runtime.park
  212195  waiting runtime.park
  305051  waiting runtime.park
  239624  waiting runtime.park
  242745  waiting runtime.park
  171222  waiting runtime.park
  304136  waiting runtime.park
  244639  waiting runtime.park
  273844  waiting runtime.park
  300757  waiting runtime.park
  346271  waiting runtime.park
  304137  waiting runtime.park
  277345  waiting runtime.park
  277344  waiting runtime.park
  347734  waiting runtime.park
  348931  waiting runtime.park
  348427  waiting runtime.park
  302500  waiting runtime.park
  300573  waiting runtime.park
  349378  waiting runtime.park
  323555  waiting runtime.park
  255269  waiting runtime.park
  346727  waiting runtime.park
  348227  waiting runtime.park
  340093  waiting runtime.park
  348531  waiting runtime.park
  300758  waiting runtime.park
  287809  waiting runtime.park
  346728  waiting runtime.park
  348228  waiting runtime.park
  305052  waiting runtime.park
  343936  waiting runtime.park
  300572  waiting runtime.park
  323906  waiting runtime.park
  348930  waiting runtime.park
(gdb) info goroutines
  1  waiting runtime.park
* 2  syscall runtime.notetsleepg
  5 runnable runtime.exitsyscall
* 4  syscall runtime.notetsleepg
  19  waiting runtime.park
  17  waiting runtime.park
  8  syscall runtime.goexit
  18  waiting runtime.park
  20  waiting runtime.park
  21  waiting runtime.park
  22  waiting runtime.park
  23  waiting runtime.park
  24  waiting runtime.park
  25  waiting runtime.park
  26  waiting runtime.park
  171221  waiting runtime.park
  282404  waiting runtime.park
  244638  waiting runtime.park
  348202  waiting runtime.park
  348201  waiting runtime.park
  225383  waiting runtime.park
  253117  waiting runtime.park
  348206  waiting runtime.park
  130541  waiting runtime.park
  348205  waiting runtime.park
  225382  waiting runtime.park
  279428  waiting runtime.park
  239623  waiting runtime.park
  130542  waiting runtime.park
  348428  waiting runtime.park
  287810  waiting runtime.park
  323905  waiting runtime.park
  253131  waiting runtime.park
  323556  waiting runtime.park
  273845  waiting runtime.park
  253307  waiting runtime.park
  340095  waiting runtime.park
  146799  waiting runtime.park
  282403  waiting runtime.park
  146800  waiting runtime.park
  212194  waiting runtime.park
  302501  waiting runtime.park
  242746  waiting runtime.park
  279429  waiting runtime.park
  348428  waiting runtime.park
  287810  waiting runtime.park
  323905  waiting runtime.park
  253131  waiting runtime.park
  323556  waiting runtime.park
  273845  waiting runtime.park
  253307  waiting runtime.park
  340095  waiting runtime.park
  146799  waiting runtime.park
  282403  waiting runtime.park
  146800  waiting runtime.park
  212194  waiting runtime.park
  302501  waiting runtime.park
  242746  waiting runtime.park
  279429  waiting runtime.park
  348530  waiting runtime.park
  212195  waiting runtime.park
  305051  waiting runtime.park
  239624  waiting runtime.park
  242745  waiting runtime.park
  171222  waiting runtime.park
  304136  waiting runtime.park
  244639  waiting runtime.park
  273844  waiting runtime.park
  300757  waiting runtime.park
  346271  waiting runtime.park
  304137  waiting runtime.park
  277345  waiting runtime.park
  277344  waiting runtime.park
  348931  waiting runtime.park
  348427  waiting runtime.park
  302500  waiting runtime.park
  300573  waiting runtime.park
  323555  waiting runtime.park
  255269  waiting runtime.park
  346727  waiting runtime.park
  348227  waiting runtime.park
  340093  waiting runtime.park
  348531  waiting runtime.park
  300758  waiting runtime.park
  287809  waiting runtime.park
  346728  waiting runtime.park
  348228  waiting runtime.park
  305052  waiting runtime.park
  349566  waiting runtime.park
  300572  waiting runtime.park
  323906  waiting runtime.park
  348930  waiting runtime.park
  349547  waiting runtime.park

...and a diff:

17d16
<   349234  waiting runtime.park
45a45,59
>   348428  waiting runtime.park
>   287810  waiting runtime.park
>   323905  waiting runtime.park
>   253131  waiting runtime.park
>   323556  waiting runtime.park
>   273845  waiting runtime.park
>   253307  waiting runtime.park
>   340095  waiting runtime.park
>   146799  waiting runtime.park
>   282403  waiting runtime.park
>   146800  waiting runtime.park
>   212194  waiting runtime.park
>   302501  waiting runtime.park
>   242746  waiting runtime.park
>   279429  waiting runtime.park
60d73
<   347734  waiting runtime.park
65d77
<   349378  waiting runtime.park
77c89
<   343936  waiting runtime.park
---
>   349566  waiting runtime.park
80a93
>   349547  waiting runtime.park

...so there doesn't appear to be an unreasonable number of goroutines running concurrently, and the IDs aren't high enough to indicate a problem with really fast goroutine creation/termination.
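
As an aside, the same information can be pulled from inside the process without gdb. The sketch below would require rebuilding etcd with it added, so it is purely illustrative: it dumps every goroutine's stack on SIGUSR1 via runtime.Stack.

package main

import (
	"fmt"
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

// dumpGoroutinesOnUSR1 writes a full goroutine stack dump to stderr whenever
// the process receives SIGUSR1, an in-process alternative to gdb's
// "info goroutines" for spotting leaked or runaway goroutines.
func dumpGoroutinesOnUSR1() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		buf := make([]byte, 1<<20)
		for range c {
			n := runtime.Stack(buf, true) // true = include all goroutines
			fmt.Fprintf(os.Stderr, "%d goroutines:\n%s\n", runtime.NumGoroutine(), buf[:n])
		}
	}()
}

func main() {
	dumpGoroutinesOnUSR1()
	for {
		time.Sleep(time.Hour) // stand-in for the real server loop
	}
}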

@burke

burke commented Jun 27, 2014

I take it back; it doesn't always grow steadily: the last restart was June 23, and CPU usage currently ranges from 12.6-14.0% across nodes.

@marc-barry
Author

@unihorn

Here is the CPU usage across all five nodes in the cluster since we started monitoring. It looks like it is increasing linearly over time.

[graph: etcd CPU usage across all five nodes, increasing steadily since monitoring began]

@marc-barry
Author

@unihorn

I restarted the etcd service on one of the machines last night and this is the CPU utilization profile (over about 12 hours). This doesn't seem normal.

[graph: CPU utilization over roughly 12 hours after restarting etcd on one machine]

@marc-barry
Author

Here is some profile data for one of the nodes. The profiling happened over a longer period (i.e. a few hours):

(pprof) top20
Total: 320199 samples
  151446  47.3%  47.3%   151446  47.3% runtime.futex
   30656   9.6%  56.9%    30656   9.6% runtime.usleep
   20943   6.5%  63.4%    20943   6.5% siftdown
   14655   4.6%  68.0%    14655   4.6% runtime.xchg
    6756   2.1%  70.1%     6756   2.1% etext
    5346   1.7%  71.8%     5346   1.7% runtime.epollwait
    5217   1.6%  73.4%     5217   1.6% runtime.futexsleep
    4720   1.5%  74.9%     5307   1.7% syscall.Syscall
    3979   1.2%  76.1%     3979   1.2% runtime.atomicload
    3672   1.1%  77.3%     3672   1.1% retake
    3084   1.0%  78.2%     3084   1.0% findrunnable
    2729   0.9%  79.1%    46713  14.6% timerproc
    2426   0.8%  79.8%     2426   0.8% acquirep
    2152   0.7%  80.5%     2152   0.7% runtime.notesleep
    2053   0.6%  81.1%     6318   2.0% runtime.unlock
    2042   0.6%  81.8%     4503   1.4% runtime.mallocgc
    1798   0.6%  82.3%     8243   2.6% runtime.lock
    1758   0.5%  82.9%     1758   0.5% runtime.newstack
    1721   0.5%  83.4%     1721   0.5% sysmon
    1661   0.5%  84.0%     1662   0.5% releasep
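
For anyone wanting to reproduce this: profiles like the one above are normally gathered by exposing net/http/pprof somewhere in the binary and running go tool pprof against /debug/pprof/profile. I am not sure which mechanism produced this particular dump, so the snippet below is just the generic pattern:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Then, for a 30-second CPU profile of the running process:
	//   go tool pprof ./etcd http://127.0.0.1:6061/debug/pprof/profile
	log.Fatal(http.ListenAndServe("127.0.0.1:6061", nil))
}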

@marc-barry
Author

We are going to try bumping GOMAXPROCS to a value > 1 and see what happens.
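
For reference, a quick sketch of the two usual ways to do that (the value 4 is arbitrary; Go programs of this era default to GOMAXPROCS=1): either export GOMAXPROCS in the environment that starts etcd, or call runtime.GOMAXPROCS early in main.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Equivalent to starting the process with GOMAXPROCS=4 in its environment.
	old := runtime.GOMAXPROCS(4)
	fmt.Printf("GOMAXPROCS raised from %d to %d (machine reports %d CPUs)\n",
		old, runtime.GOMAXPROCS(0), runtime.NumCPU())
}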

@xiang90
Contributor

xiang90 commented Jul 2, 2014

@marc-barry We are working on this issue. Can you give me your email address? You can send me an email at xiang.li@coreos.com. Thanks!

@klizhentas

@xiangli-cmu We are seeing the same issue with 0.4.2. CPU usage is 100% and the leader elections are occurring several times per minute. Let me know if you need any more data on this.

@arohter

arohter commented Jul 17, 2014

Same problem here. Brand new 3-node v0.4.5 EC2 cluster; no actual clients / usage.

Seems related: #893

@arohter

arohter commented Jul 22, 2014

[graph: etcd CPU usage over 48 hours on an idle c3.large cluster]

CPU usage over 48h; c3.large cluster, sitting idle. m1.small growth is even more dramatic.

@xiang90
Contributor

xiang90 commented Jul 22, 2014

@arohter This might be related to #900; we just fixed it.

@arohter

arohter commented Jul 22, 2014

Cool. We'll try compiling latest master tomorrow and give it a spin!

@marc-barry
Author

@arohter This might be related to #900; we just fixed it.

@xiangli-cmu @arohter @klizhentas

#900 seems to have solved this issue for us. CPU utilization no longer grows linearly with time. Leadership does change a few (i.e. about 4) times per day. We are currently running bdeb96b until this makes its way into a release.

@jeffbaier

Any idea when there will be a new release with this fix? Our CoreOS clusters are being badly affected by it.

@philips
Contributor

philips commented Jul 29, 2014

@jeffbaier It should show up in alpha a few days after I ship etcd 0.4.6 today.

@philips
Contributor

philips commented Jul 29, 2014

I am going to close this out based on @marc-barry's testing of #900. The fix will be in etcd 0.4.6.

@philips philips closed this as completed Jul 29, 2014
@arohter

arohter commented Aug 1, 2014

While the fix in #900 definitely helped reduce the (re)election rate (and CPU time), we're still seeing quite a bit of leader fragility and flapping. #915 is very similar. We're seeing hundreds of election cycles when doing relatively low amounts of writes, or really anything that puts a bit of real load on the system. It feels like the heartbeating is being blocked or stalled while the etcd process completes some other work.

@marc-barry
Author

I apologize for my lack of comments; I have been on vacation. As @arohter mentioned, #900 has definitely helped. Our etcd cluster peers have been running the code from #900 since July 21 without restarting. We too are still seeing fragility and flapping in leadership, and the CPU utilization has still been growing linearly over time. Here is a graph of the cluster's CPU utilization from July 21 to August 13.

[graph: cluster CPU utilization growing linearly from July 21 to August 13]

The trend is pretty evident. I am going to dump the CPU profile tomorrow and see what etcd has been doing. I think #915 is the same issue we are seeing now. We have snapshot = true as well, but our cluster seems to have run long enough that leadership elections are starting to occur more often. I'd hypothesize that @WillPlatnick in #915 will most likely see the same behaviour at some point in the future, especially since he said he is also seeing missed heartbeats. We are seeing lots of heartbeat near election timeout warnings.

@mbentley

@marc-barry We were seeing a similar situation where we would have steady CPU usage increase until the entire cluster was pegging the CPU on the boxes. What fixed it for us was increasing the peer-heartbeat-timeout (peer-heartbeat-timeout=2000) and peer-election-timeout (peer-election-timeout=3000) parameters.

@marc-barry
Author

@marc-barry We were seeing a similar situation where we would have steady CPU usage increase until the entire cluster was pegging the CPU on the boxes. What fixed it for us was increasing the peer-heartbeat-timeout (peer-heartbeat-timeout=2000) and peer-election-timeout (peer-election-timeout=3000) parameters.

I have also increased the heartbeat timeout and the peer election timeout, but I didn't increase them as drastically (from the defaults) as you did. I imagine that doing what you did will indeed stop the leadership elections and may postpone the CPU increase, but there is still something not quite right. I don't understand how 5 peers on very fast machines within the same rack can see such high delays; we have already verified with packet captures that the delays aren't network related. I need to better understand some comments in the tuning guide (https://github.com/coreos/etcd/blob/master/Documentation/tuning.md), such as:

Adjusting these values is a trade off. Lowering the heartbeat interval will cause individual commands to be committed faster but it will lower the overall throughput of etcd. If your etcd instances have low utilization then lowering the heartbeat interval can improve your command response time.

Does this mean that increasing the heartbeat interval as you have will result in a cluster with longer windows during which a peer (follower) can be inconsistent with the leader?
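
My rough mental model, which is an assumption about how go-raft behaves rather than something I've verified in the code: a follower only learns that entries have been committed when the leader's next append/heartbeat arrives, so the read-staleness window scales with the heartbeat interval, while the worst-case time to notice a dead leader scales with the election timeout. A back-of-the-envelope comparison of the two configurations:

package main

import (
	"fmt"
	"time"
)

func main() {
	rtt := time.Millisecond // generous intra-rack round trip, per the ping numbers above

	// Assumed model: follower staleness <= heartbeat interval + RTT,
	// leader-failure detection time >= election timeout.
	configs := []struct {
		name               string
		heartbeat, timeout time.Duration
	}{
		{"ours (50/250)", 50 * time.Millisecond, 250 * time.Millisecond},
		{"mbentley (2000/3000)", 2000 * time.Millisecond, 3000 * time.Millisecond},
	}
	for _, c := range configs {
		fmt.Printf("%-22s staleness <= ~%v, failover detection >= ~%v\n",
			c.name, c.heartbeat+rtt, c.timeout)
	}
}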

@beldur

beldur commented Sep 19, 2014

I have the same issue with 440 alpha in a cluster of 3 right now.
I will test the higher timeout values next.

@donatello

I have the same issue on v0.4.6 with a 9 node peer cluster with lots of standbys (~35). We are running a cross data-centre cluster with heartbeat_timeout=400 and election_timeout=6000. etcd hits 100% CPU on the leader node.
