
Seeing election rate ramp up with time #904

Closed
onsi opened this issue Jul 22, 2014 · 14 comments

onsi commented Jul 22, 2014

This is etcd v0.4.3 with a leader election timeout of 1s.

We had a cluster of 3 etcd nodes under relatively light (but constant) load with an uptime of 14 days.

We started to see some etcd performance issues and investigated. Turned out that etcd was performing an election every ~minute and was on its ~19,000th term.

None of the stats from the etcd stats endpoints seemed concerning. Our latencies looked like:

{
  "leader": "etcd_z2-0",
  "followers": {
    "etcd_z1-0": {
      "latency": {"current": 3.932186, "average": 4.135497288276446, "standardDeviation": 19.117594667921274, "minimum": 0.787246, "maximum": 2931.903107},
      "counts": {"fail": 1265, "success": 4697152}
    },
    "etcd_z1-1": {
      "latency": {"current": 7.809908, "average": 5.0774800768625505, "standardDeviation": 23.873377341480882, "minimum": 0.804202, "maximum": 2811.686235},
      "counts": {"fail": 1609, "success": 4659563}
    }
  }
}
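
For reference, the latency numbers above come from etcd's leader stats endpoint. Here's a minimal Go polling sketch for capturing them over time -- it assumes the v0.4-era stats path /v2/stats/leader and the default client port 4001, so adjust for your deployment:

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

func main() {
	for {
		// Assumed endpoint and port; point this at one of your etcd members.
		resp, err := http.Get("http://127.0.0.1:4001/v2/stats/leader")
		if err != nil {
			fmt.Println("stats error:", err)
		} else {
			body, _ := ioutil.ReadAll(resp.Body)
			resp.Body.Close()
			// Timestamped line per poll, easy to grep/plot later.
			fmt.Printf("%s %s\n", time.Now().Format(time.RFC3339), body)
		}
		time.Sleep(30 * time.Second)
	}
}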

When we hit each node with a SIGQUIT we saw very many (~1000) goroutines that looked like:

goroutine 1612870 [chan receive]:
net/http.(*persistConn).readLoop(0xc21ac9a380)
  /usr/local/go/src/pkg/net/http/transport.go:778 +0x68f
created by net/http.(*Transport).dialConn
  /usr/local/go/src/pkg/net/http/transport.go:528 +0x607
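
Those readLoop goroutines are what net/http leaves behind when keep-alive connections get opened but never reused or shut down. A minimal standalone sketch (not etcd code) that produces the same kind of pile-up by discarding a fresh http.Transport on every request:

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"net/http/httptest"
	"runtime"
)

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	before := runtime.NumGoroutine()
	for i := 0; i < 100; i++ {
		// A fresh Transport per request: the keep-alive connection it opens
		// sits in that Transport's idle pool with a readLoop goroutine parked
		// waiting for the next request, and is never reused or closed.
		tr := &http.Transport{}
		resp, err := (&http.Client{Transport: tr}).Get(srv.URL)
		if err != nil {
			panic(err)
		}
		ioutil.ReadAll(resp.Body)
		resp.Body.Close()
		// Sharing a single Transport (or calling tr.CloseIdleConnections())
		// avoids the accumulation.
	}
	fmt.Printf("goroutines before=%d after=%d\n", before, runtime.NumGoroutine())
}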

The SIGQUIT subsequently killed each node (grrr...). When the nodes were restarted, the system recovered and the elections were back under control.

We decided to dig deeper. We took our etcd logs and made a plot of election term vs time:

[plot "terms": election term vs. time]

The x-axis is time in days; the y-axis is the term number. The vertical lines correspond to etcd cluster restart events (where we restarted etcd because we were redeploying our system). You'll notice two increasingly steep rises: one at day ~7, the other at day ~26. Interestingly, the time between each restart (at day 0 and day 19) and the subsequent rise in elections is roughly the same: ~7 days.
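
For anyone who wants to reproduce the plot: it just comes from scraping term numbers out of the etcd logs. A rough Go sketch of that scan -- the regex below is a placeholder, not etcd's actual log format, so adapt it to whatever your log lines look like:

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Placeholder pattern: adjust to how your etcd version logs term numbers.
	termRe := regexp.MustCompile(`term[:= ]+(\d+)`)
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if m := termRe.FindStringSubmatch(line); m != nil {
			// Print the term alongside the original line (which carries the
			// timestamp prefix), ready to feed into a plotting tool.
			fmt.Println(m[1], line)
		}
	}
}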

The load on the system in question is not particularly high and is generally quite constant. In fact, we're seeing similar behavior (and time constants) on two other environments, one with dramatically less load and one with substantially more.

One last thing: we didn't run lsof before the SIGQUIT so, unfortunately, we don't know if we were leaking file descriptors. However, one of our environments has just hit the 7-day mark (and is just starting to show signs of election churn) and lsof on that box looked normal.

So, we don't think this is related to #880 -- but it might be since we are on v0.4.3.

Anyone seen this? Anything else we can mine for data?

onsi commented Jul 22, 2014

FWIW, we just analyzed the logs on our low-traffic environment (it has roughly an order of magnitude less etcd traffic) and we see the same time delays between restarts and election ramp-ups:

[plot "tabasco_terms": election term vs. time for the low-traffic environment]

xiang90 commented Jul 23, 2014

@onsi Have you ever seen this in the previous version of etcd?

onsi commented Jul 23, 2014

No, we have not. And I'm quite confident that we've successfully seen etcd run for longer than 7 days with the older versions we've deployed.

onsi commented Jul 23, 2014

hey @xiangli-cmu

It looks like it's starting to ramp up on that third environment I mentioned (the really large one). Is there anything diagnostic you'd like me to look at? I'm going to get memory and cpu usage metrics this time (I failed to do that last time. durp).

onsi commented Jul 23, 2014

Sadly we aren't running with -trace, so I can't get much more. Do you know if -trace has major performance implications? Can we default to running with -trace?

onsi commented Jul 23, 2014

Another update. Our third (large) environment has hit the point (7 days!) where this became an issue.

Here's its plot:
[plot "prod_terms": election term vs. time for the large environment]

This one's really interesting. As you can see, there is a long time period (over a month) where etcd did not show this issue. This is etcd 9af9438 (basically 0.3).

The blue dashed line is where we deployed v0.4.3 - the upgrade went fine. However, now on v0.4.3 we see that after 7 days of uptime the number of elections starts to skyrocket.

So -- it looks like this was an issue introduced between ~0.3 and v0.4.3.

Some more data:

  • I took this opportunity to run lsof. There were no leaked file descriptors/open ports. In particular, I observed an election take place and saw the open watch ports (we have about 18 or so) move from the old leader to the new leader; they were shut down correctly on the old leader. I'm pretty sure, then, that this is not related to #880 (etcd: add a read/write timeout to server) - and it may still be happening in 0.4.5 (I haven't managed to reproduce it without waiting for... 7 days!)
  • I also looked at memory and CPU usage. Memory was hovering at around 800 MB for a snapshot file of size ~3 MB, and CPU usage was at 5%. After I restarted etcd (which caused it to recover and stop all the re-elections) I saw memory usage drop to ~100-200 MB and CPU to 1%.
  • I shut etcd down with SIGQUIT and - again - saw a large number of:
goroutine 4501114 [chan receive]:
net/http.(*persistConn).readLoop(0xc210759080)
  /var/vcap/data/packages/golang/aa5f90f06ada376085414bfc0c56c8cd67abba9c.1/src/pkg/net/http/transport.go:778 +0x68f
created by net/http.(*Transport).dialConn
  /var/vcap/data/packages/golang/aa5f90f06ada376085414bfc0c56c8cd67abba9c.1/src/pkg/net/http/transport.go:528 +0x607

I can send you these SIGQUIT traces and etcd logs if you'd like (the latter simply show a litany of leader elections).

yichengq (Contributor) commented:

@onsi
The memory leak is probably caused by #900.
Could you check the mode of each instance in the cluster? Do they switch between standby and peer a lot?
7 days seems to be a magic number.
I would love to see traces and logs if it is convenient.

onsi commented Jul 23, 2014

hey @unihorn

We're not using the standby feature. The instances were swapping leadership roles (leader <=> follower).

I've uploaded the logs and trace dumps:
https://www.dropbox.com/s/k0pny88xvcvxp3g/etcd-logs.zip

The memory leak may well be due to #900 -- but:

  • I'm only at 800 MB and have plenty of headroom
  • does this explain the entirety of the performance degradation?
  • there aren't a crazy number of leaking goroutines (just ~1000 after seven days) but there definitely are a few. They don't seem to be related to leaking timers.

Perhaps it's garbage collection in the presence of millions of leaked timers that's causing the performance issue? Unclear...
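
For anyone else landing here: the kind of timer leak referenced in #900 is commonly the classic Go pattern of calling time.After on every iteration of a select loop; each call allocates a timer that can't be collected until it fires. A sketch of that pattern and the usual fix -- illustrative only, not the actual etcd code path:

package main

import "time"

// leaky handles work and timeouts, but allocates a fresh timer via
// time.After on every loop iteration; each one lives until it expires,
// so with long durations they pile up and pressure the GC.
func leaky(work, done chan struct{}) {
	for {
		select {
		case <-work:
			// handle work
		case <-time.After(time.Hour):
			// timeout handling
		case <-done:
			return
		}
	}
}

// fixed reuses a single timer, stopping and resetting it explicitly.
func fixed(work, done chan struct{}) {
	timer := time.NewTimer(time.Hour)
	defer timer.Stop()
	for {
		select {
		case <-work:
			// handle work
		case <-timer.C:
			// timeout handling
		case <-done:
			return
		}
		// Drain the channel if the timer already fired, then rearm it.
		if !timer.Stop() {
			select {
			case <-timer.C:
			default:
			}
		}
		timer.Reset(time.Hour)
	}
}

func main() {
	// Demo stubs only; see the measurement sketch further down the thread.
}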

yichengq (Contributor) commented:

@onsi
The leader changes happen more and more frequently, and become severe starting at ~7 days. My guess is that etcd becomes slower and slower until it cannot even send heartbeats in time.
If millions of timer structs cannot be garbage collected, allocating and freeing variables could become rather slow, and that may be the reason for the degradation.

I checked the data_etcd_xxx files, but there are only ~100 goroutines left at the time of the kill. Is this a different cluster from the first one?

onsi commented Jul 24, 2014

I agree with your intuition. I'll try to validate it tomorrow by increasing the rate at which the timers accumulate.

The data_etcd_xxx files are correct. etcd only had those goroutines running at the time that I killed it -- these are the same running etcd processes that were having the election issues.

Were you expecting to see a leaked goroutine for each timer? Do you know where all the readLoop goroutines are coming from? They may be a sign that something else isn't getting cleaned up.

Onsi

marc-barry commented:

@onsi

I created issue #868 a while ago. It appears that #900 may also solve your issue.

onsi commented Jul 24, 2014

Thanks @marc-barry.

Has it been validated that the timer leak fix also fixes the election issue? We were only seeing the issue emerge after ~7 days of uptime, which is a long time to wait for validation. I might try a quick experiment to confirm that this is in fact coming just from the timer leak.

onsi

onsi commented Jul 24, 2014

Just ran an experiment to verify that the timer leak is causing this. I rigged etcd to leak a timer very frequently and saw elections set in after about 2.5e6 timer allocations. Based on the constants in etcd, it's currently leaking 2.5 timers a second, so it takes 1e6 seconds (~11 days) to get to leader elections. Given that my laptop is faster than the AWS instances we run on, I think this is in the right ballpark of what we're seeing (7 days).
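
For the curious, here's a rough standalone approximation of that experiment (not the rigged etcd build itself): allocate long-lived timers the way a leaked time.After would and watch the heap and GC pauses climb:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	const batch = 500000
	for leaked := batch; leaked <= 2500000; leaked += batch {
		for i := 0; i < batch; i++ {
			// Each time.After allocates a runtime timer that cannot be
			// collected until it fires; a long duration makes it an
			// effective leak, similar to what #900 describes.
			_ = time.After(time.Hour)
		}
		// Force a collection so the pause numbers reflect the current heap.
		runtime.GC()
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		lastPause := time.Duration(ms.PauseNs[(ms.NumGC+255)%256])
		fmt.Printf("leaked=%d heapMB=%d lastGCpause=%v\n",
			leaked, ms.HeapAlloc>>20, lastPause)
	}
}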

Gonna mark this as closed.

onsi closed this as completed Jul 24, 2014

yichengq (Contributor) commented:

@onsi So glad this could be solved!
Regarding there being only ~100 goroutines at SIGQUIT: I want to double-check that the phenomenon of ~1000 goroutines, most of them in readLoop, really did exist in your first cluster. This may be a strange case for us to dig into later.
