spike in ETCD IO: raft apply, peer traffic #7981

Closed
armstrongli opened this issue May 24, 2017 · 43 comments

@armstrongli

armstrongli commented May 24, 2017

We got a peer traffic spike in our ETCD cluster:
[screenshot: 2017-05-25 1:24:31 AM]

[screenshot: 2017-05-25 1:14:02 AM]

After reviewing the logs of all the ETCD members, I didn't find anything abnormal.

Our guess about this situation is that:

  • All (or most) of the ETCD followers' raft logs differ from the leader's
  • The ETCD leader is sending its whole data set to all ETCD followers

The data transferred is about 500MB.

  • All the members are using SSD
  • All the members are running in the same network
  • All the servers' time is synced
  • No leader election.
  • No client traffic drops
  • No DB size spike
  • No memory spike

Every ETCD server has 10 CPUs and 20 GB of memory.

@gyuho
Contributor

gyuho commented May 24, 2017

Do you have server logs from during those spikes? That would help us debug. Thanks!

@armstrongli
Author

armstrongli commented May 24, 2017

@gyuho I suspect there must be something wrong on the storage side (in the source code).
We have encountered issues such as:

  • raft in sync, but data differs
  • sudden memory increases until OOM-killed
  • raft spikes without any warning signs
  • extremely unbalanced data

@gyuho
Contributor

gyuho commented May 24, 2017

@armstrongli What's the etcd version? There was a memory leak issue if you use etcd <= 3.1.4.

https://github.com/coreos/etcd/blob/master/NEWS#L7
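
A quick way to confirm the running version is to query each member's /version endpoint. A minimal Go sketch (the endpoint URL is a placeholder for one of your client URLs):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// versionInfo mirrors the JSON returned by etcd's /version endpoint.
type versionInfo struct {
	Server  string `json:"etcdserver"`
	Cluster string `json:"etcdcluster"`
}

func main() {
	// Placeholder endpoint; replace with one of your member client URLs.
	resp, err := http.Get("http://127.0.0.1:2379/version")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var v versionInfo
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		panic(err)
	}
	fmt.Printf("etcdserver=%s etcdcluster=%s\n", v.Server, v.Cluster)
}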

@armstrongli
Author

armstrongli commented May 24, 2017

Yes, it's < 3.1.4.
Thanks. We will upgrade our cluster to >= 3.1.5.

@xiang90
Contributor

xiang90 commented May 24, 2017

It seems a lot of raft proposals were received around 06:46, so the leader sent more data out to the followers; that is expected. The RPC rate did not increase, but that metric only tracks v3 RPCs. Do you perhaps have a v2 application? Please always provide server logs from ALL members.

@armstrongli
Author

armstrongli commented May 24, 2017

Do you perhaps have a v2 application?

All the ETCD members are v3, and all the kube-apiservers use etcd3 as the backend storage.

@xiang90
Contributor

xiang90 commented May 25, 2017

@armstrongli Then I do not understand where these proposals came from. Your guess does not seem correct. Clearly it was the proposals causing the peer traffic spikes, not internal syncing. I still need the logs to confirm.

@xinxiaogang

@xiang90 What exactly is a raft proposal?

@xinxiaogang

@xiang90 Also, in the dashboard we presented, there is a disk sync duration spike along with the peer traffic spike. Are all of these caused by the raft proposal spike?
Another question: will this disk sync duration spike cause a leader election and leader change?

@xinxiaogang

xinxiaogang commented May 25, 2017

I am in the same organization as @armstrongli. For now, we actually see two major issues with our ETCD cluster:

  1. The peer traffic spike happens randomly, as presented by @armstrongli
  2. Frequent leader changes, as shown below (last 12 hours' data)
    [screenshot: leader changes over the last 12 hours]

We are not sure whether these two issues are related.

@armstrongli
Author

@xiang90
Contributor

xiang90 commented May 26, 2017

@armstrongli I need the logs that correspond to the graph. The logs you provided are from 22:00 to 23:00, not 08:00 to 20:00. Please provide useful information. It is really hard for me to understand what you want and to interpret the data you provided. It is a waste of time for both sides if the information is not accurate.

@armstrongli
Author

The graph is in Beijing time, and the logs are in UTC.

@armstrongli
Author

@xiang90 The logs correspond to the graph I posted in the issue description.

@xiang90
Contributor

xiang90 commented May 26, 2017

@armstrongli Can you please convert the timestamps so that they match?

@mitake
Contributor

mitake commented May 26, 2017

Hi @armstrongli @xinxiaogang

I'm still not sure about your case, but in some extreme cases etcd's batched append strategy could cause high spikes. This is because etcd batches proposals (roughly equivalent to requests from clients) into a single AppendEntries(). This strategy improves both throughput and latency significantly for ordinary workloads. However, I think the strategy can introduce very large delays if the number of proposals in an interval is extremely large, because current etcd doesn't provide a parameter for tuning the batching behavior. But please note that I'm not sure this is your problem; I've never seen a case like yours.

In this branch, https://github.com/mitake/etcd/tree/batch-append-limit, you can find a new option --batch-append that takes a single integer as its argument. The number is the limit on the number of batched entries. I would be glad if you could try it in your environment and report the result.
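
To illustrate the idea (a hypothetical sketch, not etcd's actual raft code): the leader drains whatever proposals are already queued into one batch, but stops at a configurable limit, so a flood of proposals is split across several AppendEntries rounds instead of one huge one:

package main

import "fmt"

// proposal stands in for a client request waiting to be replicated.
type proposal struct{ data string }

// drainBatch collects proposals already queued on ch, but never more than
// limit, so a single AppendEntries round stays bounded.
// Hypothetical sketch of the batching idea, not etcd's actual code.
func drainBatch(ch <-chan proposal, limit int) []proposal {
	batch := make([]proposal, 0, limit)
	for len(batch) < limit {
		select {
		case p := <-ch:
			batch = append(batch, p)
		default:
			return batch // nothing else queued right now
		}
	}
	return batch
}

func main() {
	ch := make(chan proposal, 10)
	for i := 0; i < 10; i++ {
		ch <- proposal{data: fmt.Sprintf("req-%d", i)}
	}
	// With limit=4, the 10 queued proposals go out as batches of 4, 4 and 2.
	for {
		b := drainBatch(ch, 4)
		if len(b) == 0 {
			break
		}
		fmt.Println("AppendEntries with", len(b), "proposals")
	}
}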

@xiang90 what do you think about monitoring the number of appended entries? I can work on it.

@xiang90
Contributor

xiang90 commented May 26, 2017

@armstrongli

I can confirm from the logs that there was no snapshot transfer and no election. There is a proposal spike which caused the IO spike, as I mentioned.

Do you have auth enabled for the cluster?

@xiang90
Contributor

xiang90 commented May 26, 2017

@mitake

If you look at the RPC rate, it is actually very low, about 80 RPCs per second, but the proposal rate is high. I am wondering if auth retries cause that.

@mitake
Contributor

mitake commented May 26, 2017

@xiang90 Ah yes, the RPC rate isn't high at all; I missed that. I'll also check the auth retry.

@xiang90
Contributor

xiang90 commented May 26, 2017

@mitake Yeah... 80k pending proposals is just crazy. I do not know what triggered it.

@xiang90
Contributor

xiang90 commented May 26, 2017

@mitake I believe they run etcd 3.0.15. I remember we fixed something auth-retry related? /cc @heyitsanthony

@mitake
Contributor

mitake commented May 26, 2017

@xiang90 Would 39e9b1f be the change? I looked at it, but I'm still not sure it could produce that many pending proposals.

@armstrongli
Author

@xiang90

Do you have auth enabled for the cluster?

No, auth is not enabled.
cc @mitake @heyitsanthony

@xiang90
Contributor

xiang90 commented May 26, 2017

@armstrongli Not sure what was going on then. My best guess is that something was hitting the etcd v2 API pretty hard.

@armstrongli
Author

There is no application reaching etcd except kube-apiserver, and all the apiservers are using --storage-backend=etcd3.
@xiang90 Is there any tool or metric to catch/show the requests coming from v2?

@armstrongli
Author

@xiang90 @mitake
Our ETCD cluster has been going crazy for almost an hour without any sign of stopping:

[screenshot: 2017-05-26 7:20:05 PM]

Now it's not one spike anymore.

@mitake
Contributor

mitake commented May 26, 2017

@armstrongli If you have time, could you try this commit: mitake@f83d5a2? If you add the new option like this, --batch-append 1000, the leader node will forcibly submit the AppendEntries() RPC once it has 1000 entries, without queuing subsequent proposals.

BTW how many nodes does your k8s cluster have?

@xinxiaogang

@mitake we have 200 nodes, 3000 pods, etc.

@xinxiaogang

@mitake @xiang90 From our monitoring, the client traffic reaches around 10+ Mbps for each ETCD node.

@xinxiaogang

So from the dashboard, the etcd nodes keep processing proposals during that time, with large entry commits. How can we find out where these proposals come from?

@xiang90
Contributor

xiang90 commented May 26, 2017

So from the dashboard, the etcd nodes keep processing proposals during that time, with large entry commits. How can we find out where these proposals come from?

Monitor your applications. etcd does not support auditing at the moment.

@xiang90
Contributor

xiang90 commented May 26, 2017

@xiang90 Is there any tool or metric to catch/show the requests coming from v2?

cURL etcd at {endpoint}/metrics; there will be some v2-related metrics:

https://github.com/coreos/etcd/blob/master/Documentation/v2/metrics.md#http-requests
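
A minimal sketch of doing that scrape programmatically and keeping only the v2 HTTP counters (the endpoint URL is a placeholder; this is equivalent to curl-ing {endpoint}/metrics and filtering for lines starting with etcd_http_):

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Placeholder client URL; point this at each member in turn.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the v2 HTTP request metrics, e.g. etcd_http_received_total.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "etcd_http_") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}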

One thing to note is that the dashboard setup might be incorrect too, so the actual RPC rate might not be what you showed us. We are still not sure what really happened.

@xinxiaogang

xinxiaogang commented May 27, 2017

Monitor your applications. etcd does not support auditing at the moment.

Monitor which metrics of my application? The application talking to etcd is kube-apiserver. We do audit our apiserver requests, and they are normal as usual.

@xiang90
Contributor

xiang90 commented May 27, 2017

@xinxiaogang You need to monitor what requests your application sends to etcd, the database, not what requests your application receives; that part has nothing to do with etcd directly. I have no idea where these proposals are coming from. etcd will not generate proposals out of nothing...

Again, I am also not sure whether the graphs you provided are accurate. That is something you have to figure out.

We do not really have enough data to help you.

@xinxiaogang

@xiang90 Is there a way or a metric I can get from etcd showing the incoming requests from the application? I know etcd peers also use HTTP to communicate; how can I exclude the peer traffic and look only at the application calls?

@xiang90
Contributor

xiang90 commented May 27, 2017

@xinxiaogang

As I mentioned above, you should read https://github.com/coreos/etcd/blob/master/Documentation/v2/metrics.md#http-requests:

These metrics describe the serving of requests (non-watch events) served by etcd members in non-proxy mode: total incoming requests, request failures and processing latency (inc. raft rounds for storage). They are useful for tracking user-generated traffic hitting the etcd cluster.

It answers your question.

@xiang90
Contributor

xiang90 commented May 27, 2017

For v3 metrics, read the doc here: https://github.com/coreos/etcd/blob/master/Documentation/metrics.md

@xinxiaogang

xinxiaogang commented May 27, 2017

We are running our etcd on v3, so we should look at the v3 metrics, correct?
Are RPC-related metrics only available in v2?

@xinxiaogang

There is only the GET method in the metrics?

# HELP etcd_http_received_total Counter of requests received into the system (successfully parsed and authd).
# TYPE etcd_http_received_total counter
etcd_http_received_total{method="GET"} 1

@xiang90
Contributor

xiang90 commented May 27, 2017

@xinxiaogang Can you paste all the metrics here?

@xiang90
Contributor

xiang90 commented May 27, 2017

of all your servers

@xinxiaogang

xinxiaogang commented May 27, 2017

@xiang90
Contributor

xiang90 commented Jun 14, 2017

We reproduced this on our own testing k8s cluster; check #8096. Closing this, since this thread is unhelpful.

xiang90 closed this as completed on Jun 14, 2017