Assess impact of watch cache on large production cluster (5k+ namespaces) #16112

Closed
smarterclayton opened this issue Sep 1, 2017 · 9 comments

Comments

@smarterclayton (Contributor) commented Sep 1, 2017

On a large, dense cluster, memory use of the API server is a primary scaling concern because the watch cache keeps copies of all objects in memory. We conducted a test on such a cluster to assess the impact of disabling the watch cache.

Scenario:

  • HA cluster, masters running OpenShift 3.6 (Kube 1.6 + OpenShift patches), ~170 nodes, 5k namespaces
  • Disabled watch cache on the API server of the master that the active controller was talking to
  • Measured results via prometheus on that master
  • Restart was at 17:00 UTC

[graph] CPU over time for that master (red is apiserver, blue is etcd)

[graph] RSS for that master (purple is apiserver, brown is controller)

[graph] Internal Go heap stats for all masters

No observed change to tail latency. Still measuring the impact on CPU vs. the number of small watches.

@smarterclayton (Contributor, Author) commented:

The watch cache benefits the API server by avoiding decodes from etcd (CPU + garbage) and, to a lesser extent, by collapsing common watches. It functions best when N x C x W is high (N = number of resources queried before filtering, C = number of clients, W = write rate).

Endpoints and pods are high write, high N, and high C. Nodes are high write, (mostly) small N, and small C. Most other resources are either low write or low client, where the memory use is a wash.

I think the recommendation should be that on dense clusters only resources with a high N x C x W should have the watch cache on. On small clusters the same settings can be used safely. On very large node-count clusters, any node-accessed resource may want the watch cache enabled.
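A minimal sketch of the N x C x W heuristic above as a scoring function; the resource profiles, numbers, and the threshold are hypothetical illustrations, not values measured in this test:

```go
package main

import "fmt"

// resourceProfile is a hypothetical summary of a resource's watch traffic:
// N = objects returned per query before filtering, C = watching clients,
// W = writes per second.
type resourceProfile struct {
	Name string
	N    int     // objects per query, pre-filtering
	C    int     // number of watching clients
	W    float64 // writes per second
}

// wantsWatchCache applies the N x C x W heuristic: enable the cache only
// when the product clears a threshold. The threshold is a placeholder.
func wantsWatchCache(p resourceProfile, threshold float64) bool {
	return float64(p.N)*float64(p.C)*p.W >= threshold
}

func main() {
	profiles := []resourceProfile{
		{Name: "endpoints", N: 5000, C: 200, W: 22},   // high N, high C, high W
		{Name: "pods", N: 8000, C: 300, W: 15},        // high N, high C, high W
		{Name: "nodes", N: 170, C: 10, W: 20},         // high W, small N and C
		{Name: "rolebindings", N: 5000, C: 5, W: 0.1}, // low write, low client
	}
	const threshold = 1e5 // hypothetical cutoff
	for _, p := range profiles {
		fmt.Printf("%-14s watch cache: %v\n", p.Name, wantsWatchCache(p, threshold))
	}
}
```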

@smarterclayton (Contributor, Author) commented:

Ran 1k simultaneous endpoint watches (ramping up over 2 minutes), with endpoints changing at a rate of 22 writes a second. Observed a memory increase, but it did not stay consistently high. No CPU impact was observed.

A 40-60% reduction in master memory, with only a small CPU increase, from disabling the watch cache for the majority of resources seems like a desirable tradeoff.
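For reference, a minimal sketch of the kind of watch load generator described above, assuming a modern client-go; the kubeconfig path, namespace, watcher count, and ramp duration are placeholders, not the exact harness used here:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const (
		watchers  = 1000            // simultaneous endpoint watches
		rampOver  = 2 * time.Minute // spread watch creation over this window
		namespace = "default"       // placeholder namespace
	)

	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < watchers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w, err := client.CoreV1().Endpoints(namespace).Watch(context.Background(), metav1.ListOptions{})
			if err != nil {
				log.Printf("watch failed: %v", err)
				return
			}
			defer w.Stop()
			for range w.ResultChan() {
				// drain events until the watch closes
			}
		}()
		time.Sleep(rampOver / watchers) // ramp up gradually
	}
	wg.Wait()
}
```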

@smarterclayton (Contributor, Author) commented:

@openshift/sig-scalability

@jeremyeder (Contributor) commented:

What about other resources? Disk I/O, network, and sockets? How might this affect performance in a partitioned scenario?

@smarterclayton (Contributor, Author) commented:

Network was the same, sockets were the same, and disk I/O still needs to be measured. We use HTTP/2 to the etcd backend, so it's all multiplexed.

Define partitioned.
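For context on the multiplexing point above, a minimal sketch assuming the current etcd clientv3 library: one client holds one gRPC (HTTP/2) connection, and every watch is a stream multiplexed over it, so the socket count stays flat regardless of watch count. The endpoints and key prefixes below are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// One client == one gRPC (HTTP/2) connection; all watches below are
	// multiplexed as streams over it.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	pods := cli.Watch(ctx, "/registry/pods/", clientv3.WithPrefix())
	endpoints := cli.Watch(ctx, "/registry/services/endpoints/", clientv3.WithPrefix())

	for {
		select {
		case resp := <-pods:
			log.Printf("pods: %d events", len(resp.Events))
		case resp := <-endpoints:
			log.Printf("endpoints: %d events", len(resp.Events))
		}
	}
}
```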

@jeremyeder (Contributor) commented:

By that I meant ... any implications that differ in a single master vs multi-master scenario.

I see your resource graphs but ... what were the latencies all along the histogram (not just tail)?

I ask because you are saying that the watch cache for (most) resources buys us nothing until we have a certain scale, and you're optimizing for reduced memory usage in the common case. As long as there is supporting data, this makes sense to me. It would be good to have the data supporting this included in the BZ here.

The side effect of this proposed change is needing to document how to identify when a cluster would benefit from more watch caches, how to enable them (and which to enable), and how to calculate memory requirements for enabling watch caches down the line.
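As a starting point for that last item, a minimal back-of-envelope sketch: the resident cost of a watch cache is roughly object count times average decoded object size, plus the rolling window of recent events kept for watch resumption. All counts and sizes below are hypothetical placeholders, not measurements from this cluster.

```go
package main

import "fmt"

// cacheEstimate is a rough, hypothetical model of watch-cache memory:
// the decoded objects currently held, plus the event history window
// (each cached event holds a copy of an object).
func cacheEstimate(objectCount, avgObjectBytes, windowSize int) int {
	live := objectCount * avgObjectBytes
	history := windowSize * avgObjectBytes
	return live + history
}

func main() {
	// Placeholder numbers: 5k namespaces x ~10 endpoints each, ~4 KiB
	// decoded per object, and a 1000-event cache window.
	bytes := cacheEstimate(50000, 4*1024, 1000)
	fmt.Printf("estimated endpoints watch cache: ~%d MiB\n", bytes/(1024*1024))
}
```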

@smarterclayton (Contributor, Author) commented:

There was no change in latency tails. We're going to try the whole cluster tomorrow and will gather that. I don't expect an IO change but will look.

@smarterclayton (Contributor, Author) commented:

[graph] Whole cluster converted: RSS for masters on core services

[graph] Whole cluster converted: in-use heap on API servers

  • initial state - one master (blue) has watch cache disabled
  • 14:40 - restarted masters to clear previously loaded objects from the heap
  • 15:02 - restarted second master (non-active-controller) to disable watch cache
  • 15:10 - restarted third master (active-controller) to disable watch cache

[graph] Master CPU over a longer period (core services)

No change to etcd write volume.

@smarterclayton (Contributor, Author) commented:

99th percentile latency did not change significantly.

pweil- assigned mfojtik and jeremyeder and unassigned jeremyeder Sep 11, 2017
openshift-merge-robot added a commit that referenced this issue Sep 26, 2017
Automatic merge from submit-queue (batch tested with PRs 16546, 16398, 16157)

Backport upstream changes to watch cache enablement

Disables the watch cache for most resources by default, except those accessed by many clients. This has been shown to have only a minor impact on the production workload.

Fixes #16112