
Memory usage improvements #1832

Merged: 3 commits, Oct 2, 2019

Conversation

charith-elastic (Contributor)

TL;DR:

  • Minor optimization to GetActualMastersForCluster to avoid copying a large number of unnecessary objects when dealing with large clusters (a rough sketch follows this list).
  • Update the operator resource limits to provide some headroom for growth (pod or node churn can cause memory spikes).
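The diff is not reproduced here, but as a minimal sketch of the kind of optimization described in the first bullet: iterate over the already-fetched pod list by index and copy only the matching pods, instead of building intermediate copies of everything. The label key and helper below are assumptions for illustration, not the actual ECK code.

```go
package example

import corev1 "k8s.io/api/core/v1"

// isMaster is a hypothetical stand-in for the operator's master-role check
// (assumed here to be a simple pod label lookup).
func isMaster(pod *corev1.Pod) bool {
	return pod.Labels["elasticsearch.k8s.elastic.co/node-master"] == "true"
}

// getActualMasters iterates by index so non-master pods are never copied;
// only the matching pods are appended to the result slice.
func getActualMasters(pods []corev1.Pod) []corev1.Pod {
	var masters []corev1.Pod
	for i := range pods {
		if isMaster(&pods[i]) {
			masters = append(masters, pods[i])
		}
	}
	return masters
}
```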

Details:

While profiling the operator to investigate #1468, the hottest code path with the largest heap usage was consistently in the underlying framework itself: Get and List calls to the controller-runtime client result in a DeepCopy of each object, and client watches have to constantly parse JSON, issue reflection calls, and decode Base64 to convert objects to Go types.
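The PR does not say how the profiles were captured; a common way to get heap profiles and flame graphs out of a Go operator is the standard net/http/pprof package, roughly as below (the localhost:6060 side listener is an assumption for illustration, not part of this PR).

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on a side port so profiles can be pulled with, for example:
	//   go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the operator / controller manager as usual ...
	select {}
}
```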

Findings from the investigation:

  • The operator uses less than 40 MiB of heap even when managing multiple Elasticsearch clusters [1]
  • No indication of a memory leak in the code (memory usage was fairly constant over multiple days, with small spikes during node churn)
  • A patch to controller-runtime significantly reduces heap allocations (Avoid deep copying objects twice in the CacheReader List method kubernetes-sigs/controller-runtime#621)
  • certificates.Reconcile logic should probably be revisited to avoid re-parsing certificates on every run
  • We should avoid List and Get calls as much as possible and reuse results where we can — see the sketch below
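As a sketch of the "reuse where possible" point (not code from this PR; the helper names are hypothetical), the pod list can be fetched once per reconciliation and passed to the steps that need it, instead of each step issuing its own List or Get call, every one of which deep-copies objects out of the controller-runtime cache.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePods lists the pods once and reuses the result across steps,
// rather than having each step issue its own List/Get call.
func reconcilePods(ctx context.Context, c client.Client, namespace string) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(namespace)); err != nil {
		return err
	}
	checkMasters(pods.Items)   // hypothetical helper reusing the same slice
	checkDataNodes(pods.Items) // hypothetical helper reusing the same slice
	return nil
}

func checkMasters(pods []corev1.Pod)   {}
func checkDataNodes(pods []corev1.Pod) {}
```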

[1] The operator was managing the following set of clusters during the investigation:

  • Single cluster with 3 masters and 24 workers
  • 10 clusters with 1 master and 3 workers

[attached: profiling flame graph screenshots]

sebgl (Contributor) left a comment
LGTM.
One reason I could think of that would make the heap grow a little bit higher is the data retrieved from Elasticsearch that we keep in memory in observers. And more generally, any data we fetch from Elasticsearch (even though garbage collected after each reconciliation).

For example the result of _cat/shards will grow with the number of shards in the cluster.
If the cluster is using eg. date-based index name patterns, there's a chance it gets more and more shards over time.
But since this is completely dynamic and depending on ES clusters usage, it's hard to end up with a static memory value that fits all cases.

anyasabo (Contributor) commented Oct 1, 2019

❤️ flame graphs. Would it make sense to open a perf issue for the certificate parsing? It looks like that is a much bigger factor than I expected.
