
Memory usage improvements #1832

Merged: 3 commits, Oct 2, 2019

Conversation

charith-elastic (Contributor)

TL;DR:

  • Minor optimization to GetActualMastersForCluster to avoid copying a large number of unnecessary objects when dealing with large clusters (a rough sketch follows this list).
  • Update the operator resource limits to provide some headroom for growth (pod or node churn can cause memory spikes).
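The diff is not reproduced here, but as a minimal sketch of the kind of optimization described in the first bullet: iterate over the already-fetched pod list by index and copy only the matching pods, instead of building intermediate copies of everything. The label key and helper below are assumptions for illustration, not the actual ECK code.

```go
package example

import corev1 "k8s.io/api/core/v1"

// isMaster is a hypothetical stand-in for the operator's master-role check
// (assumed here to be a simple pod label lookup).
func isMaster(pod *corev1.Pod) bool {
	return pod.Labels["elasticsearch.k8s.elastic.co/node-master"] == "true"
}

// getActualMasters iterates by index so non-master pods are never copied;
// only the matching pods are appended to the result slice.
func getActualMasters(pods []corev1.Pod) []corev1.Pod {
	var masters []corev1.Pod
	for i := range pods {
		if isMaster(&pods[i]) {
			masters = append(masters, pods[i])
		}
	}
	return masters
}
```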

Details:

While profiling the operator to investigate #1468, the hottest code path with the largest heap usage was consistently in the underlying framework itself: Get and List calls to the controller-runtime client result in a DeepCopy of each object, and client watches have to constantly parse JSON, issue reflection calls, and decode Base64 to convert objects to Go types.
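The PR does not say how the profiles were captured; a common way to get heap profiles and flame graphs out of a Go operator is the standard net/http/pprof package, roughly as below (the localhost:6060 side listener is an assumption for illustration, not part of this PR).

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on a side port so profiles can be pulled with, for example:
	//   go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the operator / controller manager as usual ...
	select {}
}
```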

Findings from the investigation:

  • The operator uses less than 40 MiB of heap even when managing multiple Elasticsearch clusters [1]
  • No indication of a memory leak in the code (memory usage was fairly constant over multiple days, with small spikes during node churn)
  • A patch to controller-runtime significantly reduces heap allocations (Avoid deep copying objects twice in the CacheReader List method kubernetes-sigs/controller-runtime#621)
  • certificates.Reconcile logic should probably be revisited to avoid re-parsing certificates on every run
  • We should avoid List and Get calls as much as possible and reuse results where we can — see the sketch below
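As a sketch of the "reuse where possible" point (not code from this PR; the helper names are hypothetical), the pod list can be fetched once per reconciliation and passed to the steps that need it, instead of each step issuing its own List or Get call, every one of which deep-copies objects out of the controller-runtime cache.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcilePods lists the pods once and reuses the result across steps,
// rather than having each step issue its own List/Get call.
func reconcilePods(ctx context.Context, c client.Client, namespace string) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(namespace)); err != nil {
		return err
	}
	checkMasters(pods.Items)   // hypothetical helper reusing the same slice
	checkDataNodes(pods.Items) // hypothetical helper reusing the same slice
	return nil
}

func checkMasters(pods []corev1.Pod)   {}
func checkDataNodes(pods []corev1.Pod) {}
```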

[1] The operator was managing the following set of clusters during the investigation:

  • Single cluster with 3 masters and 24 workers
  • 10 clusters with 1 master and 3 workers

[attached: profiling flame graph screenshots]

sebgl (Contributor) left a comment
LGTM.
One reason I could think of that would make the heap grow a little bit higher is the data retrieved from Elasticsearch that we keep in memory in observers. And more generally, any data we fetch from Elasticsearch (even though garbage collected after each reconciliation).

For example the result of _cat/shards will grow with the number of shards in the cluster.
If the cluster is using eg. date-based index name patterns, there's a chance it gets more and more shards over time.
But since this is completely dynamic and depending on ES clusters usage, it's hard to end up with a static memory value that fits all cases.

anyasabo (Contributor) commented Oct 1, 2019

❤️ flame graphs. Would it make sense to open a perf issue for the certificate parsing? It looks like that is a much bigger factor than I expected.
