CPU runaway / memory leak in yaml parser #427
Can you try running the Helm Operator with the following environment variable enabled:
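(The flag referenced in the follow-up comments is GODEBUG=madvdontneed=1. As a hedged sketch, assuming the operator runs as a plain Kubernetes Deployment, setting it could look roughly like this; the names and image tag are illustrative, not taken from this thread.)

```yaml
# Sketch only: sets the Go runtime flag discussed in this thread on the
# operator container. Deployment/container names and image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helm-operator
spec:
  selector:
    matchLabels:
      app: helm-operator
  template:
    metadata:
      labels:
        app: helm-operator
    spec:
      containers:
        - name: helm-operator
          image: docker.io/fluxcd/helm-operator:1.2.0  # illustrative tag
          env:
            - name: GODEBUG
              value: madvdontneed=1
```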
The GODEBUG change did change one thing (when going back to 256Mi): the pod now gets OOMKilled instead of cycling forever while trying to free up memory. But the end result is the same, it crashes or gets killed. Setting a fixed 512 MB memory limit and 200m CPU runs fine for me. Are there any guidelines on what limits are acceptable for the operator? For my personal liking, 512 MB for a system which applies Helm releases is a bit much :/

// EDIT

// EDIT 2

With GODEBUG=madvdontneed=1:

Without any env:

Letting the helm-operator run without any limits, it seems to allocate ~300MB from the host (kubectl top pods reports that) and use around 100-120m CPU, which seems odd to me.
@geNAZt are you by any chance active on either the CNCF or Weaveworks Slack community? I would like to provide some insights (and have a chat about it). I am
No, if you have a link I would be glad to join.
https://slack.cncf.io for an invite; you can find me in
I am also experiencing this problem, with and without the GODEBUG=madvdontneed=1 flag: the pod periodically gets OOMKilled with version 1.1.0 and the resource settings below. Has there been any progress on this issue?
We experienced the same situation. The issue appeared a few days ago, and when we removed
Having the same issue, with the following resources:

    requests:
      cpu: 50m
      memory: 256Mi
    limits:
      cpu: 1
      memory: 1Gi

Have raised the limit to 2G memory, but even that gets OOM'd. Running 1.2.0 at the moment.
We are seeing this as well. I just noticed that over the last week one of our Helm Operator instances has grown to use 3 GB of RAM using the default requests/limits (see image below), although it continues to function fine. I've set CPU requests to 500m and memory requests and limits to 512Mi for this deployment to see what difference it makes.
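For reference, a minimal sketch of the resource settings described in the comment above (500m CPU requests, 512Mi memory requests and limits) as they could appear under the operator container's spec; only the numbers come from the comment, the surrounding structure is standard Kubernetes:

```yaml
# Container-level fragment (goes under spec.template.spec.containers[] in the
# Deployment). Only the values are taken from the comment above.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    memory: 512Mi
```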
As others have implied, the best workaround seems to be setting memory limits such that memory growth is bounded at a low enough level that a stop-the-world GC has enough time to complete before triggering a liveness probe failure and a restart of the pod. The memory growth needs to be low enough and the container needs to have enough CPU allocated to complete the collection in time. Attempting to work around the issue by setting a really high memory limit will probably just make the problem worse, not giving enough time for a collection of that size. Here's what I've done:
This combination seems to be working so far. Memory usage grows to about 200Mi over the course of 45 minutes or so, then a collection happens. A few container unhealthy alerts are fired because of the blocked process, but because of the failure threshold increase it's not enough to trigger a restart. I could probably eliminate the unhealthy events by tuning some of the other probe thresholds. I will report back after running this way for a few weeks to see if this workaround is still solid.
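The commenter's exact numbers aren't shown, but as a hedged illustration of the tuning described (a low memory ceiling, enough CPU, and a raised liveness failureThreshold so a long GC pause doesn't trigger a restart), the container spec could look something like this; the probe path and port assume the operator's default health endpoint, and every value is illustrative rather than taken from the thread:

```yaml
# Illustrative container-level fragment; values are assumptions that only show
# where the relevant knobs live, not the commenter's actual settings.
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 256Mi        # low ceiling so a stop-the-world GC stays short
livenessProbe:
  httpGet:
    path: /healthz       # assumed default health endpoint
    port: 3030
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 6    # tolerate a few failed probes during a GC pause
```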
After additional testing, we've figured out the main trigger for the memory leaks, at least in our case: some inexperienced chart developers had pushed broken charts that were stuck in a failing deploy loop for multiple days. After getting rid of the failing deployments, memory usage has been flat for over 24 hours now. It only took 4 failing charts to balloon memory beyond 3 GB over the period of a week or so. With the seemingly imminent (and exciting) arrival of GOTK/Flux 2, I'm going to guess this won't be fixed in the classic Helm Operator any time soon, but we have a workable solution for now until we can move to next-gen Flux.
Sorry if your issue remains unresolved. The Helm Operator is in maintenance mode; we recommend everybody upgrades to Flux v2 and Helm Controller. A new release of Helm Operator is out this week, 1.4.4. We will continue to support Helm Operator in maintenance mode for an indefinite period of time, and eventually archive this repository.

Please be aware that Flux v2 has a vibrant and active developer community who are actively working through minor releases and delivering new features on the way to General Availability for Flux v2.

In the meantime, this repo will still be monitored, but support is basically limited to migration issues only. I will have to close many issues today without reading them all in detail because of time constraints. If your issue is very important, you are welcome to reopen it, but due to the staleness of all issues at this point a new report is more likely to be in order. Please open another issue in the appropriate Flux v2 repo if you have unresolved problems that prevent your migration.

Helm Operator releases will continue as possible for a limited time, as a courtesy for those who still cannot migrate yet, but they are strongly not recommended for ongoing production use: our strict adherence to semver backward-compatibility guarantees limits how far we can upgrade many dependencies without breaking compatibility, so there are likely known CVEs that cannot be resolved. We recommend upgrading to the actively maintained Flux v2 ASAP. I am going to go ahead and close every issue at once today.
Describe the bug
We started seeing random crashes due to liveness probes failing in our helm-operator installations. After looking into a profile taken from a running instance, we saw that CPU and memory usage climb until the process itself is no longer responsive.
Another behaviour we saw is that HelmRelease objects get stuck in a pending-update state, which we have to clean up manually; I guess that is due to the stale "starting sync run".
To Reproduce
Steps to reproduce the behaviour:
Expected behavior
Not crashing and not corrupting helm releases
Logs
helm-operator logs:
pprof top 10:
Additional context
Maybe related things
After some searching I found these:
yaml/libyaml#111
yaml/libyaml#115
which leads me to believe that there is a serious issue in the YAML parsing code which can bring the whole application down without any notice.
Current index.yaml from helm stable:
index.yaml.zip
GitLab index.yaml:
gitlab_index.yaml.zip