This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

CPU runaway / memory leak in yaml parser #427

Closed
geNAZt opened this issue May 24, 2020 · 13 comments
Labels: blocked needs validation (In need of validation before further action), bug (Something isn't working)

Comments

@geNAZt

geNAZt commented May 24, 2020

Describe the bug

We started seeing random crashes caused by liveness probes failing in our helm-operator installations. After looking into a profile taken from a running instance, we saw that CPU and memory usage climb until the process itself is no longer responsive.

Another behaviour we saw was that helm release objects get stuck in the pending-update state, which we have to clean up manually; I guess that's due to the stale "starting sync run".

To Reproduce

Steps to reproduce the behaviour:

  1. Install the helm operator with CPU and memory limits of 500m and 256Mi (see the sketch below)
  2. Install some charts from helm stable charts
  3. See how the helm operator gets killed by the failing liveness probe, leaving helm releases broken
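For reference, step 1 might translate to something like the following values fragment for the fluxcd/helm-operator chart (a sketch only; it assumes the chart exposes the usual resources values):

  # Hypothetical values.yaml fragment matching the limits in step 1;
  # `resources` is assumed to be the chart's standard passthrough to the
  # operator container's resources.
  resources:
    limits:
      cpu: 500m
      memory: 256Mi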

Expected behavior

Not crashing and not corrupting helm releases

Logs

helm-operator logs:

W0524 08:49:49.261320       6 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ts=2020-05-24T08:49:49.29648543Z caller=operator.go:82 component=operator info="setting up event handlers"
ts=2020-05-24T08:49:49.29653343Z caller=operator.go:98 component=operator info="event handlers set up"
ts=2020-05-24T08:49:49.296856635Z caller=main.go:287 component=helm-operator info="waiting for informer caches to sync"
ts=2020-05-24T08:49:49.397055732Z caller=main.go:292 component=helm-operator info="informer caches synced"
ts=2020-05-24T08:49:49.397139534Z caller=git.go:104 component=gitchartsync info="starting sync of git chart sources"
ts=2020-05-24T08:49:49.397141434Z caller=operator.go:110 component=operator info="starting operator"
ts=2020-05-24T08:49:49.397198135Z caller=operator.go:112 component=operator info="starting workers"
ts=2020-05-24T08:49:49.398192549Z caller=server.go:42 component=daemonhttp info="starting HTTP server on :3030"
ts=2020-05-24T08:49:49.398508054Z caller=release.go:75 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:49.398857159Z caller=release.go:75 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:50.189199369Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=0.10.1
ts=2020-05-24T08:49:50.989017121Z caller=release.go:249 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="running dry-run upgrade to compare with release version '5'" action=dry-run-compare
ts=2020-05-24T08:49:50.991875863Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:51.00369044Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:53.915386944Z caller=helm.go:69 component=helm version=v3 info="performing update for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:54.100451308Z caller=helm.go:69 component=helm version=v3 info="dry run for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:54.392106666Z caller=release.go:268 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:49:54.392299469Z caller=release.go:75 component=release release=elasticsearch targetNamespace=graylog resource=graylog:helmrelease/elasticsearch helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:58.510576689Z caller=release.go:105 component=release release=elasticsearch targetNamespace=graylog resource=graylog:helmrelease/elasticsearch helmVersion=v3 error="failed to determine sync action for release: status 'pending-upgrade' of release does not allow a safe upgrade"
ts=2020-05-24T08:49:58.511182698Z caller=release.go:75 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:04.356173295Z caller=release.go:249 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="running dry-run upgrade to compare with release version '1'" action=dry-run-compare
ts=2020-05-24T08:50:04.356167495Z caller=release.go:249 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="running dry-run upgrade to compare with release version '1'" action=dry-run-compare
ts=2020-05-24T08:50:04.499243431Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:04.49980734Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:04.516823894Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:04.550429796Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.900198651Z caller=helm.go:69 component=helm version=v3 info="performing update for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.906840951Z caller=helm.go:69 component=helm version=v3 info="performing update for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:05.909939397Z caller=helm.go:69 component=helm version=v3 info="dry run for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.916438994Z caller=helm.go:69 component=helm version=v3 info="dry run for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:06.29659257Z caller=release.go:268 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:06.303724677Z caller=release.go:75 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:06.503599862Z caller=release.go:268 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:06.503798764Z caller=release.go:75 component=release release=chartmuseum targetNamespace=chartmuseum resource=chartmuseum:helmrelease/chartmuseum helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:13.194666158Z caller=release.go:249 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="running dry-run upgrade to compare with release version '3'" action=dry-run-compare
ts=2020-05-24T08:50:14.195036991Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:14.488930578Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.392153912Z caller=helm.go:69 component=helm version=v3 info="performing update for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.691649982Z caller=helm.go:69 component=helm version=v3 info="dry run for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.991267853Z caller=release.go:268 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:17.993421085Z caller=release.go:75 component=release release=graylog targetNamespace=graylog resource=graylog:helmrelease/graylog helmVersion=v3 info="starting sync run"

pprof top 10:

Showing top 10 nodes out of 371
      flat  flat%   sum%        cum   cum%
    1610ms 15.32% 15.32%     1610ms 15.32%  runtime.memclrNoHeapPointers
     850ms  8.09% 23.41%      850ms  8.09%  math/big.addMulVVW
     420ms  4.00% 27.40%      800ms  7.61%  runtime.scanobject
     290ms  2.76% 30.16%     1160ms 11.04%  math/big.nat.montgomery
     280ms  2.66% 32.83%      850ms  8.09%  gopkg.in/yaml%2ev2.yaml_parser_scan_plain_scalar
     270ms  2.57% 35.39%      270ms  2.57%  runtime.futex
     270ms  2.57% 37.96%      270ms  2.57%  syscall.Syscall
     250ms  2.38% 40.34%     2950ms 28.07%  runtime.mallocgc
     220ms  2.09% 42.44%      220ms  2.09%  runtime.memmove
     170ms  1.62% 44.05%      230ms  2.19%  encoding/json.checkValid

Additional context

Maybe related things:
After some searching I found these:

yaml/libyaml#111
yaml/libyaml#115

which leads me to believe that there is a serious issue in the YAML parsing code, which can bring the whole application down without any notice.

Current index.yaml from helm stable:
index.yaml.zip

Gitlab index.yaml
gitlab_index.yaml.zip

@geNAZt added the blocked needs validation and bug labels on May 24, 2020
@hiddeco
Member

hiddeco commented May 25, 2020

Can you try running the Helm Operator with the following environment variable set: GODEBUG=madvdontneed=1? I am wondering if we are being bitten by the new GC behaviour of Go >=1.12.
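If it helps, a minimal sketch of setting that variable directly on the operator's container spec (the container name and surrounding structure are illustrative, not taken from the chart):

  # Illustrative Deployment fragment; only the env entry matters here.
  spec:
    containers:
      - name: helm-operator   # name is an assumption
        env:
          - name: GODEBUG
            value: madvdontneed=1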

@geNAZt
Author

geNAZt commented May 25, 2020

The change in GODEBUG did change one thing (when going back to 256Mi): it gets OOMKilled instead of cycling forever trying to free up memory. But the final result is the same: it crashes or gets killed.

Setting it to 512 MB (fixed) and 200m CPU runs fine for me. Are there any guidelines on what limits are acceptable for the operator? For my liking, 512 MB for a system that applies helm releases is a bit much :/

// EDIT
After some further investigation it seems that 200m CPU is the minimum needed. If I give less CPU it crashes right away after 30 seconds; giving 500m allows me to reduce the memory. This leads me to believe that the GC doesn't get enough time to clean up.

// EDIT 2
To clear up some things, here is a table of which configurations crash and how:

With GODEBUG=madvdontneed=1

              | 256MB memory          | 512MB memory
  200m CPU    | crashes (OOMKilled)   | crashes (after 5-10 minutes, liveness probe failures)
  500m CPU    | crashes (OOMKilled)   | works

Without any env:

              | 256MB memory                                           | 512MB memory
  200m CPU    | crashes (after 5-10 minutes, liveness probe failures)  | crashes (after 5-10 minutes, liveness probe failures)
  500m CPU    | crashes (after 5-10 minutes, liveness probe failures)  | works

Letting the helm operator run without any limits, it seems to allocate ~300MB from the host (kubectl top pods reports that) and use around 100-120m CPU, which seems weird to me.

@hiddeco
Member

hiddeco commented May 25, 2020

@geNAZt are you by any chance active on either the CNCF or Weaveworks Slack community? I would like to provide some insights (and have a chat about it). I am @hidde on both.

@geNAZt
Author

geNAZt commented May 25, 2020

No, but if you have a link I would be glad to join.

@hiddeco
Member

hiddeco commented May 25, 2020

https://slack.cncf.io for an invite, you can find me in #flux (and #flux-dev).

@ewarwick

I am also experiencing this problem, with and without the GODEBUG=madvdontneed=1 setting: I get periodic OOMKills with version 1.1.0 and the resources below. Has there been any progress on this issue?

Limits:
  cpu:     1500m
  memory:  2Gi
Requests:
  cpu:      1
  memory:   1512M

@kuwabarahiroshi

> Letting the helm operator run without any limits, it seems to allocate ~300MB from the host (kubectl top pods reports that) and use around 100-120m CPU, which seems weird to me.

We experienced the same situation. The issue appeared a few days ago, and when we removed the limits configuration for helm-operator, the periodic OOMKills stopped.

@jonkerj

jonkerj commented Oct 19, 2020

Having the same issue, without the GODEBUG variable set. These are my resource settings:

requests:
  cpu: 50m
  memory: 256Mi
limits:
  cpu: 1
  memory: 1Gi

I have raised the memory limit to 2G, but even that gets OOM'd. Running 1.2.0 at the moment.

@Rathgore

We are seeing this as well. I just noticed that over the last week one of our Helm Operator instances has grown to use 3 GB of RAM using the default requests/limits (see image below), although it continues to function fine. I've set CPU requests to 500m and memory requests and limits to 512Mi for this deployment to see what difference it makes.

[Screenshot: memory usage graph — Screen Shot 2020-10-26 at 2 45 04 PM]

@Rathgore

Rathgore commented Oct 26, 2020

In another environment it is rising well above 3 GB, then fails the health check and restarts, as described in the first comment.

[Screenshot: memory usage graph — Screen Shot 2020-10-26 at 2 58 15 PM]

@Rathgore

Rathgore commented Oct 27, 2020

As others have implied, the best workaround seems to be setting memory limits such that memory growth is bounded at a low enough level that a stop-the-world GC can complete before triggering a liveness probe failure and a restart of the pod. Memory growth needs to be low enough, and the container needs enough CPU, for the collection to complete in time. Attempting to work around the issue by setting a very high memory limit will probably just make the problem worse, since a collection of that size won't have enough time to finish.

Here's what I've done:

  • Set memory requests and limits to 512Mi
  • Set CPU requests to 500m with 1000m limit to provide enough CPU for the collections to complete
  • Increased the liveness probe failure threshold to 10, to ensure that a few failures while the process is blocked don't cause it to restart (a combined sketch of these settings follows this list).
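A rough sketch of these settings combined on the operator container (the probe path and port are assumptions inferred from the HTTP server on :3030 in the logs above, not confirmed against the chart):

  # Sketch of the workaround; values mirror the bullet list above.
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 512Mi
  livenessProbe:
    httpGet:
      path: /healthz        # assumed probe endpoint
      port: 3030            # assumed; matches the daemonhttp port in the logs
    failureThreshold: 10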

This combination seems to be working so far. Memory usage grows to about 200Mi over the course of 45 minutes or so, then a collection happens. A few container unhealthy alerts are fired because of the blocked process, but because of the failure threshold increase it's not enough to trigger a restart. I could probably eliminate the unhealthy events by tuning some of the other probe thresholds.

I will report back after running this way for a few weeks to see if this workaround is still solid.

@Rathgore

After additional testing, we've figured out the main trigger for the memory leaks, at least in our case: some inexperienced chart developers were pushing broken charts that stayed in a failing deploy loop for multiple days. After getting rid of the failing deployments, memory usage has been flat for over 24 hours now. It only took 4 failing charts to balloon memory beyond 3 GB over the period of a week or so.

With the seemingly imminent (and exciting) arrival of GOTK/Flux 2 I'm going to guess this won't be fixed in the classic Helm Operator any time soon, but we have a workable solution for now until we can move to next-gen Flux.

@kingdonb
Member

kingdonb commented Sep 2, 2022

Sorry if your issue remains unresolved. The Helm Operator is in maintenance mode; we recommend that everybody upgrades to Flux v2 and the Helm Controller.

A new release of Helm Operator is out this week, 1.4.4.

We will continue to support Helm Operator in maintenance mode for an indefinite period of time, and eventually archive this repository.

Please be aware that Flux v2 has a vibrant and active developer community who are actively working through minor releases and delivering new features on the way to General Availability for Flux v2.

In the meantime, this repo will still be monitored, but support is basically limited to migration issues only. I will have to close many issues today without reading them all in detail because of time constraints. If your issue is very important, you are welcome to reopen it, but given the staleness of all issues at this point, a new report is more likely to be in order. Please open another issue in the appropriate Flux v2 repo if you have unresolved problems that prevent your migration.

Helm Operator releases will continue where possible for a limited time, as a courtesy to those who cannot migrate yet, but they are strongly discouraged for ongoing production use: our strict adherence to semver backward-compatibility guarantees pins many dependencies, and we can only upgrade them so far without breaking compatibility. So there are likely known CVEs that cannot be resolved.

We recommend upgrading to Flux v2 which is actively maintained ASAP.

I am going to go ahead and close every issue at once today.
Thanks for participating in Helm Operator and Flux! 💚 💙
