Kubernetes 1.18+ with ServerSideApply feature gate causes huge CPU usage and execution time. #1372
Labels
- dry-run-diff: Related to dry-run diff behavior
- impact/performance: Something is slower than expected
- impact/usability: Something that impacts users' ability to use the product easily and intuitively
- kind/bug: Some behavior is incorrect or out of spec
- resolution/fixed: This issue was fixed
Problem description
Howdy, we're rolling out Kubernetes 1.18 in our infra and running into some fun problems that only occur when the provider is pointed at a 1.18 cluster with the ServerSideApply feature gate set to true. Our Pulumi program, written in Python, creates an entire EKS cluster (much like pulumi-eks does) and then adds a bunch of DaemonSets/Deployments to the cluster.
One of our deployments (calico-typha-autoscaler) needs the autogenerated Pulumi name from another deployment (calico-typha) in order to start.
We noticed that when targeting a 1.18 cluster, pulumi-language-python-exec would sit at 100% CPU and that pulumi up now took 35 minutes instead of 30 seconds. I then spun up a 1.18 KinD cluster with ServerSideApply set to false, and Pulumi went back to taking only 30 seconds. I then targeted a 1.17 EKS cluster and it worked fine, so it's definitely something with the ServerSideApply feature.
I patched pulumi-language-python-exec to support cProfile (top cumulative-time results here) and a debugger, took a look at what it was doing, and noticed that it was overwhelmingly spending its time creating Output objects as it struggled to parse every managedField as an Input.
From my untrained eye (and I could be talking completely out my ass here), it seems like the managedFields dictionary is incredibly expensive to parse, since it is a huge dictionary containing other dictionaries. Here are logs from the Pulumi engine itself showing a ton of Outputs being made for the calico-typha deployment, where it recursively parses the entire dictionary and every step it takes makes a new Output. From what I can tell on the Python side, this then causes a ton of from_input calls where each recursive step explodes a dictionary.
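To illustrate the blow-up, here's a minimal sketch (hypothetical, not Pulumi's actual from_input code) of why a recursive walk over managedFields is expensive: it creates one wrapper object per node, so the wrapper count grows with the total number of nested keys.

```python
def count_wrapped(value):
    """Count how many wrapper objects a naive recursive
    from_input-style traversal would create for `value`."""
    if isinstance(value, dict):
        # one wrapper for the dict itself, plus one per subtree
        return 1 + sum(count_wrapped(v) for v in value.values())
    if isinstance(value, list):
        return 1 + sum(count_wrapped(v) for v in value)
    return 1  # scalar leaf

# A tiny managedFields-style nested dict (the real one is far larger)
managed_fields = {
    "f:spec": {"f:template": {"f:spec": {"f:containers": {"k:{...}": {"f:image": {}}}}}}
}
print(count_wrapped(managed_fields))  # → 7
```

With the real managedFields payload for a Deployment, this node count easily reaches the hundreds, and each node costs an Output allocation plus async bookkeeping.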
I think part of the problem may come from this workaround: we have a long-standing issue where we have to do a pulumi.Output.all on resource.id and resource.metadata to make sure that the "name" of the Kubernetes resource is Known even when the Kubernetes cluster doesn't exist yet. #906 has more details.
Reproducing the issue
https://github.com/jeid64/pulumi-provider-bug/blob/118bug/__main__.py
The 118bug branch contains a full repro. Instead of deploying Calico, it deploys "nginx" as the container image, using the rest of the manifest we would use for Calico. If you point it at a 1.18 EKS cluster you'll see it takes 35 minutes for pulumi up to complete, and on 1.17 only 30 seconds. It's just 3 Kubernetes resources; let me know if there's any question or confusion.
Suggestions for a fix
Is there any way to ignore the managedFields dictionary in the Go provider before it sends state to the Python part of the provider? I know there is upcoming support in pulumi-kubernetes to use it for dry-run support; dunno if ignoring the field would cause problems for that.
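In user-code terms, what I'm imagining is something like this (a hypothetical helper, not actual provider code; the real fix would presumably live on the Go side before state crosses the RPC boundary):

```python
def without_managed_fields(obj: dict) -> dict:
    """Return a copy of a Kubernetes object dict with
    metadata.managedFields removed (hypothetical helper)."""
    out = dict(obj)
    metadata = out.get("metadata")
    if isinstance(metadata, dict):
        out["metadata"] = {
            k: v for k, v in metadata.items() if k != "managedFields"
        }
    return out


live = {
    "kind": "Deployment",
    "metadata": {
        "name": "calico-typha",
        "managedFields": [{"manager": "pulumi", "operation": "Apply"}],
    },
}
print(without_managed_fields(live)["metadata"])  # → {'name': 'calico-typha'}
```

Dropping the field this early would keep the huge nested dictionary out of the Python SDK's recursive Output parsing entirely; the open question is whether the dry-run/diff support needs it preserved.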