Kubernetes 1.18+ with ServerSideApply feature gate causes huge CPU usage and execution time. #1372
Labels
- dry-run-diff: Related to dry-run diff behavior
- impact/performance: Something is slower than expected
- impact/usability: Something that impacts users' ability to use the product easily and intuitively
- kind/bug: Some behavior is incorrect or out of spec
- resolution/fixed: This issue was fixed
Problem description
Howdy, we're rolling out Kubernetes 1.18 in our infra and running into some fun problems that only occur when the provider is pointed at a 1.18 cluster with the ServerSideApply feature gate set to true. Our Pulumi program, written in Python, creates an entire EKS cluster (much like pulumi-eks does) and then adds a bunch of DaemonSets/Deployments to the cluster.
One of our deployments (calico-typha-autoscaler) needs the autogenerated Pulumi name from another deployment (calico-typha) in order to start.
We noticed that when targeting a 1.18 cluster, pulumi-language-python-exec would sit at 100% CPU and that pulumi up now took 35 minutes instead of 30 seconds. I then spun up a 1.18 KinD cluster with ServerSideApply set to false, and Pulumi went back to taking only 30 seconds. I then targeted a 1.17 EKS cluster and it worked fine, so it's definitely something with the ServerSideApply feature.
I patched pulumi-language-python-exec to support cProfile (top cumulative-time results here) and a debugger, took a look at what it was doing, and noticed that it was overwhelmingly spending its time creating Output objects as it struggled to parse every managedField as an Input.
From my untrained eye (and I could be talking completely out my ass here), it seems like the managedFields dictionary is incredibly expensive to parse, since it is a huge dictionary containing other dictionaries. Here are logs from the Pulumi engine itself showing a ton of Outputs being made for the calico-typha deployment, where it recursively parses the entire dictionary and every step it takes makes a new Output. From what I can tell on the Python side, this then causes a ton of from_input calls where each recursive step explodes a dictionary.
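To illustrate the blow-up, here's a minimal sketch (hypothetical, not Pulumi's actual from_input code) of why a recursive walk over managedFields is expensive: it creates one wrapper object per node, so the wrapper count grows with the total number of nested keys.

```python
def count_wrapped(value):
    """Count how many wrapper objects a naive recursive
    from_input-style traversal would create for `value`."""
    if isinstance(value, dict):
        # one wrapper for the dict itself, plus one per subtree
        return 1 + sum(count_wrapped(v) for v in value.values())
    if isinstance(value, list):
        return 1 + sum(count_wrapped(v) for v in value)
    return 1  # scalar leaf

# A tiny managedFields-style nested dict (the real one is far larger)
managed_fields = {
    "f:spec": {"f:template": {"f:spec": {"f:containers": {"k:{...}": {"f:image": {}}}}}}
}
print(count_wrapped(managed_fields))  # → 7
```

With the real managedFields payload for a Deployment, this node count easily reaches the hundreds, and each node costs an Output allocation plus async bookkeeping.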
I think part of the problem may come from this workaround: we have a long-standing issue where we have to do a pulumi.Output.all on resource.id and resource.metadata to make sure that the "name" of the Kubernetes resource is Known even when the Kubernetes cluster doesn't exist yet. #906 has more details.
Reproducing the issue
https://github.com/jeid64/pulumi-provider-bug/blob/118bug/__main__.py
The 118bug branch contains a full repro. Instead of deploying Calico, it deploys "nginx" as the container image, using the rest of the manifest we would use for Calico. If you point it at a 1.18 EKS cluster you'll see it takes 35 minutes for pulumi up to complete, and on 1.17 only 30 seconds. It's just 3 Kubernetes resources; let me know if there's any question or confusion.
Suggestions for a fix
Is there any way to ignore the managedFields dictionary in the Go provider before it sends state to the Python part of the provider? I know there is upcoming support in pulumi-kubernetes to use it for dry-run support; dunno if ignoring the field would cause problems for that.
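In user-code terms, what I'm imagining is something like this (a hypothetical helper, not actual provider code; the real fix would presumably live on the Go side before state crosses the RPC boundary):

```python
def without_managed_fields(obj: dict) -> dict:
    """Return a copy of a Kubernetes object dict with
    metadata.managedFields removed (hypothetical helper)."""
    out = dict(obj)
    metadata = out.get("metadata")
    if isinstance(metadata, dict):
        out["metadata"] = {
            k: v for k, v in metadata.items() if k != "managedFields"
        }
    return out


live = {
    "kind": "Deployment",
    "metadata": {
        "name": "calico-typha",
        "managedFields": [{"manager": "pulumi", "operation": "Apply"}],
    },
}
print(without_managed_fields(live)["metadata"])  # → {'name': 'calico-typha'}
```

Dropping the field this early would keep the huge nested dictionary out of the Python SDK's recursive Output parsing entirely; the open question is whether the dry-run/diff support needs it preserved.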