
Kubernetes 1.18+ with the ServerSideApply feature gate causes huge CPU usage and execution time. #1372

Closed
jeid64 opened this issue Nov 10, 2020 · 7 comments · Fixed by pulumi/pulumi#7175


jeid64 commented Nov 10, 2020

Problem description

Howdy, we're rolling out Kubernetes 1.18 in our infra and running into some fun problems that only occur when the provider is pointed at a 1.18 cluster with the ServerSideApply feature gate set to true. Our Pulumi program, written in Python, creates an entire EKS cluster (much like pulumi-eks does) and then adds a bunch of DaemonSets/Deployments to the cluster.

One of our deployments (calico-typha-autoscaler) needs the autogenerated Pulumi name from another deployment (calico-typha) in order to start.

We noticed that when targeting a 1.18 cluster, pulumi-language-python-exec would sit at 100% CPU and that pulumi up now took 35 minutes instead of 30 seconds. I then spun up a 1.18 KinD cluster with ServerSideApply set to false, and Pulumi went back to taking only 30 seconds. I then targeted a 1.17 EKS cluster and it worked fine, so it's definitely something with the ServerSideApply feature.

I patched pulumi-language-python-exec to support cProfile (top cumulative-time results here) and a debugger and took a look at what it was doing, and noticed that it was overwhelmingly spending its time creating Output objects as it struggled to parse every managedFields entry as an Input.
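
Roughly, wrapping the entry point with cProfile looks like the following (a generic sketch, not the actual patch to pulumi-language-python-exec; run_program is a hypothetical stand-in for the program's entry point):

```python
import cProfile
import pstats

def run_program():
    # Hypothetical stand-in for the Pulumi program's entry point.
    ...

profiler = cProfile.Profile()
profiler.enable()
run_program()
profiler.disable()

# Print the 20 biggest offenders by cumulative time, as in the linked results.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```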

From my untrained eye (and I could be talking completely out my ass here), it seems like the managedFields dictionary is incredibly expensive to parse, as it is a huge dictionary containing other dictionaries. Here are logs from the Pulumi engine itself showing a ton of Outputs being made for the calico-typha deployment, where it recursively parses the entire dictionary and every step it takes creates a new Output. From what I can tell on the Python side, this then causes a ton of from_input calls, where each recursive step explodes a dictionary.

I think part of the problem may come from this workaround: we have a long-standing issue where we have to do a pulumi.Output.all on resource.id and resource.metadata to make sure that the "name" of the Kubernetes resource is known even when the Kubernetes cluster doesn't exist yet. #906 has more details.
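
For context, a minimal sketch of that workaround pattern, using the nginx placeholder image from the repro below; the resource names and manifest here are illustrative, not the real Calico manifest:

```python
import pulumi
import pulumi_kubernetes as k8s

calico_typha = k8s.apps.v1.Deployment(
    "calico-typha",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        selector=k8s.meta.v1.LabelSelectorArgs(match_labels={"app": "typha"}),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(labels={"app": "typha"}),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[k8s.core.v1.ContainerArgs(name="typha", image="nginx")],
            ),
        ),
    ),
)

# The workaround from #906: combine id and metadata so the auto-generated
# name is known even while the cluster itself is still being created. On
# 1.18+ this also drags metadata.managedFields through the Output machinery.
typha_name = pulumi.Output.all(calico_typha.id, calico_typha.metadata).apply(
    lambda args: args[1].name
)
```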

Reproducing the issue

https://github.com/jeid64/pulumi-provider-bug/blob/118bug/__main__.py
The 118bug branch contains a full repro. Instead of deploying Calico, it deploys "nginx" as the container image, using the rest of the manifest we would use for Calico. If you point it at a 1.18 EKS cluster you'll see that pulumi up takes 35 minutes to complete; on 1.17 it takes 30 seconds. It's just 3 Kubernetes resources; let me know if there's any question or confusion.

Suggestions for a fix

Is there any way to ignore the managedFields dictionary in the Go provider before it sends the state to the Python part of the provider? I know there is upcoming support in pulumi-kubernetes to use it for dry-run diffs; I don't know if ignoring the field would cause problems for that.
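
To make the suggestion concrete: the pruning itself would be trivial; the open question is whether dropping the field breaks the planned dry-run support. A plain-Python illustration of the idea (not actual provider code, and the state below is made up):

```python
def prune_managed_fields(state: dict) -> dict:
    """Return a copy of a Kubernetes object's live state without metadata.managedFields."""
    metadata = state.get("metadata")
    if isinstance(metadata, dict) and "managedFields" in metadata:
        metadata = {k: v for k, v in metadata.items() if k != "managedFields"}
        return {**state, "metadata": metadata}
    return state

# Made-up live state, roughly what a 1.18 API server returns with SSA enabled.
live_state = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "calico-typha-1a2b3c4d",
        "managedFields": [
            {"manager": "pulumi-resource-kubernetes", "fieldsV1": {"f:spec": {}}}
        ],
    },
}
print(prune_managed_fields(live_state)["metadata"])  # managedFields is gone
```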


jeid64 commented Nov 10, 2020

@pgavlin @lblackstone is this related to your work on server-side apply?

lblackstone (Member) commented

Thanks for the detailed bug report. We've fixed performance problems in the past related to deeply nested Outputs, so I'll bet you're right about it being related to the managedFields. I'll take a look at it this week.

@lblackstone lblackstone self-assigned this Nov 10, 2020
@lblackstone lblackstone added impact/performance Something is slower than expected impact/usability Something that impacts users' ability to use the product easily and intuitively labels Nov 10, 2020
@lblackstone lblackstone added this to the current milestone Nov 10, 2020

jeid64 commented Jan 12, 2021

@lblackstone any updates on this? thanks!

lblackstone (Member) commented

Sorry, I got pulled off on other work, and haven't had a chance to come back to this yet. It's still on my TODO list.

@leezen leezen removed this from the current milestone Jan 31, 2021

jeid64 commented May 5, 2021

Any updates on this ticket? We've pretty much stopped using .apply because of it.

@lblackstone lblackstone added the dry-run-diff Related to dry run diff behavior label May 5, 2021
lblackstone (Member) commented

Sorry for the delay on this. I'll take a look as part of #1556

lblackstone (Member) commented

I suspect this might be the same underlying issue as #1597

lukehoban pushed a commit to pulumi/pulumi that referenced this issue Jun 1, 2021
These mutually recursive functions unintentionally had exponential complexity in nesting depth of objects, arg types and most likely arrays.

Remove the exponential complexity by avoiding direct recursion of from_input on itself, and relying on mutual recursion with all alone to reduce nested substructure.

Also simplify the implementation to aid readability.

Fixes pulumi/pulumi-kubernetes#1597.
Fixes pulumi/pulumi-kubernetes#1425.
Fixes pulumi/pulumi-kubernetes#1372.
Fixes #3987.
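
The shape of the bug the commit describes can be seen in a small standalone sketch (illustrative only, not the actual SDK code): a lift that re-processes substructures it has already handled does exponential work in the nesting depth, while lifting each child exactly once stays linear.

```python
def lift_naive(value, counter):
    # Lifting every child and then lifting the rebuilt dict again repeats
    # work at every level of nesting -> O(2^depth) calls.
    counter[0] += 1
    if isinstance(value, dict):
        lifted = {k: lift_naive(v, counter) for k, v in value.items()}
        return {k: lift_naive(v, counter) for k, v in lifted.items()}
    return value

def lift_once(value, counter):
    # Each child is lifted exactly once -> O(depth) calls.
    counter[0] += 1
    if isinstance(value, dict):
        return {k: lift_once(v, counter) for k, v in value.items()}
    return value

# Build a nested dict roughly the shape of metadata.managedFields.
nested = {}
node = nested
for _ in range(18):
    node["f:spec"] = {}
    node = node["f:spec"]

naive_calls, once_calls = [0], [0]
lift_naive(nested, naive_calls)
lift_once(nested, once_calls)
print(naive_calls[0], once_calls[0])  # 524287 vs 19
```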
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Jun 1, 2021
@lukehoban lukehoban added this to the 0.57 milestone Jun 1, 2021
@lukehoban lukehoban added the kind/bug Some behavior is incorrect or out of spec label Jun 1, 2021