Use shared informer for await logic for deployments #1639
Conversation
Does the PR have any schema changes? Looking good! No breaking changes found.
A couple of fundamental questions before commenting on details:

Kubernetes is pretty well optimized for event handling, and that's generally the recommended approach for clients. I'd prefer to abstract on the client side if we decide to change to a polling approach for await logic; i.e., continue subscribing to events, and have the await logic poll a cache if necessary.

I'm less concerned about Kubernetes not being able to provide events, but rather about the consequences of a lost/delayed/wrong-sequenced/duplicate event throwing off our client logic and potentially leading to stuck updates, while being hard to test. Is polling bad in terms of performance? Or what are some other concerns about it?
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		informChan <- watch.Event{
			Object: obj.(*unstructured.Unstructured),
We should probably do a defensive type assertion here to avoid panics in case the object is the wrong type somehow.
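For illustration, a minimal sketch of that defensive assertion, reusing the informer and informChan names from the diff above (the wrapper function name is hypothetical):

package await

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

// addDeploymentHandler shows the comma-ok form of the assertion: an
// object of an unexpected type is skipped instead of panicking.
func addDeploymentHandler(informer cache.SharedIndexInformer, informChan chan<- watch.Event) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			unst, ok := obj.(*unstructured.Unstructured)
			if !ok {
				return // drop objects of an unexpected type
			}
			informChan <- watch.Event{Type: watch.Added, Object: unst}
		},
	})
}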
@@ -63,7 +53,16 @@ import (
	"k8s.io/client-go/tools/clientcmd"
	clientapi "k8s.io/client-go/tools/clientcmd/api"
	k8sopenapi "k8s.io/kubectl/pkg/util/openapi"
+	"net/http"
nit: We've been using goimports to format the k8s provider code. It looks like we may be using different toolchains, so we should pick one and standardize on it.
I default to gofmt and assumed make lint would catch it. Sounds good, I will run goimports.
provider/pkg/await/deployment.go
		return
	}

	// Start over, prove that rollout is complete.
	dia.deploymentErrors = map[string]string{}

	// Do nothing if this is not the Deployment we're waiting for.
-	if deployment.GetName() != inputDeploymentName {
+	if deployment.GetName() != inputDeploymentName || deployment.GetNamespace() != dia.config.currentInputs.GetNamespace() {
Since the informer is already filtering by namespace, is this necessary?
Yup - makes sense. Will remove.
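With the namespace comparison removed, the guard would reduce to a name check alone, since the filtered factory only delivers objects from the watched namespace (a sketch of the resulting code):

// Do nothing if this is not the Deployment we're waiting for. The
// informer factory is already scoped to the namespace, so comparing
// names is sufficient.
if deployment.GetName() != inputDeploymentName {
	return
}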
provider/pkg/await/await.go
@@ -215,6 +217,7 @@ func Creation(c CreateConfig) (*unstructured.Unstructured, error) {
	urn:               c.URN,
	initialAPIVersion: c.InitialAPIVersion,
	clientSet:         c.ClientSet,
+	informerFactory:   dynamicinformer.NewFilteredDynamicSharedInformerFactory(c.ClientSet.GenericClient, 5*time.Second, c.Inputs.GetNamespace(), nil),
I'm not sure yet if we want to set the refresh period on the informers, and if so, what the period should be. This will need some more testing/tuning.
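As a sketch of what that tuning might look like, the period could be pulled out into a named constant so it's easy to adjust once we settle on a value (the constant and the 60-second value here are illustrative, not a recommendation):

// informerResyncPeriod controls how often the informer re-lists as a
// fallback to the watch; 0 disables periodic resync entirely.
const informerResyncPeriod = 60 * time.Second

informerFactory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
	c.ClientSet.GenericClient, // dynamic client
	informerResyncPeriod,      // resync period under discussion
	c.Inputs.GetNamespace(),   // scope informers to the resource's namespace
	nil,                       // no additional list-option tweaks
)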
stopper := make(chan struct{})
defer close(stopper)

// Limit the lifetime of this to each deployment await for now. We can reduce this sharing further later.
Any particular reason for initializing a factory per await? Seems like we could share the factory across the provider without much difference in level of effort.
> Any particular reason for initializing a factory per await? Seems like we could share the factory across the provider without much difference in level of effort.

One thing was to set the lifetime of the factory itself (the stop channel to pass to the Start call in the next line). This way we know it's safe to kill the informer once an individual deployment has been waited on. A shared informer for the entire provider would need some degree of coordination to make sure all the deployments had completed, etc., which I didn't want to layer on yet.
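Roughly, the per-await lifetime described here looks like the following (a sketch, assuming informerFactory is the factory constructed earlier in the diff):

stopper := make(chan struct{})
// Closing the channel stops every informer started from this factory,
// so the informers live exactly as long as this await.
defer close(stopper)

informerFactory.Start(stopper)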
Ok. That sounds reasonable as a first step, but presumably doesn't buy us anything in terms of performance since we aren't sharing the cache between resources?
I'm not 100% sure that this is the case, but I believe the await logic should be idempotent. We already support resuming a previous update, which isn't guaranteed to include the full sequence of events. At least for Job and Pod, a single event is sufficient to declare that the resource is ready, so it's not sensitive to ordering/duplicates. We should verify that this is the case for all awaiters, but I'm pretty sure it is.
Yes, polling can cause performance problems for k8s clusters at scale. It's not that polling would not work for our use case, but that watch is preferred to polling in general for k8s.
I think this is a valid concern. That said, a watch with a reasonably spaced resync interval essentially mimics the poll behavior as a fallback. With this approach we are making it much harder to miss an event. To be clear, the change that we reverted wasn't wrong. We theorize it made the likelihood of hitting an API server-side throttle a bit higher, but we are still definitely hitting those throttling events regardless of that change. IMO this is a longstanding issue; I don't know why it is becoming more prevalent now. Perhaps newer API servers are more aggressive on throttling, or cloud providers have dialed these up? While I am still working through testing this, I have already seen the informer model handle throttling a lot better.
What is the failure mode when we hit throttling? Why does it lead to stuck updates as opposed to just slower updates? Any good places for me to read about this? Thank you for bearing with my noob questions!
Not a problem. The tight loop is just a straight-up bug: https://github.com/pulumi/pulumi-kubernetes/blob/master/provider/pkg/await/deployment.go#L314
@lblackstone @mikhailshilkov I think this is ready for another look. In my tests things seemed much more stable with this. |
provider/pkg/await/deployment.go
	Version:  "v1beta1",
	Resource: "deployments",
}, deploymentEvents)
go deploymentV1Beta1Informer.Informer().Run(stopper)
I'm not sure if we need Informers for the old apiVersions. I believe the watch clients were previously only using the latest apiVersions.
Ah, interesting. Does that mean we don't really support the v1beta1 etc. variants? It seems like we load the latest API versions when creating clients and use them to create watches. I can't really verify this, since all the cloud providers seem to have stopped supporting 1.15 or older (a bunch of these APIs were removed in 1.16).
Caught up with @lblackstone offline. It seems we should be safe here. Removed non-v1 informer variants.
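After that change, only the apps/v1 informer would be registered, along the lines of (variable names assumed, not the PR's exact identifiers):

deploymentInformer := informerFactory.ForResource(schema.GroupVersionResource{
	Group:    "apps",
	Version:  "v1",
	Resource: "deployments",
})
go deploymentInformer.Informer().Run(stopper)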
Builds on #1634.
Fixes #1628
Note: the current iteration launches an informer factory per deployment instead of having a provider-level one. This is still a pretty significant improvement, since all the ReplicaSets, Pods, and PVCs associated with a Deployment use the same informer factory (see the sketch below).
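For illustration, the sharing looks roughly like this: one filtered factory per await hands out informers for each related resource kind, so each GVR gets a single shared cache and watch connection (the helper names here are assumptions, not the PR's exact identifiers):

appsV1 := func(resource string) schema.GroupVersionResource {
	return schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: resource}
}
coreV1 := func(resource string) schema.GroupVersionResource {
	return schema.GroupVersionResource{Version: "v1", Resource: resource}
}

deployments := informerFactory.ForResource(appsV1("deployments"))
replicaSets := informerFactory.ForResource(appsV1("replicasets"))
pods := informerFactory.ForResource(coreV1("pods"))
pvcs := informerFactory.ForResource(coreV1("persistentvolumeclaims"))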
I ran several dozen updates with this across various scenarios. I didn't see any throttles (and if they occurred, they were handled gracefully).

We also get the behavior we want for #1502, since the informer feeds the initial read as an event to the respective channels and we don't miss Deployment/ReplicaSet scale-ups.

I will follow up with a separate PR for some additional unit tests to cover more scenarios, which might require a little bit of refactoring here.