
cache.WaitForCacheSync may never exit on shutdown #41

Closed
cmacknz opened this issue Feb 2, 2023 · 15 comments
Labels
bug, Team:Cloudnative-Monitoring, Team:Elastic-Agent, v8.6.0

Comments

cmacknz (Member) commented Feb 2, 2023

This was discovered as the root cause of intermittent failures in the ECK operator integration tests for Fleet. See elastic/cloud-on-k8s#6331 (comment) for the logs we get when this happens. Diagnostics including a goroutine dump are attached.

Contents of the agent's state.yaml when this happens:

components: []
log_level: info
message: context canceled
state: 4

fleet-server-deadlock-diagnostics.tar.gz

The coordinator seems to be stuck at:

1 @ 0x559a9083e256 0x559a9080c1cc 0x559a9080bbf8 0x559a90ffd2e5 0x559a90ff9c05 0x559a91cb4617 0x559a9086ec61
#	0x559a90ffd2e4	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).runner+0x1084	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go:568
#	0x559a90ff9c04	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).Run+0x164		github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go:399
#	0x559a91cb4616	github.com/elastic/elastic-agent/internal/pkg/agent/cmd.run.func6+0x36						github.com/elastic/elastic-agent/internal/pkg/agent/cmd/run.go:220

This is the relevant code block:
https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/agent/application/coordinator/coordinator.go#L552-L568

cmacknz added the bug, Team:Elastic-Agent, and v8.6.0 labels on Feb 2, 2023
cmacknz (Member Author) commented Feb 2, 2023

I have a suspicion that we are waiting for the Kubernetes provider to exit:


1 @ 0x559a9083e256 0x559a9084f037 0x559a9084f011 0x559a9086a925 0x559a90886932 0x559a9103fdc8 0x559a90ffd649 0x559a9086ec61
#	0x559a9086a924	sync.runtime_Semacquire+0x24											runtime/sema.go:56
#	0x559a90886931	sync.(*WaitGroup).Wait+0x51											sync/waitgroup.go:136
#	0x559a9103fdc7	github.com/elastic/elastic-agent/internal/pkg/composable.(*controller).Run+0xa87				github.com/elastic/elastic-agent/internal/pkg/composable/controller.go:177
#	0x559a90ffd648	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).runner.func3+0x48	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go:556

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/controller.go#L162-L178

1 @ 0x559a9083e256 0x559a9084df52 0x559a919b0d25 0x559a919b0bda 0x559a919b0ae9 0x559a919b0a1c 0x559a91baa2ee 0x559a91bd25e5 0x559a91bdd30f 0x559a91bd9e7e 0x559a91bd967c 0x559a910403ea 0x559a9086ec61
#	0x559a919b0d24	k8s.io/apimachinery/pkg/util/wait.WaitForWithContext+0xe4								k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:658
#	0x559a919b0bd9	k8s.io/apimachinery/pkg/util/wait.poll+0x99										k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:594
#	0x559a919b0ae8	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext+0x48							k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:545
#	0x559a919b0a1b	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil+0x7b								k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:536
#	0x559a91baa2ed	k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d									k8s.io/client-go@v0.23.4/tools/cache/shared_informer.go:255
#	0x559a91bd25e4	github.com/elastic/elastic-agent-autodiscover/kubernetes.(*watcher).Start+0x164						github.com/elastic/elastic-agent-autodiscover@v0.2.1/kubernetes/watcher.go:185
#	0x559a91bdd30e	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*pod).Start+0x2e				github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/pod.go:123
#	0x559a91bd9e7d	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).watchResource+0x4dd	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:141
#	0x559a91bd967b	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).Run+0xfb		github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:66
#	0x559a910403e9	github.com/elastic/elastic-agent/internal/pkg/composable.(*controller).Run.func2+0xa9					github.com/elastic/elastic-agent/internal/pkg/composable/controller.go:141

That stack trace leads us into this function in the autodiscover library:

    if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
        return fmt.Errorf("kubernetes informer unable to sync cache")
    }
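
For reference, cache.WaitForCacheSync returns only when every HasSynced function reports true or when the stop channel closes. A simplified model of that behavior (a sketch, not the real client-go implementation) shows why an informer that never syncs combined with a stop channel that is never closed blocks forever:

    import "time"

    // Simplified model of cache.WaitForCacheSync (a sketch, not the real
    // client-go implementation): poll every HasSynced function until they all
    // return true, or give up when stopCh is closed. If stopCh never closes
    // and the informer never syncs, this loop never exits.
    func waitForCacheSyncSketch(stopCh <-chan struct{}, cacheSyncs ...func() bool) bool {
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()
        for {
            select {
            case <-stopCh:
                return false // shutdown requested before the caches synced
            case <-ticker.C:
                synced := true
                for _, hasSynced := range cacheSyncs {
                    if !hasSynced() {
                        synced = false
                        break
                    }
                }
                if synced {
                    return true
                }
            }
        }
    }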

cmacknz (Member Author) commented Feb 2, 2023

Looking at the Kubernetes provider stack trace above:

#	0x559a91baa2ed	k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d									k8s.io/client-go@v0.23.4/tools/cache/shared_informer.go:255
#	0x559a91bd25e4	github.com/elastic/elastic-agent-autodiscover/kubernetes.(*watcher).Start+0x164						github.com/elastic/elastic-agent-autodiscover@v0.2.1/kubernetes/watcher.go:185
#	0x559a91bdd30e	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*pod).Start+0x2e				github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/pod.go:123
#	0x559a91bd9e7d	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).watchResource+0x4dd	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:141
#	0x559a91bd967b	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).Run+0xfb		github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:66
#	0x559a910403e9	github.com/elastic/elastic-agent/internal/pkg/composable.(*controller).Run.func2+0xa9					github.com/elastic/elastic-agent/internal/pkg/composable/controller.go:141

I think we might be losing the context; specifically, I don't see us selecting on or checking the Done channel of the context embedded in the composable.DynamicProviderComm struct.

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/dynamic.go#L16-L18

This entire section creates a K8S client and eventually starts a Pod eventer, but I don't see the context being used, or anything else that would tie the lifetime of requests or goroutines to the parent context.Context:

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/providers/kubernetes/kubernetes.go#L104-L144

We eventually call the pod watcher's Start method, which appears in the stack trace:

#	0x559a91bd25e4	github.com/elastic/elastic-agent-autodiscover/kubernetes.(*watcher).Start+0x164						github.com/elastic/elastic-agent-autodiscover@v0.2.1/kubernetes/watcher.go:185

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/providers/kubernetes/pod.go#L120-L123

I don't see the node watcher being tied to the parent context anywhere:

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/providers/kubernetes/pod.go#L69-L78

I suspect that when the agent tries to exit and cancels the context of the dynamic variable provider here, the cancellation never propagates into the Pod watch object started here.
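
For illustration only (placeholder names, not the actual provider code), the missing wiring would look roughly like this: derive the watchers' context from the comm the provider receives, so the coordinator's cancellation propagates down to them.

    import "context"

    // Illustrative sketch only, not the real kubernetes provider code: comm
    // carries a context.Context, so deriving the watchers' context from it
    // means the coordinator's cancel propagates all the way down.
    func runProviderSketch(comm context.Context, startWatchers func(ctx context.Context) error) error {
        ctx, cancel := context.WithCancel(comm)
        defer cancel()

        if err := startWatchers(ctx); err != nil {
            return err
        }
        <-comm.Done() // unblocks as soon as the coordinator shuts down
        return comm.Err()
    }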

@ChrsMark or @MichaelKatsoulis, I see both of your names in the commit log in this area. Does this seem like a plausible explanation for why the K8S provider would deadlock the agent shutdown sequence? Any ideas on the best possible fix?

cmacknz (Member Author) commented Feb 2, 2023

It seems like the Pod watcher has a context internally and we just don't set it:

    // Start watching pods
    func (w *watcher) Start() error {
        go w.informer.Run(w.ctx.Done())

        if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
            return fmt.Errorf("kubernetes informer unable to sync cache")
        }

Yep we are using context.TODO() in the autodiscover package.

    ctx, cancel := context.WithCancel(context.TODO())
    w := &watcher{
        client:   client,
        informer: informer,
        store:    store,
        queue:    queue,
        ctx:      ctx,

I believe we need to expose the ability to pass in the ctx here when creating the watcher, and then update every watcher to use it (not just the Pod watcher).
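
A sketch of one possible shape for that change, with assumed names and a heavily reduced constructor (the real autodiscover watcher takes more arguments and holds more state than this):

    import (
        "context"
        "fmt"

        "k8s.io/client-go/tools/cache"
    )

    // watcherSketch is a cut-down stand-in for the autodiscover watcher. The
    // only point being illustrated is that its internal ctx is derived from a
    // caller-supplied parent instead of context.TODO().
    type watcherSketch struct {
        informer cache.SharedInformer
        ctx      context.Context
        cancel   context.CancelFunc
    }

    func newWatcherSketch(parent context.Context, informer cache.SharedInformer) *watcherSketch {
        ctx, cancel := context.WithCancel(parent)
        return &watcherSketch{informer: informer, ctx: ctx, cancel: cancel}
    }

    func (w *watcherSketch) Start() error {
        go w.informer.Run(w.ctx.Done())
        // With a real parent context, cancelling it closes Done() and
        // guarantees this call returns instead of waiting forever.
        if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
            return fmt.Errorf("kubernetes informer unable to sync cache")
        }
        return nil
    }

    func (w *watcherSketch) Stop() {
        w.cancel()
    }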

cmacknz added the Team:Cloudnative-Monitoring label on Feb 2, 2023
ChrsMark (Member) commented Feb 6, 2023

Thanks for all the research here @cmacknz! I had a quick look into this and the change you suggest looks valid in general.

However, I was wondering: how is it possible for a goroutine (provider/watcher, etc.) to make the whole program get stuck? In general, a provider should not be waited on when the controller/Agent wants to exit. Or is this behavior expected based on the Agent's architecture?

cmacknz (Member Author) commented Feb 6, 2023

Or is this behavior expected based on the Agent's architecture?

Yes, in that the agent explicitly waits for the providers to exit gracefully in this sequence; note in particular that it is waiting on the varsErr channel to get the exit code:

https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L554-L568
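
Roughly, the shutdown dependency looks like this (a simplified sketch, not the coordinator's actual code; the other managers' channels are omitted):

    import "context"

    // Simplified sketch of the shutdown dependency described above (not the
    // coordinator's actual code): cancel the shared context, then block until
    // the vars manager (the composable controller running the providers)
    // returns. If a provider never exits, this receive never completes.
    func shutdownSketch(cancel context.CancelFunc, varsErr <-chan error) error {
        cancel()         // ask every manager, including the providers, to stop
        return <-varsErr // blocks until the composable controller's Run returns
    }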

However, I was wondering: how is it possible for a goroutine (provider/watcher, etc.) to make the whole program get stuck?

The decision will ultimately come down to how much the agent should trust the providers. The agent doesn't trust inputs or integrations to behave reasonably at all, as they are external to the agent, and takes the approach of force-terminating them after 30 seconds.

Since the providers are built directly into the agent itself, I think we should trust them to behave reasonably and assume they respect cancellation signals. In general it is a best practice (if not an actual necessity) to have the lifetime of each goroutine and network request tied to a cancellation signal, either for teardown or timeouts.
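
As a generic illustration of that practice (made-up names, not agent code): a goroutine whose lifetime is bounded by its caller's context can never block shutdown indefinitely.

    import (
        "context"
        "time"
    )

    // startWorker runs periodic work until the caller's context is cancelled.
    // Because the goroutine always selects on ctx.Done(), cancelling the
    // context is enough to make it exit, so shutdown can never hang on it.
    func startWorker(ctx context.Context, interval time.Duration, work func()) {
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for {
                select {
                case <-ctx.Done():
                    return
                case <-ticker.C:
                    work()
                }
            }
        }()
    }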

We could equivalently fix this deadlock by making the agent give up waiting for providers after a set period of time, but I would worry that this just hides some other problem. For example, a provider might have API, session, or file resources it should clean up on shutdown instead of simply abandoning them. I currently strongly prefer fixing the Kubernetes provider's cancellation mechanism instead of ignoring the problem, to ensure we aren't hiding another lower-level issue or allowing some future provider to misbehave by not requiring a clean shutdown.

ChrsMark (Member) commented Feb 8, 2023

Thanks for the explanation @cmacknz! I agree we need to ensure a graceful stop of the kubernetes provider.

ChrsMark (Member) commented Feb 8, 2023

It seems like the Pod watcher has a context internally and we just don't set it:

    // Start watching pods
    func (w *watcher) Start() error {
        go w.informer.Run(w.ctx.Done())

        if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
            return fmt.Errorf("kubernetes informer unable to sync cache")
        }

Yep we are using context.TODO() in the autodiscover package.

    ctx, cancel := context.WithCancel(context.TODO())
    w := &watcher{
        client:   client,
        informer: informer,
        store:    store,
        queue:    queue,
        ctx:      ctx,

But we still properly close the channel by calling cancel()/stop() when the watcher is stopped, no?

At https://github.com/elastic/elastic-agent/blob/e7419ddc4e492e7e41acd20ec39f570937affe13/internal/pkg/composable/providers/kubernetes/kubernetes.go#L94 all the eventers, and hence their watchers, will be stopped. And when a watcher is stopped, cancel/stop is called, so I expect the informer is stopped too 🤔

@cmacknz did you manage to reproduce it? Could you share the steps? I would like to troubleshoot it further.

cmacknz (Member Author) commented Feb 8, 2023

The ECK team originally caught this in their integration tests. It was not 100% reproducible. I would get in touch with them; the easiest way is probably to just follow up in the issue on the ECK side where we were originally diagnosing this: elastic/cloud-on-k8s#6331 (comment)

cmacknz (Member Author) commented Feb 8, 2023

We eventually end up stuck in cache.WaitForCacheSync so perhaps the problem is there specifically.

0x559a91baa2ed k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d

ChrsMark (Member) commented Feb 9, 2023

We eventually end up stuck in cache.WaitForCacheSync so perhaps the problem is there specifically.

0x559a91baa2ed k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d

That looks weird. I cannot see why WaitForCacheSync would stall exactly when the Agent/provider/watcher is about to be closed.
This function is supposed to just return after the caches are populated. Source code: WaitForCacheSync
After that, the watcher actually starts the watch process at the w.logger.Debugf("cache sync done") line.
If WaitForCacheSync were stalled, the watcher wouldn't do the watch process at all, meaning no discovered Pods. Is this true for their use case?
I'm not sure how this result happened, to be honest, but if WaitForCacheSync is stalled I think autodiscovery is not working at all, and it is not a matter of the close/stop process.
@gizas do you have any other ideas/thoughts here?

gizas (Contributor) commented Feb 24, 2023

I tried to follow the flow from the beginning so I might repeat things said above (sorry in advance):

  1. application.go starts the coordinator and composable (https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/application.go#L125-L130).
    The composable package includes all the providers.
  2. The coordinator starts channels for the 3 config managers. Composable is the VarsManager. So we do indeed start a channel and pass it a ctx created with WithCancel: https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L556
  3. On the composable side we first call https://github.com/elastic/elastic-agent/blob/main/internal/pkg/composable/controller.go#L49.
    This New function creates the controller with all the providers, but I don't see contexts from the coordinator (or vice versa) being exchanged here. I think this is where we miss something.

When we want to shut down the agent, this means the coordinator should cancel a context and pass that to composable, doesn't it? But we don't have this common context now.
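
For illustration (placeholder names only, not the real agent types), the chain described above would propagate cancellation like this:

    import (
        "context"
        "fmt"
        "time"
    )

    // Schematic of the context chain discussed above, with placeholder names:
    // the coordinator owns a cancellable context, the composable controller
    // derives a per-provider localCtx from it, and the provider stops when
    // that derived context is done.
    func providerSketch(comm context.Context) {
        <-comm.Done() // the provider's only stop signal
        fmt.Println("provider stopped:", comm.Err())
    }

    func runChainSketch() {
        ctx, cancel := context.WithCancel(context.Background()) // coordinator level
        localCtx, localCancel := context.WithCancel(ctx)        // controller level
        defer localCancel()

        go providerSketch(localCtx)

        cancel() // coordinator shutdown; propagates down through localCtx
        time.Sleep(100 * time.Millisecond) // give the provider time to observe it
    }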

0x559a91baa2ed k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d

If this is stalled, I agree that autodiscovery will completely fail. The only case I can think of is that we try to close this thread while it is still running.

gizas (Contributor) commented Feb 28, 2023

@cmacknz I will refine my answer above with the following flow:

The coordinator runs the provider and passes ctx -----> we create another localCtx context (based on the previous ctx) which is passed to the dynamic provider -----> e.g. for the kubernetes.go provider, we will stop it only when comm.Done (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/composable/providers/kubernetes/kubernetes.go#L92) is received.

So I am not sure whether we send such a signal via https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L557

We cannot replicate it at the moment. Do you have any ideas on how we can replicate such a scenario in Kubernetes? Could stopping the agent somehow trigger it?

cmacknz (Member Author) commented Feb 28, 2023

The ECK team has integration tests that hit this so often they had to disable the tests; see elastic/cloud-on-k8s#6331 (comment) for an example. I would ask the ECK team how to set up their test environment; you can probably just reply to that issue.

The stack trace here isn't lying to us about where the agent is stuck, although it might not be exactly obvious why. My suspicion is still the same as in #41: I can see that elastic-agent-autodiscover's watcher.go is using a placeholder context that was not passed in by the caller, and therefore can never be cancelled or time out.

cmacknz transferred this issue from elastic/elastic-agent on Mar 21, 2023
cmacknz changed the title from "The agent can deadlock on shutdown when running on Kubernetes" to "Remove use of context.TODO() to ensure all requests can be cancelled" on Mar 21, 2023
cmacknz (Member Author) commented Mar 21, 2023

In elastic/elastic-agent#2352 we changed the agent so that it no longer waits forever for the dynamic providers to exit.
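
At a sketch level, that kind of bounded wait looks like this (illustrative names, not the actual code from that PR):

    import (
        "errors"
        "time"
    )

    // waitWithTimeout illustrates a bounded wait on the providers' exit: take
    // whatever they report if they stop in time, otherwise abandon them after
    // the deadline so shutdown can proceed. Names here are placeholders.
    func waitWithTimeout(varsErr <-chan error, timeout time.Duration) error {
        select {
        case err := <-varsErr:
            return err
        case <-time.After(timeout):
            return errors.New("timed out waiting for the providers to stop")
        }
    }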

I've moved this into the autodiscover repository and reworded the description to specify that this was specifically about cache.WaitForCacheSync not being cancelled on shutdown.

cmacknz changed the title from "Remove use of context.TODO() to ensure all requests can be cancelled" to "cache.WaitForCacheSync may never exit on shutdown" on Mar 21, 2023
gizas (Contributor) commented Jan 12, 2024

Closing this as per #41 (comment)

gizas closed this as completed on Jan 12, 2024