
cache.WaitForCacheSync may never exit on shutdown #41

Closed
cmacknz opened this issue Feb 2, 2023 · 15 comments
Labels
bug, Team:Cloudnative-Monitoring, Team:Elastic-Agent, v8.6.0

Comments

cmacknz (Member) commented Feb 2, 2023

This was discovered as the root cause of intermittent failures in the ECK operator integration tests for Fleet. See elastic/cloud-on-k8s#6331 (comment) for the logs we get when this happens. Diagnostics including a goroutine dump are attached.

Contents of the agent's state.yaml when this happens:

components: []
log_level: info
message: context canceled
state: 4

fleet-server-deadlock-diagnostics.tar.gz

The coordinator seems to be stuck at:

1 @ 0x559a9083e256 0x559a9080c1cc 0x559a9080bbf8 0x559a90ffd2e5 0x559a90ff9c05 0x559a91cb4617 0x559a9086ec61
#	0x559a90ffd2e4	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).runner+0x1084	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go:568
#	0x559a90ff9c04	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).Run+0x164		github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go:399
#	0x559a91cb4616	github.com/elastic/elastic-agent/internal/pkg/agent/cmd.run.func6+0x36						github.com/elastic/elastic-agent/internal/pkg/agent/cmd/run.go:220

This is the relevant code block:
https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/agent/application/coordinator/coordinator.go#L552-L568

cmacknz added the bug, Team:Elastic-Agent, and v8.6.0 labels on Feb 2, 2023
cmacknz (Member Author) commented Feb 2, 2023

I have a suspicion that we are waiting for the Kubernetes provider to exit:


1 @ 0x559a9083e256 0x559a9084f037 0x559a9084f011 0x559a9086a925 0x559a90886932 0x559a9103fdc8 0x559a90ffd649 0x559a9086ec61
#	0x559a9086a924	sync.runtime_Semacquire+0x24											runtime/sema.go:56
#	0x559a90886931	sync.(*WaitGroup).Wait+0x51											sync/waitgroup.go:136
#	0x559a9103fdc7	github.com/elastic/elastic-agent/internal/pkg/composable.(*controller).Run+0xa87				github.com/elastic/elastic-agent/internal/pkg/composable/controller.go:177
#	0x559a90ffd648	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).runner.func3+0x48	github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go:556

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/controller.go#L162-L178

1 @ 0x559a9083e256 0x559a9084df52 0x559a919b0d25 0x559a919b0bda 0x559a919b0ae9 0x559a919b0a1c 0x559a91baa2ee 0x559a91bd25e5 0x559a91bdd30f 0x559a91bd9e7e 0x559a91bd967c 0x559a910403ea 0x559a9086ec61
#	0x559a919b0d24	k8s.io/apimachinery/pkg/util/wait.WaitForWithContext+0xe4								k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:658
#	0x559a919b0bd9	k8s.io/apimachinery/pkg/util/wait.poll+0x99										k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:594
#	0x559a919b0ae8	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext+0x48							k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:545
#	0x559a919b0a1b	k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil+0x7b								k8s.io/apimachinery@v0.23.4/pkg/util/wait/wait.go:536
#	0x559a91baa2ed	k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d									k8s.io/client-go@v0.23.4/tools/cache/shared_informer.go:255
#	0x559a91bd25e4	github.com/elastic/elastic-agent-autodiscover/kubernetes.(*watcher).Start+0x164						github.com/elastic/elastic-agent-autodiscover@v0.2.1/kubernetes/watcher.go:185
#	0x559a91bdd30e	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*pod).Start+0x2e				github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/pod.go:123
#	0x559a91bd9e7d	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).watchResource+0x4dd	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:141
#	0x559a91bd967b	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).Run+0xfb		github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:66
#	0x559a910403e9	github.com/elastic/elastic-agent/internal/pkg/composable.(*controller).Run.func2+0xa9					github.com/elastic/elastic-agent/internal/pkg/composable/controller.go:141

That stack trace leads us into this function in the autodiscover library:

    if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
        return fmt.Errorf("kubernetes informer unable to sync cache")
    }
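
For reference, cache.WaitForCacheSync returns only when every HasSynced function reports true or when the stop channel closes. A simplified model of that behavior (a sketch, not the real client-go implementation) shows why an informer that never syncs combined with a stop channel that is never closed blocks forever:

    import "time"

    // Simplified model of cache.WaitForCacheSync (a sketch, not the real
    // client-go implementation): poll every HasSynced function until they all
    // return true, or give up when stopCh is closed. If stopCh never closes
    // and the informer never syncs, this loop never exits.
    func waitForCacheSyncSketch(stopCh <-chan struct{}, cacheSyncs ...func() bool) bool {
        ticker := time.NewTicker(100 * time.Millisecond)
        defer ticker.Stop()
        for {
            select {
            case <-stopCh:
                return false // shutdown requested before the caches synced
            case <-ticker.C:
                synced := true
                for _, hasSynced := range cacheSyncs {
                    if !hasSynced() {
                        synced = false
                        break
                    }
                }
                if synced {
                    return true
                }
            }
        }
    }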

cmacknz (Member Author) commented Feb 2, 2023

Looking at the Kubernetes provider stack trace above:

#	0x559a91baa2ed	k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d									k8s.io/client-go@v0.23.4/tools/cache/shared_informer.go:255
#	0x559a91bd25e4	github.com/elastic/elastic-agent-autodiscover/kubernetes.(*watcher).Start+0x164						github.com/elastic/elastic-agent-autodiscover@v0.2.1/kubernetes/watcher.go:185
#	0x559a91bdd30e	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*pod).Start+0x2e				github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/pod.go:123
#	0x559a91bd9e7d	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).watchResource+0x4dd	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:141
#	0x559a91bd967b	github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes.(*dynamicProvider).Run+0xfb		github.com/elastic/elastic-agent/internal/pkg/composable/providers/kubernetes/kubernetes.go:66
#	0x559a910403e9	github.com/elastic/elastic-agent/internal/pkg/composable.(*controller).Run.func2+0xa9					github.com/elastic/elastic-agent/internal/pkg/composable/controller.go:141

I think we might be losing the context; specifically, I don't see us selecting on or checking the Done channel of the context embedded in the composable.DynamicProviderComm struct.

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/dynamic.go#L16-L18

This entire section creates a K8S client and eventually starts a Pod eventer, but I don't see the context being used, or anything else that would tie the lifetime of requests or goroutines to the parent context.Context:

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/providers/kubernetes/kubernetes.go#L104-L144

We eventually call the pod watcher's Start method, which appears in the stack trace:

#	0x559a91bd25e4	github.com/elastic/elastic-agent-autodiscover/kubernetes.(*watcher).Start+0x164						github.com/elastic/elastic-agent-autodiscover@v0.2.1/kubernetes/watcher.go:185

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/providers/kubernetes/pod.go#L120-L123

I don't see the node watcher being tied to the parent context anywhere:

https://github.com/elastic/elastic-agent/blob/973af90d85dd81aaccfd42a1f81e7ad60f6780db/internal/pkg/composable/providers/kubernetes/pod.go#L69-L78

I suspect that when the agent tries to exit and cancels the context of the dynamic variable provider here, the cancellation never propagates into the Pod watch object started here.
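
For illustration only (placeholder names, not the actual provider code), the missing wiring would look roughly like this: derive the watchers' context from the comm the provider receives, so the coordinator's cancellation propagates down to them.

    import "context"

    // Illustrative sketch only, not the real kubernetes provider code: comm
    // carries a context.Context, so deriving the watchers' context from it
    // means the coordinator's cancel propagates all the way down.
    func runProviderSketch(comm context.Context, startWatchers func(ctx context.Context) error) error {
        ctx, cancel := context.WithCancel(comm)
        defer cancel()

        if err := startWatchers(ctx); err != nil {
            return err
        }
        <-comm.Done() // unblocks as soon as the coordinator shuts down
        return comm.Err()
    }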

@ChrsMark or @MichaelKatsoulis, I see both of your names in the commit log in this area. Does this seem like a plausible explanation for why the K8S provider would deadlock the agent shutdown sequence? Any ideas on the best possible fix?

cmacknz (Member Author) commented Feb 2, 2023

It seems like the Pod watcher has a context internally and we just don't set it:

    // Start watching pods
    func (w *watcher) Start() error {
        go w.informer.Run(w.ctx.Done())

        if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
            return fmt.Errorf("kubernetes informer unable to sync cache")
        }

Yep we are using context.TODO() in the autodiscover package.

    ctx, cancel := context.WithCancel(context.TODO())
    w := &watcher{
        client:   client,
        informer: informer,
        store:    store,
        queue:    queue,
        ctx:      ctx,

I believe we need to expose the ability to pass in the ctx here when creating the watcher, and then update every watcher to use it (not just the Pod watcher).
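
A sketch of one possible shape for that change, with assumed names and a heavily reduced constructor (the real autodiscover watcher takes more arguments and holds more state than this):

    import (
        "context"
        "fmt"

        "k8s.io/client-go/tools/cache"
    )

    // watcherSketch is a cut-down stand-in for the autodiscover watcher. The
    // only point being illustrated is that its internal ctx is derived from a
    // caller-supplied parent instead of context.TODO().
    type watcherSketch struct {
        informer cache.SharedInformer
        ctx      context.Context
        cancel   context.CancelFunc
    }

    func newWatcherSketch(parent context.Context, informer cache.SharedInformer) *watcherSketch {
        ctx, cancel := context.WithCancel(parent)
        return &watcherSketch{informer: informer, ctx: ctx, cancel: cancel}
    }

    func (w *watcherSketch) Start() error {
        go w.informer.Run(w.ctx.Done())
        // With a real parent context, cancelling it closes Done() and
        // guarantees this call returns instead of waiting forever.
        if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
            return fmt.Errorf("kubernetes informer unable to sync cache")
        }
        return nil
    }

    func (w *watcherSketch) Stop() {
        w.cancel()
    }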

cmacknz added the Team:Cloudnative-Monitoring label on Feb 2, 2023
ChrsMark (Member) commented Feb 6, 2023

Thanks for all the research here @cmacknz! I had a quick look into this and the change you suggest looks valid in general.

However, I was wondering: how is it possible for a goroutine (provider/watcher, etc.) to make the whole program get stuck? In general, a provider should not be waited on when the controller/Agent wants to exit. Or is this behavior expected based on the Agent's architecture?

cmacknz (Member Author) commented Feb 6, 2023

Or is this behavior expected based on the Agent's architecture?

Yes, in that the agent explicitly waits for the providers to exit gracefully in this sequence; note in particular that it is waiting on the varsErr channel to get the exit code:

https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L554-L568
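
Roughly, the shutdown dependency looks like this (a simplified sketch, not the coordinator's actual code; the other managers' channels are omitted):

    import "context"

    // Simplified sketch of the shutdown dependency described above (not the
    // coordinator's actual code): cancel the shared context, then block until
    // the vars manager (the composable controller running the providers)
    // returns. If a provider never exits, this receive never completes.
    func shutdownSketch(cancel context.CancelFunc, varsErr <-chan error) error {
        cancel()         // ask every manager, including the providers, to stop
        return <-varsErr // blocks until the composable controller's Run returns
    }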

However, I was wondering: how is it possible for a goroutine (provider/watcher, etc.) to make the whole program get stuck?

The decision will ultimately come down to how much the agent should trust the providers. The agent doesn't trust inputs or integrations to behave reasonably at all, as they are external to the agent, and takes the approach of force-terminating them after 30 seconds.

Since the providers are built directly into the agent itself, I think we should trust them to behave reasonably and assume they respect cancellation signals. In general it is a best practice (if not an actual necessity) to have the lifetime of each goroutine and network request tied to a cancellation signal, either for teardown or timeouts.
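
As a generic illustration of that practice (made-up names, not agent code): a goroutine whose lifetime is bounded by its caller's context can never block shutdown indefinitely.

    import (
        "context"
        "time"
    )

    // startWorker runs periodic work until the caller's context is cancelled.
    // Because the goroutine always selects on ctx.Done(), cancelling the
    // context is enough to make it exit, so shutdown can never hang on it.
    func startWorker(ctx context.Context, interval time.Duration, work func()) {
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for {
                select {
                case <-ctx.Done():
                    return
                case <-ticker.C:
                    work()
                }
            }
        }()
    }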

We could equivalently fix this deadlock by making the agent give up waiting for providers after a set period of time, but I would worry that this just hides some other problem. For example, a provider might have API, session, or file resources it should clean up on shutdown instead of simply abandoning them. I currently strongly prefer fixing the Kubernetes provider's cancellation mechanism instead of ignoring the problem, to ensure we aren't hiding another lower-level issue or allowing some future provider to misbehave by not requiring a clean shutdown.

ChrsMark (Member) commented Feb 8, 2023

Thanks for the explanation @cmacknz! I agree we need to ensure a graceful stop of the kubernetes provider.

ChrsMark (Member) commented Feb 8, 2023

It seems like the Pod watcher has a context internally and we just don't set it:

    // Start watching pods
    func (w *watcher) Start() error {
        go w.informer.Run(w.ctx.Done())

        if !cache.WaitForCacheSync(w.ctx.Done(), w.informer.HasSynced) {
            return fmt.Errorf("kubernetes informer unable to sync cache")
        }

Yep we are using context.TODO() in the autodiscover package.

    ctx, cancel := context.WithCancel(context.TODO())
    w := &watcher{
        client:   client,
        informer: informer,
        store:    store,
        queue:    queue,
        ctx:      ctx,

But we still properly close the channel by calling cancel()/stop() when the watcher is stopped, no?

At https://github.com/elastic/elastic-agent/blob/e7419ddc4e492e7e41acd20ec39f570937affe13/internal/pkg/composable/providers/kubernetes/kubernetes.go#L94 all the eventers, and hence their watchers, will be stopped. And when a watcher is stopped, cancel/stop is called, so I expect the informer is stopped too 🤔

@cmacknz did you manage to reproduce it? Could you share the steps? I would like to troubleshoot it further.

cmacknz (Member Author) commented Feb 8, 2023

The ECK team originally caught this in their integration tests. It was not 100% reproducible. I would get in touch with them; the easiest way is probably to just follow up in the issue on the ECK side where we were originally diagnosing this: elastic/cloud-on-k8s#6331 (comment)

cmacknz (Member Author) commented Feb 8, 2023

We eventually end up stuck in cache.WaitForCacheSync so perhaps the problem is there specifically.

0x559a91baa2ed k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d

ChrsMark (Member) commented Feb 9, 2023

We eventually end up stuck in cache.WaitForCacheSync so perhaps the problem is there specifically.

0x559a91baa2ed k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d

That looks weird. I cannot see why WaitForCacheSync would stall exactly when the Agent/provider/watcher is about to be closed.
This function is supposed to just return after the caches are populated. Source code: WaitForCacheSync
After that, the watcher actually starts the watch process at the w.logger.Debugf("cache sync done") line.
If WaitForCacheSync were stalled, the watcher wouldn't do the watch process at all, meaning no discovered Pods. Is this true for their use case?
I'm not sure how this result happened, to be honest, but if WaitForCacheSync is stalled I think autodiscovery is not working at all, and it is not a matter of the close/stop process.
@gizas do you have any other ideas/thoughts here?

gizas (Contributor) commented Feb 24, 2023

I tried to follow the flow from the beginning so I might repeat things said above (sorry in advance):

  1. application.go starts the coordinator and composable (https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/application.go#L125-L130).
    The composable package includes all the providers.
  2. The coordinator starts channels for the 3 config managers. Composable is the VarsManager. So we do indeed start a channel and pass it a ctx created with WithCancel: https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L556
  3. On the composable side we first call https://github.com/elastic/elastic-agent/blob/main/internal/pkg/composable/controller.go#L49.
    This New function creates the controller with all the providers, but I don't see contexts from the coordinator (or vice versa) being exchanged here. I think this is where we miss something.

When we want to shut down the agent, this means the coordinator should cancel a context and pass that to composable, doesn't it? But we don't have this common context now.
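
For illustration (placeholder names only, not the real agent types), the chain described above would propagate cancellation like this:

    import (
        "context"
        "fmt"
        "time"
    )

    // Schematic of the context chain discussed above, with placeholder names:
    // the coordinator owns a cancellable context, the composable controller
    // derives a per-provider localCtx from it, and the provider stops when
    // that derived context is done.
    func providerSketch(comm context.Context) {
        <-comm.Done() // the provider's only stop signal
        fmt.Println("provider stopped:", comm.Err())
    }

    func runChainSketch() {
        ctx, cancel := context.WithCancel(context.Background()) // coordinator level
        localCtx, localCancel := context.WithCancel(ctx)        // controller level
        defer localCancel()

        go providerSketch(localCtx)

        cancel() // coordinator shutdown; propagates down through localCtx
        time.Sleep(100 * time.Millisecond) // give the provider time to observe it
    }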

0x559a91baa2ed k8s.io/client-go/tools/cache.WaitForCacheSync+0x8d

If this is stalled, I agree that autodiscovery will completely fail. The only case I can think of is that we try to close this thread while it is still running.

gizas (Contributor) commented Feb 28, 2023

@cmacknz I will refine my answer above with the following flow:

The coordinator runs the provider and passes ctx -----> we create another localCtx context (based on the previous ctx) which is passed to the dynamic provider -----> e.g. for the kubernetes.go provider, we will stop it only when comm.Done (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/composable/providers/kubernetes/kubernetes.go#L92) is received.

So I am not sure whether we send such a signal via https://github.com/elastic/elastic-agent/blob/fdd1465eb0551eed52130810e52859658b5367f9/internal/pkg/agent/application/coordinator/coordinator.go#L557

We cannot replicate it at the moment. Do you have any ideas on how we can replicate such a scenario in Kubernetes? Could stopping the agent somehow trigger it?

cmacknz (Member Author) commented Feb 28, 2023

The ECK team has integration tests that hit this so often they had to disable the tests; see elastic/cloud-on-k8s#6331 (comment) for an example. I would ask the ECK team how to set up their test environment; you can probably just reply to that issue.

The stack trace here isn't lying to us about where the agent is stuck, although it might not be exactly obvious why. My suspicion is still the same as in #41: I can see that elastic-agent-autodiscover's watcher.go is using a placeholder context that was not passed in by the caller, and therefore can never be cancelled or time out.

cmacknz transferred this issue from elastic/elastic-agent on Mar 21, 2023
cmacknz changed the title from "The agent can deadlock on shutdown when running on Kubernetes" to "Remove use of context.TODO() to ensure all requests can be cancelled" on Mar 21, 2023
cmacknz (Member Author) commented Mar 21, 2023

In elastic/elastic-agent#2352 we changed the agent so that it no longer waits forever for the dynamic providers to exit.
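
At a sketch level, that kind of bounded wait looks like this (illustrative names, not the actual code from that PR):

    import (
        "errors"
        "time"
    )

    // waitWithTimeout illustrates a bounded wait on the providers' exit: take
    // whatever they report if they stop in time, otherwise abandon them after
    // the deadline so shutdown can proceed. Names here are placeholders.
    func waitWithTimeout(varsErr <-chan error, timeout time.Duration) error {
        select {
        case err := <-varsErr:
            return err
        case <-time.After(timeout):
            return errors.New("timed out waiting for the providers to stop")
        }
    }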

I've moved this into the autodiscover repository and reworded the description to specify that this was specifically about cache.WaitForCacheSync not being cancelled on shutdown.

cmacknz changed the title from "Remove use of context.TODO() to ensure all requests can be cancelled" to "cache.WaitForCacheSync may never exit on shutdown" on Mar 21, 2023
gizas (Contributor) commented Jan 12, 2024

Closing this as per #41 (comment)

gizas closed this as completed on Jan 12, 2024