Panic in Event Broadcaster #1367

Closed
xrstf opened this issue Feb 3, 2021 · 11 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@xrstf
Contributor

xrstf commented Feb 3, 2021

While updating from controller-runtime 0.6.5 to 0.8.1, I noticed that our code suddenly began panicking:

E0203 15:15:41.247244 3880689 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)

goroutine 751 [running]:
vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1744ee0, 0x1ae6a00)
        /home/.../vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/.../vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1744ee0, 0x1ae6a00)
        /usr/local/go/src/runtime/panic.go:969 +0x175
vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
        /home/.../vendor/k8s.io/apimachinery/pkg/watch/mux.go:189
vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc000f83680, 0xc000afd480)
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:345 +0x14e
created by vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:341 +0x3bd
panic: send on closed channel [recovered]
        panic: send on closed channel

goroutine 751 [running]:
vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/.../vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1744ee0, 0x1ae6a00)
        /usr/local/go/src/runtime/panic.go:969 +0x175
vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
        /home/.../vendor/k8s.io/apimachinery/pkg/watch/mux.go:189
vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc000f83680, 0xc000afd480)
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:345 +0x14e
created by vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:341 +0x3bd

The offending code creates an Event after a reconcile loop ends in an error:

myRecorder := myManager.GetEventRecorderFor(myControllerName) //  this is done once when the controller is created

// later, in Reconcile(), I have this line:
myRecorder.Event(&myObject, corev1.EventTypeWarning, "ReconcilingError", err.Error())
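
For context on where the panic comes from: the recorder hands the event off asynchronously to the underlying watch.Broadcaster, whose incoming channel is closed once the broadcaster is shut down during manager shutdown. A minimal sketch of that mechanism, assuming the apimachinery v0.20.x Broadcaster (where Shutdown() closes the incoming channel):

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	// The event recorder writes to a broadcaster much like this one.
	b := watch.NewBroadcaster(25, watch.DropIfChannelFull)

	// Shutting the manager down eventually calls Shutdown(), which closes
	// the broadcaster's incoming channel.
	b.Shutdown()

	// Any recorder.Event(...) still in flight ends up here and panics with
	// "send on closed channel".
	b.Action(watch.Added, &corev1.Pod{})
}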
@xrstf
Contributor Author

xrstf commented Feb 3, 2021

Looking deeper into it, it seems this can be caused by our own controller. We dynamically start and stop Managers at runtime to react to kubeconfigs appearing and disappearing in the main cluster (for each of these kubeconfig Secrets, we reconcile a set of controllers).

To prevent each new controller instance from having to re-cache everything, we use the same cache for all instances, i.e. we override NewCache:

mgr, err := manager.New(r.masterKubeCfg, manager.Options{
	LeaderElection:     false,
	MetricsBindAddress: "0",
	// Avoid duplicating caches or client for main cluster, as it's static.
	NewCache: func(_ *rest.Config, _ cache.Options) (cache.Cache, error) {
		return prewarmedCache, nil
	},
})

Removing the custom NewCache function makes the panic go away.
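
For completeness, the prewarmedCache referenced above is presumably created and started once, along these lines (a sketch against the controller-runtime v0.8 cache API; newPrewarmedCache is just an illustrative helper name, not our actual code):

package sharedcache

import (
	"context"
	"errors"

	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newPrewarmedCache builds a cache for the main cluster, starts it in the
// background and waits for the initial sync, so that it can be handed to
// every dynamically created manager afterwards.
func newPrewarmedCache(ctx context.Context, cfg *rest.Config) (cache.Cache, error) {
	c, err := cache.New(cfg, cache.Options{})
	if err != nil {
		return nil, err
	}

	go func() {
		// Runs until ctx is cancelled.
		_ = c.Start(ctx)
	}()

	if !c.WaitForCacheSync(ctx) {
		return nil, errors.New("cache did not sync")
	}
	return c, nil
}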

@xrstf
Contributor Author

xrstf commented Feb 3, 2021

And going further: to prevent the shared cache from being started and stopped over and over, we wrapped it in this beauty:

// unstartableCache is used to prevent the ctrlruntime manager from starting the
// cache *again*, just after we started and initialized it.
type unstartableCache struct {
	cache.Cache
}

func (m *unstartableCache) Start(ctx context.Context) error {
	return nil
}

func (m *unstartableCache) WaitForCacheSync(ctx context.Context) bool {
	return true
}

Judging from the InformersMap's Start() function:

// Start calls Run on each of the informers and sets started to true.  Blocks on the context.
func (m *InformersMap) Start(ctx context.Context) error {
	go m.structured.Start(ctx)
	go m.unstructured.Start(ctx)
	go m.metadata.Start(ctx)
	<-ctx.Done()
	return nil
}

... it seems our wrapper is missing that blocking behavior and instead returns immediately. Patching the unstartableCache to have

func (m *unstartableCache) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

makes the panic go away. So I guess this is just me not having completely adjusted everything to the new context handling in controller-runtime 0.7+?
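
Putting both pieces together, the adjusted wrapper looks like this (same assumptions as above: the cache has already been started and synced elsewhere, so the manager only needs Start to block until its context is done):

package sharedcache

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// unstartableCache wraps an already-started cache so that the manager neither
// starts it a second time nor returns from Start immediately.
type unstartableCache struct {
	cache.Cache
}

// Start blocks until the manager's context is cancelled, mirroring what
// InformersMap.Start does.
func (m *unstartableCache) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// WaitForCacheSync reports true because the wrapped cache was synced before
// being handed to the manager.
func (m *unstartableCache) WaitForCacheSync(ctx context.Context) bool {
	return true
}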

@mvladev

mvladev commented Feb 3, 2021

I was just about to write an issue about this:

{"level":"info","ts":1612367875.5065799,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1612367875.506887,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1612367875.50695,"logger":"controller-runtime.manager.controller.pod","msg":"Starting EventSource","reconciler group":"","reconciler kind":"Pod","source":"kind source: /, Kind="}
{"level":"info","ts":1612367875.6113448,"logger":"controller-runtime.manager.controller.pod","msg":"Starting Controller","reconciler group":"","reconciler kind":"Pod"}
{"level":"info","ts":1612367875.611406,"logger":"controller-runtime.manager.controller.pod","msg":"Starting workers","reconciler group":"","reconciler kind":"Pod","worker count":1}
{"level":"info","ts":1612367875.611577,"logger":"controllers.pod-controller","msg":"reconciling","pod":"kube-system/kube-controller-manager-docker-desktop"}
{"level":"info","ts":1612367878.5104551,"logger":"controller-runtime.manager.controller.pod","msg":"Stopping workers","reconciler group":"","reconciler kind":"Pod"}
{"level":"info","ts":1612367879.614225,"logger":"controllers.pod-controller","msg":"reconciling","pod":"my-ns/my-pod"}
E0203 17:57:59.614346   36681 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)
goroutine 340 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x211b7e0, 0x24433e0)
         /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
         /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x211b7e0, 0x24433e0)
        /usr/local/opt/go/libexec/src/runtime/panic.go:969 +0x1b9
k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
         /go/src/k8s.io/apimachinery/pkg/watch/mux.go:189
k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc00072a000, 0xc000218500)
         /go/src/k8s.io/client-go/tools/record/event.go:344 +0xb7
created by k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
         /go/src/k8s.io/client-go/tools/record/event.go:341 +0x32f
panic: send on closed channel [recovered]
        panic: send on closed channel

goroutine 340 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
         /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x211b7e0, 0x24433e0)
        /usr/local/opt/go/libexec/src/runtime/panic.go:969 +0x1b9
k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
         /go/src/k8s.io/apimachinery/pkg/watch/mux.go:189
k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc00072a000, 0xc000218500)
         /go/src/k8s.io/client-go/tools/record/event.go:344 +0xb7
created by k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
         /go/src/k8s.io/client-go/tools/record/event.go:341 +0x32f
exit status 2
package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type podController struct {
	recorder record.EventRecorder
	client.Client
}

func (p *podController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ctrl.Log.WithName("controllers").WithName("pod-controller").WithValues("pod", req.NamespacedName).Info("reconciling")

	result := &corev1.Pod{}
	err := p.Get(ctx, req.NamespacedName, result)
	if err != nil {
		return reconcile.Result{}, err
	}

	// simulate some work
	time.Sleep(time.Second * 4)

	p.recorder.Event(result, "Warning", "some reason", "some message")
	return reconcile.Result{}, nil
}

func main() {
	ctrl.SetLogger(zap.New())

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection: false,
	})
	if err != nil {
		os.Exit(1)
	}

	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(&podController{
		Client:   mgr.GetClient(),
		recorder: mgr.GetEventRecorderFor("pod-controller"),
	}); err != nil {
		ctrl.Log.Error(err, "unable to add pod controller")
		os.Exit(1)
	}

	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		if err := mgr.Start(ctx); err != nil {
			ctrl.Log.Error(err, "unable to continue running manager")
			os.Exit(1)
		}
	}()

	// let it run for a couple of seconds so it can start reconciling
	time.Sleep(time.Second * 3)
	// stop the manager by cancelling its context
	cancel()
	// let it run for a couple more seconds
	time.Sleep(time.Second * 4)
}

The code above fails reliably. What's happening:

  • The controller manager, the Pod controller and the event recorder start
  • The Pod controller receives one Pod to reconcile
  • The reconcile operation is slowed down by artificial work (time.Sleep(time.Second * 4))
  • While that reconciliation happens, the controller manager is shut down
  • After the 4 seconds pass, the controller tries to send an event through the event Broadcaster, which causes the panic

In general, I think that client-go's recorder should accept a Context so it can be aware of such cancellations.
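
Until that happens, a wrapper on the caller side can at least narrow the window. A hedged sketch (stoppableRecorder is an illustrative name; the interface matches record.EventRecorder as of client-go v0.20, and this only shrinks the race rather than eliminating it, since the broadcaster can still be shut down between the check and the send):

package recorderwrap

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// stoppableRecorder drops events once the given context is cancelled instead
// of forwarding them to a possibly closed broadcaster.
type stoppableRecorder struct {
	ctx      context.Context
	delegate record.EventRecorder
}

func (r *stoppableRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	if r.ctx.Err() != nil {
		return // manager is shutting down, drop the event
	}
	r.delegate.Event(object, eventtype, reason, message)
}

func (r *stoppableRecorder) Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{}) {
	if r.ctx.Err() != nil {
		return
	}
	r.delegate.Eventf(object, eventtype, reason, messageFmt, args...)
}

func (r *stoppableRecorder) AnnotatedEventf(object runtime.Object, annotations map[string]string, eventtype, reason, messageFmt string, args ...interface{}) {
	if r.ctx.Err() != nil {
		return
	}
	r.delegate.AnnotatedEventf(object, annotations, eventtype, reason, messageFmt, args...)
}

In the reproducer above, the controller would then be wired up with something like recorder: &stoppableRecorder{ctx: ctx, delegate: mgr.GetEventRecorderFor("pod-controller")}, with the context created before the controller is registered.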

@mvladev

mvladev commented Feb 3, 2021

The error above was produced with sigs.k8s.io/controller-runtime v0.7.1.

@alvaroaleman
Member

The broadcaster can be shut down, but it appears it doesn't properly handle events after that (should it?). See also https://github.com/kubernetes/kubernetes/pull/95664/files for a recent change by @DirectXMan12 to make it shut down properly.

@mvladev

mvladev commented Feb 4, 2021

I've applied the changes from kubernetes/kubernetes#95664, however the behavior is still the same.

What I've done is check whether the broadcaster has already been shut down:

diff --git a/staging/src/k8s.io/apimachinery/pkg/watch/mux.go b/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
index e01d519060b..b67c729d967 100644
--- a/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
+++ b/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
@@ -211,14 +212,32 @@ func (m *Broadcaster) closeAll() {
 
 // Action distributes the given event among all watchers.
 func (m *Broadcaster) Action(action EventType, obj runtime.Object) {
-	m.incoming <- Event{action, obj}
+	select {
+	case <-m.stopped:
+		return
+	default:
+		m.incoming <- Event{action, obj}
+	}
 }
 
 // Action distributes the given event among all watchers, or drops it on the floor
 // if too many incoming actions are queued up.  Returns true if the action was sent,
 // false if dropped.
 func (m *Broadcaster) ActionOrDrop(action EventType, obj runtime.Object) bool {
+	// select ordering is not deterministic, and it would match adding the event to
+	// the incoming channel. A separate select is needed to avoid this problem.
+	// https://golang.org/ref/spec#Select_statements
+	select {
+	case <-m.stopped:
+		return false
+	default:
+	}
+
 	select {
+	// very unlikely that the broadcaster might be stopped while this is called,
+	// but just to be on the safe side.
+	// case <-m.stopped:
+	// 	return false
 	case m.incoming <- Event{action, obj}:
 		return true
 	default:
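
With that change applied, sending to a stopped broadcaster becomes a no-op instead of a panic. A quick sanity check against the patched Broadcaster above (assumes the field and method names from the diff):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	b := watch.NewBroadcaster(25, watch.DropIfChannelFull)
	b.Shutdown()

	// Action now returns early rather than panicking, and ActionOrDrop
	// reports that the event was dropped.
	b.Action(watch.Added, &corev1.Pod{})
	fmt.Println(b.ActionOrDrop(watch.Added, &corev1.Pod{})) // false
}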

@vincepri
Member

Folks, could you test the latest v0.9.0-alpha.1 tag which includes #1451?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label on Jun 23, 2021
@k8s-triage-robot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) label and removed the lifecycle/stale label on Jul 26, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to the /close in the triage comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
