Panic in Event Broadcaster #1367

Closed
xrstf opened this issue Feb 3, 2021 · 11 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@xrstf
Contributor

xrstf commented Feb 3, 2021

While updating from controller-runtime 0.6.5 to 0.8.1, I noticed that our code suddenly began panicking:

E0203 15:15:41.247244 3880689 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)

goroutine 751 [running]:
vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1744ee0, 0x1ae6a00)
        /home/.../vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/.../vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x1744ee0, 0x1ae6a00)
        /usr/local/go/src/runtime/panic.go:969 +0x175
vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
        /home/.../vendor/k8s.io/apimachinery/pkg/watch/mux.go:189
vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc000f83680, 0xc000afd480)
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:345 +0x14e
created by vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:341 +0x3bd
panic: send on closed channel [recovered]
        panic: send on closed channel

goroutine 751 [running]:
vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/.../vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1744ee0, 0x1ae6a00)
        /usr/local/go/src/runtime/panic.go:969 +0x175
vendor/k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
        /home/.../vendor/k8s.io/apimachinery/pkg/watch/mux.go:189
vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc000f83680, 0xc000afd480)
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:345 +0x14e
created by vendor/k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
        /home/.../vendor/k8s.io/client-go/tools/record/event.go:341 +0x3bd

The offending code creates an Event after a reconcile loop ends in an error:

myRecorder := myManager.GetEventRecorderFor(myControllerName) //  this is done once when the controller is created

// later, in Reconcile(), I have this line:
myRecorder.Event(&myObject, corev1.EventTypeWarning, "ReconcilingError", err.Error())
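
For context on where the panic comes from: the recorder hands the event off asynchronously to the underlying watch.Broadcaster, whose incoming channel is closed once the broadcaster is shut down during manager shutdown. A minimal sketch of that mechanism, assuming the apimachinery v0.20.x Broadcaster (where Shutdown() closes the incoming channel):

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	// The event recorder writes to a broadcaster much like this one.
	b := watch.NewBroadcaster(25, watch.DropIfChannelFull)

	// Shutting the manager down eventually calls Shutdown(), which closes
	// the broadcaster's incoming channel.
	b.Shutdown()

	// Any recorder.Event(...) still in flight ends up here and panics with
	// "send on closed channel".
	b.Action(watch.Added, &corev1.Pod{})
}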
@xrstf
Contributor Author

xrstf commented Feb 3, 2021

Looking deeper into it, it seems this can be caused by our own controller. We dynamically start and stop Managers at runtime to react to kubeconfigs appearing and disappearing in the main cluster (for each of these kubeconfig Secrets, we reconcile a set of controllers).

To prevent each new controller instance from having to re-cache everything, we use the same cache for all instances, i.e. we override NewCache:

mgr, err := manager.New(r.masterKubeCfg, manager.Options{
	LeaderElection:     false,
	MetricsBindAddress: "0",
	// Avoid duplicating caches or client for main cluster, as it's static.
	NewCache: func(_ *rest.Config, _ cache.Options) (cache.Cache, error) {
		return prewarmedCache, nil
	},
})

Removing the custom NewCache function makes the panic go away.
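
For completeness, the prewarmedCache referenced above is presumably created and started once, along these lines (a sketch against the controller-runtime v0.8 cache API; newPrewarmedCache is just an illustrative helper name, not our actual code):

package sharedcache

import (
	"context"
	"errors"

	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newPrewarmedCache builds a cache for the main cluster, starts it in the
// background and waits for the initial sync, so that it can be handed to
// every dynamically created manager afterwards.
func newPrewarmedCache(ctx context.Context, cfg *rest.Config) (cache.Cache, error) {
	c, err := cache.New(cfg, cache.Options{})
	if err != nil {
		return nil, err
	}

	go func() {
		// Runs until ctx is cancelled.
		_ = c.Start(ctx)
	}()

	if !c.WaitForCacheSync(ctx) {
		return nil, errors.New("cache did not sync")
	}
	return c, nil
}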

@xrstf
Contributor Author

xrstf commented Feb 3, 2021

And going further: to prevent the shared cache from being started and stopped over and over, we wrapped it in this beauty:

// unstartableCache is used to prevent the ctrlruntime manager from starting the
// cache *again*, just after we started and initialized it.
type unstartableCache struct {
	cache.Cache
}

func (m *unstartableCache) Start(ctx context.Context) error {
	return nil
}

func (m *unstartableCache) WaitForCacheSync(ctx context.Context) bool {
	return true
}

Judging from the InformersMap's Start() function:

// Start calls Run on each of the informers and sets started to true.  Blocks on the context.
func (m *InformersMap) Start(ctx context.Context) error {
	go m.structured.Start(ctx)
	go m.unstructured.Start(ctx)
	go m.metadata.Start(ctx)
	<-ctx.Done()
	return nil
}

... it seems our wrapper is missing that blocking behavior and instead returns immediately. Patching the unstartableCache to have

func (m *unstartableCache) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

makes the panic go away. So I guess this is just me not having completely adjusted everything to the new context handling in controller-runtime 0.7+?
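
Putting both pieces together, the adjusted wrapper looks like this (same assumptions as above: the cache has already been started and synced elsewhere, so the manager only needs Start to block until its context is done):

package sharedcache

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// unstartableCache wraps an already-started cache so that the manager neither
// starts it a second time nor returns from Start immediately.
type unstartableCache struct {
	cache.Cache
}

// Start blocks until the manager's context is cancelled, mirroring what
// InformersMap.Start does.
func (m *unstartableCache) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// WaitForCacheSync reports true because the wrapped cache was synced before
// being handed to the manager.
func (m *unstartableCache) WaitForCacheSync(ctx context.Context) bool {
	return true
}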

@mvladev

mvladev commented Feb 3, 2021

I was just about to write an issue about this:

{"level":"info","ts":1612367875.5065799,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1612367875.506887,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1612367875.50695,"logger":"controller-runtime.manager.controller.pod","msg":"Starting EventSource","reconciler group":"","reconciler kind":"Pod","source":"kind source: /, Kind="}
{"level":"info","ts":1612367875.6113448,"logger":"controller-runtime.manager.controller.pod","msg":"Starting Controller","reconciler group":"","reconciler kind":"Pod"}
{"level":"info","ts":1612367875.611406,"logger":"controller-runtime.manager.controller.pod","msg":"Starting workers","reconciler group":"","reconciler kind":"Pod","worker count":1}
{"level":"info","ts":1612367875.611577,"logger":"controllers.pod-controller","msg":"reconciling","pod":"kube-system/kube-controller-manager-docker-desktop"}
{"level":"info","ts":1612367878.5104551,"logger":"controller-runtime.manager.controller.pod","msg":"Stopping workers","reconciler group":"","reconciler kind":"Pod"}
{"level":"info","ts":1612367879.614225,"logger":"controllers.pod-controller","msg":"reconciling","pod":"my-ns/my-pod"}
E0203 17:57:59.614346   36681 runtime.go:78] Observed a panic: "send on closed channel" (send on closed channel)
goroutine 340 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x211b7e0, 0x24433e0)
         /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
         /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x89
panic(0x211b7e0, 0x24433e0)
        /usr/local/opt/go/libexec/src/runtime/panic.go:969 +0x1b9
k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
         /go/src/k8s.io/apimachinery/pkg/watch/mux.go:189
k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc00072a000, 0xc000218500)
         /go/src/k8s.io/client-go/tools/record/event.go:344 +0xb7
created by k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
         /go/src/k8s.io/client-go/tools/record/event.go:341 +0x32f
panic: send on closed channel [recovered]
        panic: send on closed channel

goroutine 340 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
         /go/src/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x211b7e0, 0x24433e0)
        /usr/local/opt/go/libexec/src/runtime/panic.go:969 +0x1b9
k8s.io/apimachinery/pkg/watch.(*Broadcaster).Action(...)
         /go/src/k8s.io/apimachinery/pkg/watch/mux.go:189
k8s.io/client-go/tools/record.(*recorderImpl).generateEvent.func1(0xc00072a000, 0xc000218500)
         /go/src/k8s.io/client-go/tools/record/event.go:344 +0xb7
created by k8s.io/client-go/tools/record.(*recorderImpl).generateEvent
         /go/src/k8s.io/client-go/tools/record/event.go:341 +0x32f
exit status 2
package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type podController struct {
	recorder record.EventRecorder
	client.Client
}

func (p *podController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ctrl.Log.WithName("controllers").WithName("pod-controller").WithValues("pod", req.NamespacedName).Info("reconciling")

	result := &corev1.Pod{}
	err := p.Get(ctx, req.NamespacedName, result)
	if err != nil {
		return reconcile.Result{}, err
	}

	// simulate some work
	time.Sleep(time.Second * 4)

	p.recorder.Event(result, "Warning", "some reason", "some message")
	return reconcile.Result{}, nil
}

func main() {
	ctrl.SetLogger(zap.New())

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection: false,
	})
	if err != nil {
		os.Exit(1)
	}

	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(&podController{
		Client:   mgr.GetClient(),
		recorder: mgr.GetEventRecorderFor("pod-controller"),
	}); err != nil {
		ctrl.Log.Error(err, "unable to add pod controller")
		os.Exit(1)
	}

	ctx, cancel := context.WithCancel(context.Background())

	go func() {
		if err := mgr.Start(ctx); err != nil {
			ctrl.Log.Error(err, "unable to continue running manager")
			os.Exit(1)
		}
	}()

	// let it run for a couple of seconds so it can start reconciling
	time.Sleep(time.Second * 3)
	// stop the manager by cancelling its context
	cancel()
	// let it run for a couple more seconds
	time.Sleep(time.Second * 4)
}

The code above fails reliably. What's happening:

  • The controller manager, the Pod controller and the event recorder start
  • The Pod controller receives one Pod to reconcile
  • The reconcile operation is slowed down by artificial work (time.Sleep(time.Second * 4))
  • While that reconciliation happens, the controller manager is shut down
  • After the 4 seconds pass, the controller tries to send an event through the event Broadcaster, which causes the panic

In general, I think that client-go's recorder should accept a Context so it can be aware of such cancellations.
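
Until that happens, a wrapper on the caller side can at least narrow the window. A hedged sketch (stoppableRecorder is an illustrative name; the interface matches record.EventRecorder as of client-go v0.20, and this only shrinks the race rather than eliminating it, since the broadcaster can still be shut down between the check and the send):

package recorderwrap

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// stoppableRecorder drops events once the given context is cancelled instead
// of forwarding them to a possibly closed broadcaster.
type stoppableRecorder struct {
	ctx      context.Context
	delegate record.EventRecorder
}

func (r *stoppableRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	if r.ctx.Err() != nil {
		return // manager is shutting down, drop the event
	}
	r.delegate.Event(object, eventtype, reason, message)
}

func (r *stoppableRecorder) Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{}) {
	if r.ctx.Err() != nil {
		return
	}
	r.delegate.Eventf(object, eventtype, reason, messageFmt, args...)
}

func (r *stoppableRecorder) AnnotatedEventf(object runtime.Object, annotations map[string]string, eventtype, reason, messageFmt string, args ...interface{}) {
	if r.ctx.Err() != nil {
		return
	}
	r.delegate.AnnotatedEventf(object, annotations, eventtype, reason, messageFmt, args...)
}

In the reproducer above, the controller would then be wired up with something like recorder: &stoppableRecorder{ctx: ctx, delegate: mgr.GetEventRecorderFor("pod-controller")}, with the context created before the controller is registered.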

@mvladev

mvladev commented Feb 3, 2021

The error above was produced with sigs.k8s.io/controller-runtime v0.7.1.

@alvaroaleman
Member

The broadcaster can be shut down, but it appears it doesn't properly handle events after that (should it?). See also https://github.com/kubernetes/kubernetes/pull/95664/files for a recent change by @DirectXMan12 to make it shut down properly.

@mvladev

mvladev commented Feb 4, 2021

I've applied the changes from kubernetes/kubernetes#95664, however the behavior is still the same.

What I've done is check whether the broadcaster has already been shut down:

diff --git a/staging/src/k8s.io/apimachinery/pkg/watch/mux.go b/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
index e01d519060b..b67c729d967 100644
--- a/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
+++ b/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
@@ -211,14 +212,32 @@ func (m *Broadcaster) closeAll() {
 
 // Action distributes the given event among all watchers.
 func (m *Broadcaster) Action(action EventType, obj runtime.Object) {
-	m.incoming <- Event{action, obj}
+	select {
+	case <-m.stopped:
+		return
+	default:
+		m.incoming <- Event{action, obj}
+	}
 }
 
 // Action distributes the given event among all watchers, or drops it on the floor
 // if too many incoming actions are queued up.  Returns true if the action was sent,
 // false if dropped.
 func (m *Broadcaster) ActionOrDrop(action EventType, obj runtime.Object) bool {
+	// select ordering is not deterministic, and it would match adding the event to
+	// the incoming channel. A separate select is needed to avoid this problem.
+	// https://golang.org/ref/spec#Select_statements
+	select {
+	case <-m.stopped:
+		return false
+	default:
+	}
+
 	select {
+	// very unlikely that the broadcaster might be stopped while this is called,
+	// but just to be on the safe side.
+	// case <-m.stopped:
+	// 	return false
 	case m.incoming <- Event{action, obj}:
 		return true
 	default:
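
With that change applied, sending to a stopped broadcaster becomes a no-op instead of a panic. A quick sanity check against the patched Broadcaster above (assumes the field and method names from the diff):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func main() {
	b := watch.NewBroadcaster(25, watch.DropIfChannelFull)
	b.Shutdown()

	// Action now returns early rather than panicking, and ActionOrDrop
	// reports that the event was dropped.
	b.Action(watch.Added, &corev1.Pod{})
	fmt.Println(b.ActionOrDrop(watch.Added, &corev1.Pod{})) // false
}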

@vincepri
Member

Folks, could you test the latest v0.9.0-alpha.1 tag which includes #1451?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label on Jun 23, 2021
@k8s-triage-robot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) label and removed the lifecycle/stale label on Jul 26, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to the /close in the triage comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
