Panic in Event Broadcaster #1367
Looking deeper into it, it seems this can be caused by our own controller. We dynamically start and stop Managers at runtime, to react to kubeconfigs appearing/disappearing in the main cluster (and for each of these kubeconfig Secrets, we would reconcile a set of controllers). To prevent each new controller instance from having to re-cache everything, we used the same cache for all instances, i.e. we override NewCache:
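The override itself was not quoted above; the following is only a minimal sketch of what such a NewCache hook could look like. The sharedCache variable and the helper name are assumptions for illustration, not the reporter's actual code.

package sharedcache

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// sharedCache is assumed to be constructed and started once, elsewhere, and
// then reused by every dynamically created manager.
var sharedCache cache.Cache

// newManagerWithSharedCache builds a manager whose NewCache hook hands back
// the shared cache instead of constructing (and later starting) a fresh one.
func newManagerWithSharedCache(cfg *rest.Config) (ctrl.Manager, error) {
	return ctrl.NewManager(cfg, ctrl.Options{
		NewCache: func(_ *rest.Config, _ cache.Options) (cache.Cache, error) {
			return sharedCache, nil
		},
	})
}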
Removing the custom NewCache function makes the panic go away.
And going further, to prevent starts and stops, we wrapped a cache in this beauty:

// unstartableCache is used to prevent the ctrlruntime manager from starting the
// cache *again*, just after we started and initialized it.
type unstartableCache struct {
	cache.Cache
}

// Start is a no-op: it returns immediately instead of running the wrapped cache.
func (m *unstartableCache) Start(ctx context.Context) error {
	return nil
}

func (m *unstartableCache) WaitForCacheSync(ctx context.Context) bool {
	return true
}

Judging from the InformersMap's Start implementation:

// Start calls Run on each of the informers and sets started to true. Blocks on the context.
func (m *InformersMap) Start(ctx context.Context) error {
	go m.structured.Start(ctx)
	go m.unstructured.Start(ctx)
	go m.metadata.Start(ctx)
	<-ctx.Done()
	return nil
}

...it seemed we were missing the blocking and instead returning immediately. Patching the wrapper's Start to

func (m *unstartableCache) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

makes the panic go away. So I guess this is just me not completely adjusting everything to the new context handling in controller-runtime 0.7+?
I was just about to write an issue about this:
package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type podController struct {
	recorder record.EventRecorder
	client.Client
}

func (p *podController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	ctrl.Log.WithName("controllers").WithName("pod-controller").WithValues("pod", req.NamespacedName).Info("reconciling")
	result := &corev1.Pod{}
	err := p.Get(ctx, req.NamespacedName, result)
	if err != nil {
		return reconcile.Result{}, err
	}
	// simulate some work
	time.Sleep(time.Second * 4)
	p.recorder.Event(result, "Warning", "some reason", "some message")
	return reconcile.Result{}, nil
}

func main() {
	ctrl.SetLogger(zap.New())
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection: false,
	})
	if err != nil {
		os.Exit(1)
	}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(&podController{
		Client:   mgr.GetClient(),
		recorder: mgr.GetEventRecorderFor("pod-controller"),
	}); err != nil {
		ctrl.Log.Error(err, "unable to add pod controller")
		os.Exit(1)
	}
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		if err := mgr.Start(ctx); err != nil {
			ctrl.Log.Error(err, "unable to continue running manager")
			os.Exit(1)
		}
	}()
	// let it run for a couple of seconds so it can start reconciling
	time.Sleep(time.Second * 3)
	// stop channel
	cancel()
	// let it run for a couple more seconds
	time.Sleep(time.Second * 4)
}
The code above fails reliably every time. What's happening: the manager's context is cancelled (after 3 seconds) while the reconciler is still sleeping; the shutdown stops the event broadcaster, so the recorder.Event call that follows writes to the already-stopped broadcaster and panics.
In general I think that client-go's recorder should accept a Context so it can be aware of such cancellations.
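In the meantime, a caller-side workaround could be a context-aware wrapper around the recorder. The droppingRecorder type below is just a sketch of that idea, not part of client-go; Eventf and AnnotatedEventf would need the same guard, and there is still a small window where the context is cancelled between the check and the delegate call.

package eventutil

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// droppingRecorder drops events once ctx is cancelled instead of forwarding
// them to a broadcaster that may already have been shut down.
type droppingRecorder struct {
	record.EventRecorder
	ctx context.Context
}

func (r *droppingRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	if r.ctx.Err() != nil {
		return // manager is stopping; silently drop the event
	}
	r.EventRecorder.Event(object, eventtype, reason, message)
}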
The error from above is for
The broadcaster can be shut down, but it appears it doesn't properly handle events after that (should it?). See also https://github.com/kubernetes/kubernetes/pull/95664/files for a recent change by @DirectXMan12 to make it shut down properly.
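For illustration only (this is a sketch, not the actual mux.go code): once Shutdown has closed the broadcaster's incoming channel, a later Action issued by the recorder amounts to a send on a closed channel, which is the kind of panic reported here.

package main

func main() {
	incoming := make(chan struct{}, 1)
	close(incoming) // roughly what the broadcaster's Shutdown does to its queue
	// roughly what a recorder.Event issued after shutdown ends up doing:
	incoming <- struct{}{} // panic: send on closed channel
}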
I've applied the changes from kubernetes/kubernetes#95664, but the behavior is still the same. What I've done is check whether shutdown of the broadcaster has already happened:

diff --git a/staging/src/k8s.io/apimachinery/pkg/watch/mux.go b/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
index e01d519060b..b67c729d967 100644
--- a/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
+++ b/staging/src/k8s.io/apimachinery/pkg/watch/mux.go
@@ -211,14 +212,32 @@ func (m *Broadcaster) closeAll() {
 // Action distributes the given event among all watchers.
 func (m *Broadcaster) Action(action EventType, obj runtime.Object) {
-	m.incoming <- Event{action, obj}
+	select {
+	case <-m.stopped:
+		return
+	default:
+		m.incoming <- Event{action, obj}
+	}
 }
 
 // Action distributes the given event among all watchers, or drops it on the floor
 // if too many incoming actions are queued up. Returns true if the action was sent,
 // false if dropped.
 func (m *Broadcaster) ActionOrDrop(action EventType, obj runtime.Object) bool {
+	// select ordering is not deterministic, and it would match adding the event to
+	// the incoming channel. A separate select is needed to avoid this problem.
+	// https://golang.org/ref/spec#Select_statements
+	select {
+	case <-m.stopped:
+		return false
+	default:
+	}
+
 	select {
+	// very unlikely that the broadcaster might be stopped while this is called,
+	// but just to be on the safe side.
+	// case <-m.stopped:
+	//	return false
 	case m.incoming <- Event{action, obj}:
 		return true
 	default:
 		return false
 	}
 }
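The reason for the separate select in that patch can be shown with a small standalone demo (not from the patch itself): when more than one case of a select is ready, Go picks among them pseudo-randomly, so a single select over <-m.stopped and the send could still enqueue an event after the broadcaster has stopped.

package main

import "fmt"

func main() {
	stopped := make(chan struct{})
	close(stopped) // the "already stopped" case is always ready
	incoming := make(chan int, 1)

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		select {
		case <-stopped:
			counts["stopped"]++
		case incoming <- i:
			counts["sent"]++
			<-incoming // drain so the send case stays ready next round
		}
	}
	fmt.Println(counts) // both branches are taken, roughly half the time each
}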
Folks, could you test the latest
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
While updating from 0.6.5 to 0.8.1, I noticed that our code suddenly began panicking:
The offending code tries to create an Event after a reconcile loop ended in an error:
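A hypothetical sketch of that pattern (the reconciler name, fields, and failing call are illustrative, not the actual code from the report):

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type exampleReconciler struct {
	client.Client
	recorder record.EventRecorder
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &corev1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, err
	}
	if err := doSomething(ctx, obj); err != nil {
		// The reconcile failed; record a Warning event. If the manager has
		// already been shut down, this is the call that panics.
		r.recorder.Eventf(obj, corev1.EventTypeWarning, "ReconcileError", "reconcile failed: %v", err)
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}

// doSomething stands in for whatever work the real reconciler performs.
func doSomething(ctx context.Context, pod *corev1.Pod) error { return nil }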