-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support double signal handling #285
Support double signal handling #285
Conversation
cmd/habitat-operator/main.go
Outdated
case <-ctx.Done(): | ||
level.Info(logger).Log("msg", "context channel closed, exiting") | ||
} | ||
cancelFunc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that after calling cancelFunc
, we should do <- term
again followed by os.Exit(1)
or something. So the second ctrl-c will quit the operator immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would be nice to figure out with calling cancelFunc
twice won't give us problems. First call is after getting SIGTERM (or SIGINT) and the second call is deferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the first call, subsequent calls to a cancelFunc
will do nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, see docs.
9ae26eb
to
46ed0b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been reviewing this PR and I realized that for graceful shutdown of the operator using just cancelFunc()
on SIGTERM is not enough. If we have main
function like this:
func main() {
ctx, cancelFunc := context.WithCancel(context.Background())
defer cancelFunc()
if err := v1beta1(ctx, …); err != nil {
…
os.Exit(1)
}
if err := v1beta2(ctx, …); err != nil {
…
os.Exit(1)
}
term := make(chan os.Signal)
// Relay these signals to the `term` channel.
signal.Notify(term, syscall.SIGINT, syscall.SIGTERM)
go func() {
<-term
level.Info(logger).Log("msg", "received SIGTERM, exiting gracefully...")
cancelFunc()
<-term
os.Exit(1)
}()
<-ctx.Done()
level.Info(logger).Log("msg", "context channel closed, exiting")
os.Exit(0)
}
Then pressing ctrl-c will call cancelFunc()
, which basically closes the ctx.Done()
channel. Thus the main
function will happily proceed to call os.Exit(0)
clobbering all the goroutines that didn't have a chance to react to the closing of the context channel.
I think we also should use sync.WaitGroup
. Like so, maybe:
func main() {
ctx, cancelFunc := context.WithCancel(context.Background())
// we don't do defer cancelFunc() anymore
var wg sync.WaitGroup
wg.Add(2) // 2, for each controller we run.
if err := v1beta1(ctx, &wg, …); err != nil {
…
os.Exit(1)
}
if err := v1beta2(ctx, &wg, …); err != nil {
…
os.Exit(1)
}
term := make(chan os.Signal)
// Relay these signals to the `term` channel.
signal.Notify(term, syscall.SIGINT, syscall.SIGTERM)
go func() {
<-term
level.Info(logger).Log("msg", "received SIGTERM, exiting gracefully...")
cancelFunc()
<-term
os.Exit(1)
}()
wg.Wait()
level.Info(logger).Log("msg", "controllers stopped, exiting")
os.Exit(0)
}
Each controller would then need to call wg.Done()
when it is done. I suppose that by being "done" I should mean that all the goroutines the controller spawns reacted to the closing of the context channel, by calling controller_specific_wg.Done()
, where controller_specific_wg
is yet another WaitGroup, this time created by the controller.
So in short - use context.Context
to tell goroutines to wrap up their work, and sync.WaitGroup
to learn if they did so, so we can gracefully quit.
cmd/habitat-operator/main.go
Outdated
@@ -112,12 +112,17 @@ func run() int { | |||
// Relay these signals to the `term` channel. | |||
signal.Notify(term, syscall.SIGINT, syscall.SIGTERM) | |||
|
|||
select { | |||
case <-term: | |||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make term
a buffered channel of size 2, since we know how many signals it will receive.
cmd/habitat-operator/main.go
Outdated
select { | ||
case <-term: | ||
go func() { | ||
<-term | ||
level.Info(logger).Log("msg", "received SIGTERM, exiting gracefully...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also be SIGINT
, so I'd remove mention of the specific signal and just say: "received termination signal, exiting gracefully..."
cmd/habitat-operator/main.go
Outdated
}() | ||
|
||
<-ctx.Done() | ||
level.Info(logger).Log("msg", "context channel closed, exiting") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message is now redundant when exiting because of a signal. Are there cases when ctx
gets closed and it's not because of signals?
I don't think so, so this message can probably go.
Great writeup @krnowak! |
bc07d5a
to
aee6526
Compare
@krnowak It's ready for another review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not enough, unfortunately. What I wanted to achieve is that we wait for all the goroutines we spawn to finish, before we quit the operator. With the current changes we only wait for the 2 goroutines to finish, the go controller.Run(…)
ones.
You can find the goroutines we spawn with git grep 'go ' -- cmd pkg/controller
(note the space after go
).
So, basically if you have a function that takes a context as a parameter, spawns a bunch of goroutines and that waits until context channel is closed, like this:
func foo(ctx context.Context, …) {
go fun1(ctx);
go fun2(ctx);
for i := 0; i < X; i++) {
go worker(ctx);
}
<- ctx.Done()
}
Then the change you could do is something like:
func foo(ctx context.Context, …) {
wg := sync.WaitGroup{}
wg.Add(2 + X)
go func() {
fun1(ctx);
wg.Done()
}()
go func() {
fun2(ctx);
wg.Done()
}()
for i := 0; i < X; i++) {
go func() {
worker(ctx);
wg.Done()
}()
}
// I'm not even sure if this line is needed anymore,
// because we wait for the waitgroup anyway
<- ctx.Done()
wg.Wait()
}
If the foo
function itself is being run in the goroutine, then you could convert the go
call into the same go func(){…}()
like I did for fun1
inside foo
.
The v1beta1
function in main.go
doesn't have that <-ctx.Done()
thing at its end, but you could transform the go
call just the same. That way you don't need to modify the prototype of the Run()
functions for example. But passing wg
to v1beta1
or v1beta2
functions still makes sense, though.
The v1beta2
function is a bit more tricky, because it spawns 3 goroutines (2 for some factories and 1 for the controller). I think I would just something like:
factoriesWg := sync.WaitGroup
factoriesWg.Add(2)
go func() {
someFactoryFunctionOrSomething1(ctx)
factoriesWg.Done()
}()
go func() {
someFactoryFunctionOrSomething2(ctx)
factoriesWg.Done()
}()
go func() {
controller.Run(…)
factoriesWg.Wait()
wg.Done()
}()
Hopefully it makes it clear what I meant by some controller_specific_wg
in my previous writeup.
One thing I missed. The |
Another thing - likely all the |
033efaf
to
3abe3a3
Compare
@krnowak Another one :) |
cmd/habitat-operator/main.go
Outdated
defer cancelFunc() | ||
|
||
var wg sync.WaitGroup | ||
wg.Add(4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer this number to remain 2. This is because we pass the waitgroup twice. I understand that you made it 4, because v1beta2
function spawns not one goroutine, but three, so you needed to call Wait
four times, not two. Still, I would prefer to keep the number of spawned goroutines in the function as an implementation detail. That way it can be easier to reason about the code, like "we create a waitgroup with a counter set to 2 and pass it to two functions, so if I add a third function (like v1beta3
), I need to bump this number to 3".
Right now the counter is 4, but we pass it to only two functions, so it may left the reader wondering why is that (do both functions decrement the counter twice?) and to be able to answer that, the reader needs to check both v1beta1
and v1beta2
functions to see what they do. Another problem - you need to remember to bump this number whenever you add another goroutine in either function, which is easy to forget, because the counter is not really in the vicinity of a body of either of the functions.
So, for the v1beta2
function, I posted a possible solution at #285 (review) (see the last snippet there, with the factoriesWg
thing).
cmd/habitat-operator/main.go
Outdated
@@ -144,12 +156,15 @@ func v1beta1(ctx context.Context, cSets Clientsets, logger log.Logger) error { | |||
return err | |||
} | |||
|
|||
go controller.Run(runtime.NumCPU(), ctx) | |||
go func() { | |||
go controller.Run(runtime.NumCPU(), ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you are at it, could please swap the order of parameters here? context.Context
should be the first parameter of a function. I suppose this is some kind of an expectation from static analysis tools or something.
Others could say that passing ctx
as a first parameter is idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and drop the go
here. We don't want to spawn a goroutine inside the goroutine.
cmd/habitat-operator/main.go
Outdated
|
||
go controller.Run(runtime.NumCPU(), ctx) | ||
go func() { | ||
go controller.Run(runtime.NumCPU(), ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here about swapping the order of parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop go
too.
pkg/controller/v1beta1/controller.go
Outdated
@@ -232,7 +253,16 @@ func (hc *HabitatController) watchPods(ctx context.Context) { | |||
DeleteFunc: hc.handlePodDelete, | |||
}) | |||
|
|||
go c.Run(ctx.Done()) | |||
var wg sync.WaitGroup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this looks wrong. It looks like an elaborate way of saying c.Run(ctx.Done())
. ;)
What I think needs to be done there is that the watchPods
function should receive waitgroup as a parameter, instead of creating it and it should not call wg.Wait()
on it. Because right now, watchPods
function will block, while it was meant to be a nonblocking function that spawns a goroutine.
pkg/controller/v1beta1/controller.go
Outdated
go hc.deployInformer.Run(ctx.Done()) | ||
go hc.cmInformer.Run(ctx.Done()) | ||
var wg sync.WaitGroup | ||
wg.Add(3 + workers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will need to be bumped to 4, if we start passing it to hc.watchPods
too.
pkg/controller/v1beta2/controller.go
Outdated
wg.Add(3 + workers) | ||
|
||
go func() { | ||
go hc.habInformer.Run(ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop go
.
pkg/controller/v1beta2/controller.go
Outdated
}() | ||
|
||
go func() { | ||
go hc.stsInformer.Run(ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop go
.
pkg/controller/v1beta2/controller.go
Outdated
}() | ||
|
||
go func() { | ||
go hc.cmInformer.Run(ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop go
.
pkg/controller/v1beta2/controller.go
Outdated
@@ -176,12 +188,21 @@ func (hc *HabitatController) Run(workers int, ctx context.Context) error { | |||
// failed job, it will be restarted after a delay of 1 second. | |||
for i := 0; i < workers; i++ { | |||
level.Debug(hc.logger).Log("msg", "Starting worker", "id", i) | |||
go wait.Until(hc.worker, time.Second, ctx.Done()) | |||
go func() { | |||
go wait.Until(hc.worker, time.Second, ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop go
.
pkg/controller/v1beta2/controller.go
Outdated
wg.Add(1) | ||
|
||
go func() { | ||
go c.Run(ctx.Done()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop go
.
7d72e56
to
dfe1d04
Compare
Signed-off-by: Kosy Anyanwu <kosy@kinvolk.io>
dfe1d04
to
26f0ba1
Compare
@krnowak Another one 💯 |
LFAD. |
Fixes #220