Add timeout to rungroup shutdown #1481

RebeccaMahany · 2023-11-29T20:11:43Z

Ensure that the rungroup still exits promptly even if a rungroup actor a) doesn't interrupt in a timely manner or b) doesn't terminate its execution function in a timely manner. This prevents any single actor from holding launcher in a suspended state, where it isn't doing anything but also isn't exiting in order to reload itself.

Relates to #1205


directionless

I think this is okay. I need to chew on it a little; it has surprising complexity.

directionless · 2023-11-30T04:50:29Z

pkg/rungroup/rungroup.go

-		e := <-errors
-		level.Debug(g.logger).Log("msg", "successfully interrupted actor", "actor", e.errorSourceName, "index", i)
+		select {
+		case <-timeoutTimer.C:


I think this case potentially leaks goroutines, which might leave a zombie process? That might be better than hanging, or it might need a bigger hammer.

RebeccaMahany · 2023-11-30

It will in the case of an autoupdate where we're reloading launcher, yeah -- that seemed preferable to hanging. The bigger hammer I can think of is to just call os.Exit and let launchctl/systemctl/service manager restart launcher, but I know we've been hesitant to do that in the past.

directionless · 2023-11-30T04:53:08Z

pkg/rungroup/rungroup.go

@@ -65,15 +73,39 @@ func (g *Group) Run() error {
 	level.Debug(g.logger).Log("msg", "received interrupt error from first actor -- shutting down other actors", "err", initialActorErr)

 	// Signal all actors to stop.
+	numActors := int64(len(g.actors))
+	interruptWait := semaphore.NewWeighted(numActors)


semaphore.NewWeighted is interesting as compared to a waitgroup. I think it's good, though a little harder to think about all the edges.

Maybe cleaner than the channels I used for osquery/osquery-go#108, less sure about perf. But that's not a real concern here.

directionless

I think it's good to try.

It feels a little weird, because this rungroup code has all this shutdown/cleanup logic hiding in the start method. But that's where it's always been, and it seems fine.

RebeccaMahany added 2 commits November 29, 2023 14:42

Add timeouts for rungroup shutdown so that we won't wait indefinitely…

043918a

… on a particular actor

Small logging improvements

2a1fbee

RebeccaMahany mentioned this pull request Nov 29, 2023

Our run group idiom is confusing #1205

Open

directionless reviewed Nov 30, 2023


RebeccaMahany added 2 commits November 30, 2023 16:10

Merge branch 'main' into becca/rungroup-shutdown-timeout

e9d6928

Merge branch 'main' into becca/rungroup-shutdown-timeout

83bada1

directionless approved these changes Dec 1, 2023


RebeccaMahany added this pull request to the merge queue Dec 4, 2023

Merged via the queue into kolide:main with commit f64926b Dec 4, 2023
24 checks passed

RebeccaMahany deleted the becca/rungroup-shutdown-timeout branch December 4, 2023 14:54
