Fix agent shutdown on SIGINT #1258
go.mod:

```diff
@@ -1,6 +1,6 @@
 module github.com/elastic/elastic-agent
 
-go 1.17
+go 1.18
 
 require (
 	github.com/Microsoft/go-winio v0.5.2
```
```diff
@@ -124,7 +124,7 @@ func newFleetGatewayWithScheduler(
 		acker:        acker,
 		stateFetcher: stateFetcher,
 		stateStore:   stateStore,
-		errCh:        make(chan error),
+		errCh:        make(chan error, 1),
 	}, nil
 }
```

Reviewer: Same here? The …

Author: Fixed managed_mode coordination with the fleet gateway. Now the gateway error-reading loop waits until the gateway exits. Otherwise, if the gateway shuts down out of sequence, it can block on errCh.
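The shutdown-ordering problem described here is a blocked send on an unbuffered channel whose reader has already exited. As a minimal sketch (not the actual elastic-agent code; the gateway type, Run, and the error string are made up for illustration), a buffer of one lets the run loop deposit its final error even when nothing is reading anymore:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// gateway is a toy stand-in for the fleet gateway: its run loop reports a
// final error on errCh when it stops.
type gateway struct {
	errCh chan error
}

// Mirrors the change in the diff above: capacity 1 lets the run loop hand
// off its last error without a reader having to be present.
func newGateway() *gateway {
	return &gateway{errCh: make(chan error, 1)}
}

func (g *gateway) Run(ctx context.Context) {
	<-ctx.Done()
	// With an unbuffered channel this send would hang forever if the
	// error-reading loop already returned; with capacity 1 it never blocks.
	g.errCh <- errors.New("gateway stopped")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	g := newGateway()
	go g.Run(ctx)

	cancel()                           // simulate SIGINT-driven shutdown
	time.Sleep(100 * time.Millisecond) // give Run time to send

	// The final error is still retrievable even though nothing was reading
	// at the moment Run sent it.
	fmt.Println(<-g.errCh)
}
```

The alternative the author describes, keeping the channel unbuffered but making the error-reading loop wait until the gateway has exited, removes the need for the buffer entirely.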
```diff
@@ -176,7 +176,10 @@ func (s *componentRuntimeState) destroy() {
 	if s.watchCanceller != nil {
 		s.watchCanceller()
 		s.watchCanceller = nil
-		<-s.watchChan
+		// Do not wait on watchChan here:
+		// the watch loop calls stateChanged, which calls "destroy" (this function), and would block forever,
+		// since the watch channel will never be closed while the watch loop is blocked on the stateChanged call.
+		// <-s.watchChan
 	}
 }
```

Reviewer: I think we need to ensure that we wait for the watcher goroutine to stop. This seems like a hacky way of fixing it, by just commenting it out. Maybe we could use a different channel to know when it's done? Or a WaitGroup?

Author: I'll rewrite how the runners are managed; it currently deadlocks waiting for this channel to be closed, while it can't be closed until you return from this function.

Author: This is rewritten now; I think it's cleaner.
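The deadlock being explained is a goroutine waiting on a channel that can only be closed after that same goroutine returns. A minimal reproduction, with hypothetical names that only mirror the shape of the real componentRuntimeState code:

```go
package main

import (
	"fmt"
	"time"
)

type runtimeState struct {
	watchChan chan struct{}
	destroyed chan struct{}
}

// watchLoop closes watchChan when it returns, but it can only return after
// stateChanged (and therefore destroy) has returned.
func (s *runtimeState) watchLoop() {
	defer close(s.watchChan)
	s.stateChanged()
}

func (s *runtimeState) stateChanged() {
	s.destroy()
}

// destroy waits for watchChan to be closed, but when it is reached from the
// watch loop itself that close can never happen: a self-deadlock.
func (s *runtimeState) destroy() {
	<-s.watchChan
	close(s.destroyed)
}

func main() {
	s := &runtimeState{
		watchChan: make(chan struct{}),
		destroyed: make(chan struct{}),
	}
	go s.watchLoop()

	select {
	case <-s.destroyed:
		fmt.Println("destroy completed")
	case <-time.After(time.Second):
		fmt.Println("destroy is still blocked on watchChan (the deadlock)")
	}
}
```

A separate done channel, or a sync.WaitGroup that only external callers wait on, as suggested in the review, breaks the cycle; the PR later reworked how the runners are managed rather than keeping the commented-out wait.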
```diff
@@ -83,7 +83,7 @@ func newRuntimeComm(logger *logger.Logger, listenAddr string, ca *authority.Cert
 		token:           token.String(),
 		cert:            pair,
 		checkinConn:     true,
-		checkinExpected: make(chan *proto.CheckinExpected),
+		checkinExpected: make(chan *proto.CheckinExpected, 1),
 		checkinObserved: make(chan *proto.CheckinObserved),
 		actionsConn:     true,
 		actionsRequest:  make(chan *proto.ActionRequest),
```

Reviewer: Why the change here? I would prefer to not buffer expected configurations.

Author: For this one, the "command" runtime sends CheckinExpected while the Beat, for example, has failed to connect over gRPC and the V2 check-in call from the client is never made, so the CheckinExpected call from the "command" runtime is blocked forever until the component comes online and kicks off the check-in sequence. We can either: …

Author: Pushed another commit with the change in runtime_comms that addresses this particular issue according to the explanation above, implementing caching of the latest state in a buffered channel of size 1.
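"Caching the latest state in a buffered channel of size 1" is a common Go idiom: drain any stale value before sending, so the send never blocks and a late reader always gets the most recent item. The sketch below shows the pattern in isolation (CheckinExpected here is a placeholder struct, not the real protobuf type, and a single sending goroutine is assumed):

```go
package main

import "fmt"

// CheckinExpected is a stand-in for the real protobuf message.
type CheckinExpected struct {
	ConfigIndex int
}

// sendExpected keeps only the newest value in a channel of capacity 1: if a
// previous value is still sitting unread, drop it and replace it. With a
// single sender, the send never blocks, even if no reader is connected yet.
func sendExpected(ch chan *CheckinExpected, msg *CheckinExpected) {
	// Drain a stale value, if any.
	select {
	case <-ch:
	default:
	}
	// The buffer slot is now free (single-sender assumption), so this send
	// completes immediately.
	ch <- msg
}

func main() {
	ch := make(chan *CheckinExpected, 1)

	// The runtime pushes several expected configs before the component ever
	// connects over gRPC; only the latest one is kept.
	for i := 1; i <= 3; i++ {
		sendExpected(ch, &CheckinExpected{ConfigIndex: i})
	}

	// When the component finally checks in, it receives the newest state.
	got := <-ch
	fmt.Println("component received config", got.ConfigIndex) // prints 3
}
```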
Reviewer: Can you explain to me why this is needed? The select/case reads from all these channels, so why would a buffered channel be needed here?
Author: All the places where the unbuffered channels were changed to buffered are the channels that were blocking on shutdown. So it was either something writing to a channel that nothing was reading from (the read loop exited before the writers were fully stopped), or reading from a channel that nothing was writing to. This fix is not ideal, I understand; we might have to fix more code in order to avoid buffered channels.
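A way to keep the channels unbuffered and still avoid stranded writers on shutdown, closer to the direction this review pushes toward, is to pair every send with a shutdown signal so a writer gives up once its reader is gone. This is a generic sketch of that pattern under assumed names, not the code the PR ended up with:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// trySend writes to an unbuffered channel but aborts if shutdown has been
// signalled, so a writer can never block forever after its reader stopped.
func trySend(ctx context.Context, ch chan<- string, msg string) bool {
	select {
	case ch <- msg:
		return true
	case <-ctx.Done():
		return false
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	ch := make(chan string) // unbuffered

	// Reader loop: exits as soon as shutdown is requested.
	go func() {
		for {
			select {
			case m := <-ch:
				fmt.Println("read:", m)
			case <-ctx.Done():
				return
			}
		}
	}()

	fmt.Println("sent before shutdown:", trySend(ctx, ch, "hello"))

	cancel() // simulate SIGINT: the reader goroutine returns
	time.Sleep(50 * time.Millisecond)

	// Without the ctx.Done case this send would block forever.
	fmt.Println("sent after shutdown:", trySend(ctx, ch, "goodbye"))
}
```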
Reviewer: I don't like having/using buffered channels here; it signals our logic is broken somewhere. We should aim at fixing the real issue, not the symptom. However, if we agree that this is sufficient for now, at least add a TODO there and create an issue; otherwise we're creating technical debt that we will probably never deal with.
Author: This is changed back now; after fixing some underlying components, shutdown no longer blocks on these channels.