[Elastic Agent] Improve GRPC stop to be more relaxed. #20118
Pinging @elastic/ingest-management (Team:Ingest Management)
@@ -548,8 +549,10 @@ func (as *ApplicationState) Stop(timeout time.Duration) error {
s := as.status
doneChan := as.checkinDone
as.checkinLock.RUnlock()
if s == proto.StateObserved_STOPPING && doneChan == nil {
// sent stopping and now is disconnected (so its stopped)
if (wasConn && doneChan == nil) || (!wasConn && s == proto.StateObserved_STOPPING && doneChan == nil) { |
I'm not sure I follow this. If it was connected and doneChan is nil, that means it got disconnected; this seems OK.
The second part means that if the status of the application is stopping and doneChan is nil (it got disconnected), then we're destroying, but only in the case where doneChan was already nil before, so nothing changed.
Yes, the second case applies only when Stop() was called while the client was disconnected from the GRPC (which is very rare, but possible).
So if the client was disconnected from the GRPC at the time Stop() was called, it needs to know that the client did receive the stopping state. So it waits for the client to report that it is actually stopping and then to disconnect. This requires that the client actually reconnect to get the stopping message, or that the timeout occurs, whichever comes first.
In the normal case, the wasConn && doneChan == nil check will almost always be the one used in this loop.
Hopefully that explains it better.
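The two cases described above can be sketched as a standalone predicate. This is only an illustration of the condition in the diff, not the Agent's actual code: the State type, its constants, and the isStopped helper are hypothetical stand-ins for proto.StateObserved_STOPPING and the fields read inside Stop().

```go
package main

import "fmt"

// State is a hypothetical stand-in for the observed application status
// (proto.StateObserved_* in the real code).
type State int

const (
	StateRunning State = iota
	StateStopping
)

// isStopped is a hypothetical helper mirroring the condition in the diff:
//
//	(wasConn && doneChan == nil) ||
//	(!wasConn && s == STOPPING && doneChan == nil)
//
// wasConn records whether the client was connected when Stop() was called.
// A nil doneChan means the client is currently disconnected.
func isStopped(wasConn bool, status State, doneChan <-chan struct{}) bool {
	if doneChan != nil {
		// Client is still connected, so it has not stopped yet.
		return false
	}
	if wasConn {
		// Normal case: connected at Stop() time and now disconnected.
		return true
	}
	// Rare case: disconnected at Stop() time. The client must have
	// reconnected, reported STOPPING, and disconnected again.
	return status == StateStopping
}

func main() {
	connected := make(chan struct{})

	fmt.Println(isStopped(true, StateRunning, nil))        // true
	fmt.Println(isStopped(false, StateRunning, nil))       // false
	fmt.Println(isStopped(false, StateStopping, nil))      // true
	fmt.Println(isStopped(true, StateStopping, connected)) // false
}
```

In the second call the client was never seen connected and never reported stopping, so the caller would keep waiting until the timeout.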
makes sense thanks
* Improve stop to be more relaxed. * Add changelog. (cherry picked from commit 3811728)
…ne-2.0 * upstream/master: (41 commits) adding possibility to override content-type checks, it was breaking certain webhooks that is not able to set content-headers at all. Still defaults to application/json (elastic#20232) fix: use a fixed worker type for tests (elastic#20130) [Ingest Manager] Prepare packaging for endpoint and asc files (elastic#20186) [Packetbeat] HTTP: Improve support for 100-continue elastic#15830 (elastic#19349) Increase index.max_docvalue_fields_search to 200 (elastic#20218) [Ingest Manager] Prevent closing closed reader (elastic#20214) [Metricbeat] Use MySQL Host Parser in Query metricset (elastic#20191) Testing: Ignore timestamp from cylance/protect dataset (elastic#20211) [Filebeat] Ignore cylance.protect timestamps while testing (elastic#20207) [CI] remove codecov step (elastic#20102) [docs] Indicate that SYSTEM user is required on Windows to use Endpoint (elastic#20172) Remove f5/firepass rsa2elk fileset (elastic#20160) [Elastic Agent] Improve GRPC stop to be more relaxed. (elastic#20118) Fix fileset field prefixing (elastic#20170) Fix terminating pod autodiscover issue (elastic#20084) Call host parser only once when building light metricsets (elastic#20149) [CI] fix null string with contains (elastic#20182) [Ingest Manager] Fix failing unit tests on windows (elastic#20127) [Filebeat] Update crowdstrike module (elastic#20138) [docs] Add x-pack role to relevant metricsets (elastic#20167) ...
* Improve stop to be more relaxed. * Add changelog.
…elastic#20202) * Improve stop to be more relaxed. * Add changelog. (cherry picked from commit 3bbbb19)
What does this PR do?
It allows the GRPC client protocol to just disconnect when receiving the expected state of Stopping. If the client was connected and then disconnects, that is accepted as a valid signal that the application has stopped.
Why is it important?
Because sometimes the TCP connection will not be flushed on disconnect and the Agent will not get the Stopping message.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Related issues