periodically disconnect from acs #3586
That's rather desirable! Curious if we were able to verify this behavior? (Agent sent the close message, handled another payload message, and closed the connection upon receiving the close echo from ACS.)
By queuing up the close message, we don't delay processing it; rather, we make sure the agent first reads any other payload messages. As soon as the close message is read, the agent processes it and terminates/reconnects the connection. Makes sense?
Thank you for the context and the write up on the testing. Looks good.
Re-reviewing after the rebase and the added tests.
Summary
This PR introduces a connection time property to the ACS session object. With this change, the agent will periodically disconnect from ACS (every 15-45 minutes).
Currently we rely on ACS to periodically disconnect from the agent; the only exception is when the agent stops receiving heartbeats from ACS. Every 1-2 minutes or so, if the agent has not received any heartbeats and no activity has occurred, the agent closes its connection to ACS.
If an ACS host is unhealthy (i.e. it loses track of its active connections and no longer disconnects periodically) and we can no longer rely on heartbeats, the agent will be stuck with a stale connection. This may lead to task credentials not being refreshed, or to no task payload being served to this container instance. With this change, an unhealthy ACS host is no longer a single point of failure, and the agent controls its own fate.
Implementation details
connectionTimer is started when the ACS websocket connection is established.
When the timer expires, the session returns io.EOF back to the parent goroutine that started the ACS session. This is in line with how it returns an io.EOF when ACS initiates the connection close.
When the parent goroutine receives io.EOF, the websocket connection is established immediately without backoff.
Testing
New tests cover the changes: yes
Description for the changelog
Enhancement: periodically disconnect from ACS
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.