-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug where pod monitoring stops working after a while for Kubernetes agent #836
Conversation
This pull request has been linked to Shortcut Story #71688: Refactor job monitoring code to be long running. |
This pull request has been linked to Shortcut Story #73237: Pod monitoring seems to fail sometimes and afterwards it leaves the deployment hanging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, LGTM
@@ -33,6 +35,21 @@ public KubernetesPodMonitor(IKubernetesPodService podService, ISystemLog log) | |||
} | |||
|
|||
async Task IKubernetesPodMonitor.StartAsync(CancellationToken cancellationToken) | |||
{ | |||
var maxDurationSeconds = 70; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should this be a const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
[TestFixture] | ||
public class ExponentialBackoffFixture | ||
{ | ||
readonly Foo foo = new Foo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't touch that, its a load bearing Foo
😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha... I extracted that to a different class ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you need a SaST approval LGTM
Although I only looked at things outside of the k8s folder
[sc-73237]
Background
#824 starting using a Watcher to get updates for Pods. Unfortunately the Kubernetes API Server will close the connection after some time (we can't have the watch open forever).
Results
This PR adds a short timeout of 10min and reloads the status after that.
We also add a try/catch around the entire status loading so we don't silently lose track of Pod statuses.
Screenshot of Pod status loading at startup:
Status is reloaded after 10min:
How to review this PR
Quality ✔️
Pre-requisites