Tweak the timing of considering pod as failing to be more permissive #764
Codecov Report
@@ Coverage Diff @@
## master #764 +/- ##
==========================================
+ Coverage 42.59% 42.69% +0.09%
==========================================
Files 70 70
Lines 4029 4043 +14
==========================================
+ Hits 1716 1726 +10
- Misses 2208 2211 +3
- Partials 105 106 +1
@@ -261,7 +261,7 @@ func (a *Aggregator) IngestResults(ctx context.Context, resultsCh <-chan *plugin
  case result, more = <-resultsCh:
  }

- if !more {
+ if result == nil && !more {
This should always behave the same as before, but I wasn't sure whether you could ever, somehow, get `result, !more` instead of `nil, !more`. It's just an extra explicit check so that we process every result.
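For reference, a minimal Go sketch of the channel semantics behind this check. The `result` type and `drain` function are hypothetical stand-ins, not the aggregator's actual code. A value that was sent before `close` is still delivered with `more == true`; only after the channel is both closed and drained does a receive return the zero value (`nil` for a pointer) with `more == false`.

```go
package main

import "fmt"

// result is a hypothetical stand-in for the plugin result type in the diff.
type result struct{ name string }

// drain reads until the channel is closed and drained. It returns the names
// it saw and whether the final receive (the one with more == false) was nil.
func drain(ch <-chan *result) (seen []string, nilOnClose bool) {
	for {
		r, more := <-ch
		if !more {
			// A receive on a closed, empty channel yields the zero value.
			return seen, r == nil
		}
		seen = append(seen, r.name)
	}
}

func main() {
	ch := make(chan *result, 2)
	ch <- &result{name: "e2e"}
	ch <- &result{name: "systemd-logs"}
	close(ch) // buffered values are still delivered after close

	seen, nilOnClose := drain(ch)
	fmt.Println(seen, nilOnClose) // prints "[e2e systemd-logs] true"
}
```

Under these semantics the added `result == nil` condition should never change behavior, which matches the intent of it being an extra explicit check.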
// terminatedContainerWindow is the amount of time after a plugins main container terminates
// that we consider it a failure mode. This handles the situation where the plugin container
// exits without returning results.
terminatedContainerWindow = 5 * time.Minute
Increased fivefold; agreed with @zubron's comment offline that we should try to base this on some evidence/calculation rather than just my wild guesses. But FWIW I've successfully had results reported with an AWS cluster even with the 1m window, so I assume 5m is OK.
If we are too aggressive with the timeframe, the main plugin container may terminate and, while the sidecar is reporting the results and the aggregator is reading them, the aggregator may instead too quickly decide to consider the plugin a failure (because it hasn't seen the final results yet). This change just tries to extend some timing windows to give the aggregator more time to see these results. Lastly, it also adds logging into the worker when using retries. This helps us better track what is happening on the worker's side. Mitigates #759 Signed-off-by: John Schnake <jschnake@vmware.com>
Rebased; will wait for green tests again just to be safe.
What this PR does / why we need it:
If we are too aggressive with the timeframe, the main plugin container
may terminate and, while the sidecar is reporting the results and
the aggregator is reading them, the aggregator may instead too quickly
decide to consider the plugin a failure (because it hasn't seen the
final results yet).
This change just tries to extend some timing windows to give the
aggregator more time to see these results.
Lastly, it also adds logging into the worker when using retries. This
helps us better track what is happening on the worker's side.
Which issue(s) this PR fixes
Mitigates #759
Special notes for your reviewer:
Release note: