Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweak the timing of considering pod as failing to be more permissive #764

Merged
merged 1 commit into from
Jun 21, 2019
Merged

Tweak the timing of considering pod as failing to be more permissive #764

merged 1 commit into from
Jun 21, 2019

Conversation

johnSchnake
Copy link
Contributor

What this PR does / why we need it:
If we are too aggressive with the timeframe, the main plugin container
may terminate and, while the sidecar is reporting the results and
the aggregator is reading them, the aggregator may instead too quickly
decide to consider the plugin a failure (because it hasn't seen the
final results yet).

This change just tries to extend some timing windows to give the
aggregator more time to see these results.

Lastly, it also adds logging into the worker when using retries. This
helps us better track what is happening on the worker's side.

Which issue(s) this PR fixes
Mitigates #759

Special notes for your reviewer:

Release note:

Timeframe extended (from 1m to 5m) for the aggregator to see results from a plugin before considering it to have terminated without reporting results. Timeframe also extended (from 15s to 2m) for which the aggregator will wait for clients to retry sending results if there is a problem in the connection.

@johnSchnake johnSchnake requested a review from zubron June 20, 2019 16:50
@codecov-io
Copy link

Codecov Report

Merging #764 into master will increase coverage by 0.09%.
The diff coverage is 63.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #764      +/-   ##
==========================================
+ Coverage   42.59%   42.69%   +0.09%     
==========================================
  Files          70       70              
  Lines        4029     4043      +14     
==========================================
+ Hits         1716     1726      +10     
- Misses       2208     2211       +3     
- Partials      105      106       +1
Impacted Files Coverage Δ
pkg/plugin/driver/utils/utils.go 31.57% <ø> (ø) ⬆️
pkg/plugin/driver/daemonset/daemonset.go 48.73% <0%> (ø) ⬆️
pkg/plugin/aggregation/run.go 23.27% <100%> (ø) ⬆️
pkg/plugin/driver/job/job.go 56.17% <100%> (ø) ⬆️
pkg/plugin/aggregation/aggregator.go 78.86% <100%> (ø) ⬆️
pkg/worker/request.go 59.09% <60%> (+5.75%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e5103d...947cd2b. Read the comment docs.

@@ -261,7 +261,7 @@ func (a *Aggregator) IngestResults(ctx context.Context, resultsCh <-chan *plugin
case result, more = <-resultsCh:
}

if !more {
if result == nil && !more {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should always be the same as before, but wasn't sure if there was ever, somehow, that you'd get the result, !more instead of nil, !more. Just an extra clear check that we will process each result.

// terminatedContainerWindow is the amount of time after a plugins main container terminates
// that we consider it a failure mode. This handles the situation where the plugin container
// exits without returning results.
terminatedContainerWindow = 5 * time.Minute
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increased 500%; agreed with @zubron 's comment offline that we should try and base this on some evidence/calculation rather than just my wild guesses. But FWIW I've successfully hand results reported with an AWS cluster even with the 1m window so I assume 5m is OK.

If we are too aggressive with the timeframe, the main plugin container
may terminate and, while the sidecar is reporting the results and
the aggregator is reading them, the aggregator may instead too quickly
decide to consider the plugin a failure (because it hasn't seen the
final results yet).

This change just tries to extend some timing windows to give the
aggregator more time to see these results.

Lastly, it also adds logging into the worker when using retries. This
helps us better track what is happening on the worker's side.

Mitigates #759

Signed-off-by: John Schnake <jschnake@vmware.com>
@johnSchnake
Copy link
Contributor Author

Rebased; will wait for green tests again just to be safe.

@johnSchnake johnSchnake merged commit 41ce5d4 into vmware-tanzu:master Jun 21, 2019
@johnSchnake johnSchnake deleted the avoidTerminationReportingRace branch June 21, 2019 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants