Tweak the timing of considering pod as failing to be more permissive #764

johnSchnake · 2019-06-20T16:50:42Z

What this PR does / why we need it:
If we are too aggressive with the timeframe, the main plugin container
may terminate and, while the sidecar is reporting the results and
the aggregator is reading them, the aggregator may instead too quickly
decide to consider the plugin a failure (because it hasn't seen the
final results yet).

This change just tries to extend some timing windows to give the
aggregator more time to see these results.

Lastly, it also adds logging into the worker when using retries. This
helps us better track what is happening on the worker's side.

Which issue(s) this PR fixes
Mitigates #759

Special notes for your reviewer:

Release note:

Timeframe extended (from 1m to 5m) for the aggregator to see results from a plugin before considering it to have terminated without reporting results. Timeframe also extended (from 15s to 2m) for which the aggregator will wait for clients to retry sending results if there is a problem in the connection.

codecov-io · 2019-06-20T17:33:26Z

Codecov Report

Merging #764 into master will increase coverage by 0.09%.
The diff coverage is 63.15%.

@@            Coverage Diff             @@
##           master     #764      +/-   ##
==========================================
+ Coverage   42.59%   42.69%   +0.09%     
==========================================
  Files          70       70              
  Lines        4029     4043      +14     
==========================================
+ Hits         1716     1726      +10     
- Misses       2208     2211       +3     
- Partials      105      106       +1

Impacted Files	Coverage Δ
pkg/plugin/driver/utils/utils.go	`31.57% <ø> (ø)`	⬆️
pkg/plugin/driver/daemonset/daemonset.go	`48.73% <0%> (ø)`	⬆️
pkg/plugin/aggregation/run.go	`23.27% <100%> (ø)`	⬆️
pkg/plugin/driver/job/job.go	`56.17% <100%> (ø)`	⬆️
pkg/plugin/aggregation/aggregator.go	`78.86% <100%> (ø)`	⬆️
pkg/worker/request.go	`59.09% <60%> (+5.75%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e5103d...947cd2b. Read the comment docs.

johnSchnake · 2019-06-20T17:50:25Z

pkg/plugin/aggregation/aggregator.go

@@ -261,7 +261,7 @@ func (a *Aggregator) IngestResults(ctx context.Context, resultsCh <-chan *plugin
 		case result, more = <-resultsCh:
 		}

-		if !more {
+		if result == nil && !more {


Should always be the same as before, but wasn't sure if there was ever, somehow, that you'd get the result, !more instead of nil, !more. Just an extra clear check that we will process each result.

johnSchnake · 2019-06-20T17:53:19Z

pkg/plugin/driver/utils/utils.go

+	// terminatedContainerWindow is the amount of time after a plugins main container terminates
+	// that we consider it a failure mode. This handles the situation where the plugin container
+	// exits without returning results.
+	terminatedContainerWindow = 5 * time.Minute


Increased 500%; agreed with @zubron 's comment offline that we should try and base this on some evidence/calculation rather than just my wild guesses. But FWIW I've successfully hand results reported with an AWS cluster even with the 1m window so I assume 5m is OK.

If we are too aggressive with the timeframe, the main plugin container may terminate and, while the sidecar is reporting the results and the aggregator is reading them, the aggregator may instead too quickly decide to consider the plugin a failure (because it hasn't seen the final results yet). This change just tries to extend some timing windows to give the aggregator more time to see these results. Lastly, it also adds logging into the worker when using retries. This helps us better track what is happening on the worker's side. Mitigates #759 Signed-off-by: John Schnake <jschnake@vmware.com>

johnSchnake · 2019-06-21T14:52:52Z

Rebased; will wait for green tests again just to be safe.

johnSchnake requested a review from zubron June 20, 2019 16:50

johnSchnake commented Jun 20, 2019

View reviewed changes

zubron approved these changes Jun 20, 2019

View reviewed changes

johnSchnake merged commit 41ce5d4 into vmware-tanzu:master Jun 21, 2019

johnSchnake deleted the avoidTerminationReportingRace branch June 21, 2019 18:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tweak the timing of considering pod as failing to be more permissive #764

Tweak the timing of considering pod as failing to be more permissive #764

johnSchnake commented Jun 20, 2019

codecov-io commented Jun 20, 2019

johnSchnake Jun 20, 2019

johnSchnake Jun 20, 2019

johnSchnake commented Jun 21, 2019

Tweak the timing of considering pod as failing to be more permissive #764

Tweak the timing of considering pod as failing to be more permissive #764

Conversation

johnSchnake commented Jun 20, 2019

codecov-io commented Jun 20, 2019

Codecov Report

johnSchnake Jun 20, 2019

Choose a reason for hiding this comment

johnSchnake Jun 20, 2019

Choose a reason for hiding this comment

johnSchnake commented Jun 21, 2019