
Move backoff logic into own struct and increase test coverage of timeout logic #3031

Closed
wants to merge 2 commits

Conversation

bobcatfish
Collaborator

Changes

Separate backoff logic from timeout logic ⏰

The timeout handler was trying to do two different jobs:

  • Track timeouts for executing TaskRuns and PipelineRuns; when they
    time out, they should be re-reconciled, at which point they will
    fail b/c they took too long
  • Track backoffs when TaskRun pod creation fails due to resource quota
    issues. In this case we want to retry pod creation (by re-reconciling)
    after a backoff in case the resource issue has been resolved.

The logic for these two jobs was combined because at heart they share
very similar logic: creating goroutines that wait until a period of time
has passed.

Now these two jobs are done by two different structs, which both use the
new Timer struct and logic, so they can share that logic without having
to mix the code built on top of it (a rough sketch follows).
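
A minimal sketch of what that shared Timer could look like — illustrative only: aside from Timer, Backoff, Handler, and the existing StatusKey/GetRunKey, the names, fields, and signatures here are assumptions, not this PR's actual code:

```go
package timeout

import (
	"sync"
	"time"
)

// StatusKey matches the interface already used by the timeout code:
// anything that can return a key identifying a Run.
type StatusKey interface {
	GetRunKey() string
}

// Timer holds the shared "wait for a duration, a completion signal, or a
// stop signal" logic that both the timeout Handler and the Backoff tracker
// can build on. Field and method names here are assumptions.
type Timer struct {
	stopCh <-chan struct{}          // closed when the controller shuts down
	mu     sync.Mutex
	done   map[string]chan struct{} // per-Run completion signals
}

func NewTimer(stopCh <-chan struct{}) *Timer {
	return &Timer{stopCh: stopCh, done: map[string]chan struct{}{}}
}

// Release signals that a Run has completed so any pending wait returns early.
func (t *Timer) Release(runObj StatusKey) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if ch, ok := t.done[runObj.GetRunKey()]; ok {
		close(ch)
		delete(t.done, runObj.GetRunKey())
	}
}

// Wait blocks until d elapses, the Run completes, or the controller stops,
// and reports whether the full duration elapsed (i.e. the caller should
// re-reconcile the Run).
func (t *Timer) Wait(runObj StatusKey, d time.Duration) bool {
	t.mu.Lock()
	ch, ok := t.done[runObj.GetRunKey()]
	if !ok {
		ch = make(chan struct{})
		t.done[runObj.GetRunKey()] = ch
	}
	t.mu.Unlock()

	select {
	case <-time.After(d):
		return true // timeout or backoff elapsed
	case <-ch:
		return false // the Run finished first
	case <-t.stopCh:
		return false // shutting down
	}
}
```

The timeout Handler would then call Wait with a Run's remaining timeout, while Backoff would call it with a computed backoff duration, and neither needs to know about the other's bookkeeping.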

Future things we can do to make this even better:

  • Add more test coverage for the new structs
  • The "Handler" struct (which may have too generic a name at this point)
    probably doesn't need to have separate logic for pipelineruns +
    taskruns, since each of these controllers will instantiate it
    separately
  • As discussed in #2905 ("The timeoutHandler is only instructed to wait
    when it creates pods"), knative/pkg's controller logic will attempt
    to re-reconcile all the Runs on startup, so we probably don't need
    to be doing this explicitly as well

Add more timeout test coverage and fix timeout "bug" 🐞

Add more tests to increase coverage of timeout package. This means
updating some code to be more unit testable, specifically injecting the
function used to determine the current time into Backoff.

This revealed a "bug" where the backoff logic wasn't using a default
value when the TaskRun timeout isn't set - however the defaulting logic
ensures that it is always set, so if anything, maybe we should actually
panic in this case.
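
As an illustration of the testability change, injecting the current-time function could look something like this — a sketch under assumptions: the real Backoff has more state than this, and the exact backoff formula here is made up for the example:

```go
package timeout

import (
	"math/rand"
	"time"
)

// Backoff tracks pod-creation retries. The clock is injected so unit tests
// can pin "now" to a fixed value; the field names are assumptions.
type Backoff struct {
	now      func() time.Time // defaults to time.Now, overridden in tests
	attempts map[string]uint
}

func NewBackoff(now func() time.Time) *Backoff {
	if now == nil {
		now = time.Now
	}
	return &Backoff{now: now, attempts: map[string]uint{}}
}

// nextDeadline computes when the next retry should fire, capped so it never
// lands after the TaskRun's own timeout (measured from its start time). If
// timeout is zero the cap should fall back to a default -- the "bug" noted
// above -- though the defaulting logic means that should never happen.
func (b *Backoff) nextDeadline(key string, startTime time.Time, timeout time.Duration) time.Time {
	b.attempts[key]++
	backoff := time.Duration(1<<b.attempts[key])*time.Second +
		time.Duration(rand.Int63n(int64(time.Second))) // jitter

	deadline := b.now().Add(backoff)
	if latest := startTime.Add(timeout); timeout > 0 && deadline.After(latest) {
		return latest
	}
	return deadline
}
```

A test can then construct the struct with a pinned clock, e.g. NewBackoff(func() time.Time { return fixedTime }), and assert on exact deadlines instead of sleeping.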

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • [n/a] Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.


Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

It almost seems like this is fixing a bug, but because of how the defaulting logic works, I don't think it really is.

NONE

@tekton-robot added the release-note-none label (Denotes a PR that doesn't merit a release note.) on Jul 29, 2020
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign bobcatfish
You can assign the PR to them by writing /assign @bobcatfish in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) on Jul 29, 2020
@bobcatfish added the kind/cleanup label (Categorizes issue or PR as related to cleaning up code, process, or technical debt.) on Jul 29, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/reconciler/taskrun/controller.go | 96.6% | 96.7% | 0.1 |
| pkg/reconciler/taskrun/taskrun.go | 78.3% | 78.4% | 0.1 |
| pkg/timeout/backoff.go | Do not exist | 93.9% | |
| pkg/timeout/handler.go | 77.4% | 84.0% | 6.6 |
| pkg/timeout/timer.go | Do not exist | 100.0% | |

@bobcatfish
Collaborator Author

oooo a data race!

==================
WARNING: DATA RACE
Read at 0x00c000c3c198 by goroutine 103:
  reflect.typedmemmove()
      /usr/local/go/src/runtime/mbarrier.go:177 +0x0
  reflect.packEface()
      /usr/local/go/src/reflect/value.go:119 +0x126
  reflect.valueInterface()
      /usr/local/go/src/reflect/value.go:1023 +0x1b9
  reflect.Value.Interface()
      /usr/local/go/src/reflect/value.go:993 +0x3c27
  fmt.(*pp).printValue()
      /usr/local/go/src/fmt/print.go:726 +0x3c28
  fmt.(*pp).printValue()
      /usr/local/go/src/fmt/print.go:880 +0x2709
  fmt.(*pp).printArg()
      /usr/local/go/src/fmt/print.go:716 +0x25a
  fmt.(*pp).doPrintf()
      /usr/local/go/src/fmt/print.go:1030 +0x311
  fmt.Sprintf()
      /usr/local/go/src/fmt/print.go:219 +0x73
  go.uber.org/zap.(*SugaredLogger).log()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/go.uber.org/zap/sugar.go:230 +0x175
  go.uber.org/zap.(*SugaredLogger).Errorf()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/go.uber.org/zap/sugar.go:148 +0x331
  github.com/tektoncd/pipeline/pkg/timeout.(*Backoff).SetTimer()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/timeout/backoff.go:142 +0x29a

Previous write at 0x00c000c3c198 by goroutine 79:
  knative.dev/pkg/apis/duck/v1beta1.(*Status).SetConditions()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/apis/duck/v1beta1/status_types.go:84 +0x43
  github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1.(*TaskRunStatus).SetConditions()
      <autogenerated>:1 +0x61
  knative.dev/pkg/apis.conditionsImpl.SetCondition()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/apis/condition_set.go:208 +0x3ee
  knative.dev/pkg/apis.(*conditionsImpl).SetCondition()
      <autogenerated>:1 +0x149
  github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1.(*TaskRunStatus).SetCondition()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1/taskrun_types.go:253 +0x43c
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).handlePodCreationError()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:470 +0x144
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestHandlePodCreationError.func1()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:2100 +0xd5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:992 +0x1eb

Goroutine 103 (running) created at:
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).handlePodCreationError()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:467 +0x52a
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestHandlePodCreationError.func1()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:2100 +0xd5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:992 +0x1eb

Goroutine 79 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1043 +0x660
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestHandlePodCreationError()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:2099 +0x12bd
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:992 +0x1eb

@bobcatfish
Collaborator Author

I think this race condition might be another instance of #1307 🤔

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/reconciler/taskrun/controller.go | 96.6% | 96.7% | 0.1 |
| pkg/reconciler/taskrun/taskrun.go | 78.3% | 78.4% | 0.1 |
| pkg/timeout/backoff.go | Do not exist | 93.9% | |
| pkg/timeout/handler.go | 77.4% | 84.0% | 6.6 |
| pkg/timeout/timer.go | Do not exist | 100.0% | |

@tekton-robot
Collaborator

tekton-robot commented Jul 30, 2020

@bobcatfish: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-tekton-pipeline-unit-tests | 7da38ae | link | /test tekton-pipeline-unit-tests |
| pull-tekton-pipeline-build-tests | 7da38ae | link | /test pull-tekton-pipeline-build-tests |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

// 1. Stop signal, 2. completion or 3. a given Duration to elapse.
func (b *Backoff) SetTimer(runObj StatusKey, d time.Duration) {
	if b.timeoutCallback == nil {
		b.logger.Errorf("attempted to set a timer for %q but no callback has been assigned", runObj)
Member

Sounds very odd (at least to me 😉), but commenting out this logger from here fixes the race condition. This function invocation within handlePodCreationError and the following SetCondition are somehow causing this.

Member

WARNING: DATA RACE
Read at 0x00c000c3c198 by goroutine 103:
...
go.uber.org/zap.(*SugaredLogger).log()
...

Previous write at 0x00c000c3c198 by goroutine 79:
  knative.dev/pkg/apis/duck/v1beta1.(*Status).SetConditions()
...

@ghost · Jul 31, 2020

Ah, well spotted! It looks like swapping out the %q for a %p in this log statement also removes the error. Using %q results in a string-formatted struct being printed, which probably walks the entire object and accesses all the fields. My hunch is that the %q formatter is not threadsafe in that case? In hindsight this may be why we have the StatusKey interface - the GetRunKey() func of TaskRun returns a thread-safe descriptor.
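
For reference, the kind of change being discussed would look roughly like this — a sketch only, not the committed fix:

```go
// Racy: %q formats the whole TaskRun, so fmt reflects over every field --
// including Status, which another goroutine may be writing via SetConditions.
b.logger.Errorf("attempted to set a timer for %q but no callback has been assigned", runObj)

// Safer: only format the key returned by GetRunKey(), which doesn't require
// walking the object.
b.logger.Errorf("attempted to set a timer for %s but no callback has been assigned", runObj.GetRunKey())
```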

@pritidesai (Member) · Jul 31, 2020

nice catch, yup it does walk the entire object. The next log statement (using GetRunKey() on runObj) looks safe and just prints the address of the taskRun object instead:

"TaskRun/0xc00015af00"

@bobcatfish
Collaborator Author

While trying to figure this out I think I found a bigger problem, so I'm going to close this PR for now b/c I'll need to rebase it anyway.

My hunch is that the %q formatter is not threadsafe in that case? In hindsight this may be why we have the StatusKey interface - the GetRunKey() func of TaskRun returns a thread-safe descriptor.

In fact I think this is a bug that I introduced - the address of the Run might be thread safe, but it's not going to be the same on every reconcile! So I think that actually every single Run is timing out because the done channel can never be signaled 😅
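
A self-contained illustration of the problem described above (the TaskRun type here is a minimal hypothetical stand-in, not the real one in pkg/apis/pipeline/v1beta1):

```go
package main

import "fmt"

// Stand-in for a TaskRun, just to show the keying problem.
type TaskRun struct {
	Namespace, Name string
}

// A pointer-based key, matching output like "TaskRun/0xc00015af00".
func (tr *TaskRun) GetRunKey() string {
	return fmt.Sprintf("TaskRun/%p", tr)
}

// A stable alternative keyed on namespace/name.
func stableKey(tr *TaskRun) string {
	return fmt.Sprintf("TaskRun/%s/%s", tr.Namespace, tr.Name)
}

func main() {
	// Each reconcile works on a fresh copy of the same TaskRun from the
	// informer cache, so pointer-based keys never match across reconciles
	// and the done channel registered under the first key is never signaled.
	first := &TaskRun{Namespace: "default", Name: "my-run"}
	second := &TaskRun{Namespace: "default", Name: "my-run"}

	fmt.Println(first.GetRunKey() == second.GetRunKey()) // false
	fmt.Println(stableKey(first) == stableKey(second))   // true
}
```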
