
Actually stop trying to time out finished Runs ⏰ #3078

Merged
1 commit merged into tektoncd:master on Sep 9, 2020

Conversation

bobcatfish
Collaborator

Changes

In 10b6427 I got really enthusiastic about making sure even our
reads were threadsafe, so I thought I would be clever and,
instead of accessing attributes of a PipelineRun or TaskRun in
a goroutine, use a value that wouldn't change - specifically the address.

But the address will change between reconcile loops, because the
reconcile logic creates a new instance of the Run object every time!
🤦‍♀️

Fortunately this doesn't cause any serious problems, it just makes
things slightly less efficient: for every Run you start, a goroutine
will remain open until the timeout occurs, and when it fires, the Run will be
reconciled an extra time, even if it has already completed. (In fact, keeping
that behavior and dropping the "done" map might be a
reasonable option!)
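Roughly, the wait pattern being described - a simplified sketch, not the actual pkg/timeout/handler.go code: each running Run gets a goroutine that waits on either the timeout firing or that Run's done channel closing.

    package sketch

    import "time"

    // waitForTimeout is a simplified stand-in for the per-Run wait described
    // above; the real handler keeps the done channels in a map and takes a
    // callback for re-enqueueing the Run.
    func waitForTimeout(timeout time.Duration, done <-chan struct{}, onTimeout func()) {
        select {
        case <-time.After(timeout):
            // The timeout fired first: invoke the callback, which re-enqueues
            // the Run so a reconcile can mark it as timed out.
            onTimeout()
        case <-done:
            // The Run finished first: closing its done channel lets this
            // goroutine exit early instead of lingering until the timeout
            // (which is the waste this PR removes).
        }
    }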

With this change, we now return to using the namespace + name as a key
in the map that tracks the done channels; we pass these by value so that
reads will be threadsafe.
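A minimal, self-contained sketch of why the pointer-based key breaks across reconciles while a namespace + name key keeps working (the struct and map here are simplified stand-ins, keyed by a plain "namespace/name" string rather than whatever type the real handler uses):

    package main

    import "fmt"

    // run stands in for a TaskRun/PipelineRun; only the identifying fields matter.
    type run struct {
        namespace, name string
    }

    // Old approach: a key derived from the object's address, stable only for
    // the lifetime of that one in-memory copy.
    func pointerKey(r *run) string { return fmt.Sprintf("TaskRun/%p", r) }

    // New approach: a key derived from values that identify the Run across
    // reconcile loops.
    func namespacedKey(r *run) string { return r.namespace + "/" + r.name }

    func main() {
        done := map[string]chan struct{}{}

        first := &run{namespace: "default", name: "my-run"}
        done[namespacedKey(first)] = make(chan struct{})

        // A later reconcile hands the handler a fresh copy of the same Run...
        second := &run{namespace: "default", name: "my-run"}

        // ...so the pointer keys no longer match, but the namespaced key does.
        fmt.Println(pointerKey(first) == pointerKey(second)) // false: different addresses
        _, ok := done[namespacedKey(second)]
        fmt.Println(ok) // true: the done channel is found and can be closed
    }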

Instead of fixing this separately for the TaskRun and PipelineRun
functions, I've collapsed these and the callback into one. Each handler
instantiates its own timeout handler so there is no reason for the
timeout handler to have special knowledge of one vs the other.

Fixes #3047

Test

I tried several different approaches to add a test case that would
reveal the underlying problem but I now feel like it's more hassle than
it's worth. Approaches:

  1. Instantiate the controller in the reconciler tests with a custom
    timeout handler that has been overridden to use a custom logger,
    so we can check for the log indicating the timeout handler completed
  2. Similar to (1), but instead of checking logs, just pass in a custom
    done channel and wait for it to close

Both 1 + 2 require changing the way that NewController works, i.e. the
way we always instantiate controllers. I tried working around this by
taking the same approach as TestHandlePodCreationError and
instantiating my own Reconciler, but it a) wasn't instantiated properly
no matter what I tried (trying to use it caused panics) and b) had a
confusingly different interface, exposing ReconcileKind instead of
Reconcile.

I tried some other approaches but these went nowhere either; I don't
think it's worth adding a test to cover this, but if folks feel strongly
I don't mind opening an issue at least to continue to explore it? I feel
that this bug is one that is very specific to the implementation and I'm
not sure how valuable a test that covers it would be. If we do pursue
it, we might want to do it at the level of an end to end test that
actually checks the logs from a real running controller.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • [n/a] Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.


Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

When a TaskRun or PipelineRun completes, the goroutine waiting for it to time out will now stop (as it was designed to do!) instead of always re-reconciling the Run once the timeout fires.

@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 7, 2020
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 7, 2020
@bobcatfish bobcatfish added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/timeout/handler.go 77.4% 76.1% -1.3

@bobcatfish
Collaborator Author

[image]

coverage dropped a bit here but note I'm next planning to resurrect #3031 and significantly increase coverage!

@bobcatfish
Collaborator Author

pkg/timeout/handler.go:40:2: `defaultFunc` is unused (deadcode)
	defaultFunc        = func(i interface{}) {}

oooo nice thanks linter 🙏

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/timeout/handler.go 77.4% 76.1% -1.3

@bobcatfish
Collaborator Author

[image]

well that seems apt :D

@bobcatfish
Collaborator Author

Weird that the v1alpha1 test failed and v1beta1 succeeded 🤔

Member

@vdemeester vdemeester left a comment


One comment on the key used, otherwise sgtm

Member

@vdemeester vdemeester left a comment


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 17, 2020
@bobcatfish
Collaborator Author

I don't want to introduce new flakiness, but I do want to see if this happens consistently; I haven't been able to repro yet:

/test pull-tekton-pipeline-integration-tests

@bobcatfish
Collaborator Author

I looked into that test failure more - I'm pretty disturbed that a test directly related to the changes I'm making just happened to flake, but it kinda seems like that's what happened.

The test that failed runs a pipeline with one task that is expected to succeed and one that is expected to time out; the task that failed was actually the one that was supposed to succeed, with:

        ---
        apiVersion: tekton.dev/v1alpha1
        kind: TaskRun
        metadata:
          annotations:
            pipeline.tekton.dev/release: devel
          creationTimestamp: "2020-08-07T21:38:02Z"
          generation: 1
          labels:
            app.kubernetes.io/managed-by: tekton-pipelines
            tekton.dev/pipeline: pipelinetasktimeout
            tekton.dev/pipelineRun: prtasktimeout
            tekton.dev/pipelineTask: pipelinetask1
            tekton.dev/task: success
          name: prtasktimeout-pipelinetask1-ljfbw
          namespace: arendelle-fml4b
          ownerReferences:
          - apiVersion: tekton.dev/v1beta1
            blockOwnerDeletion: true
            controller: true
            kind: PipelineRun
            name: prtasktimeout
            uid: a6f55b45-9020-49aa-8129-cc3ce44fc9e9
          resourceVersion: "4998"
          selfLink: /apis/tekton.dev/v1alpha1/namespaces/arendelle-fml4b/taskruns/prtasktimeout-pipelinetask1-ljfbw
          uid: 237e86b7-d275-4a73-953b-81ada233c195
        spec:
          resources: {}
          serviceAccountName: ""
          taskRef:
            kind: Task
            name: success
          timeout: 1m0s
        status:
          completionTime: "2020-08-07T21:39:22Z"
          conditions:
          - lastTransitionTime: "2020-08-07T21:39:22Z"
            message: TaskRun "prtasktimeout-pipelinetask1-ljfbw" failed to finish within "1m0s"
            reason: TaskRunTimeout
            status: "False"
            type: Succeeded
          podName: prtasktimeout-pipelinetask1-ljfbw-pod-nxbns
          startTime: "2020-08-07T21:38:10Z"
          steps:
          - container: step-unnamed-0
            name: unnamed-0
            waiting:
              reason: PodInitializing
          taskSpec:
            steps:
            - args:
              - 1s
              command:
              - sleep
              image: busybox
              name: ""
              resources: {}

The task is supposed to sleep for 1s (with a 1 min timeout) and then succeed, but the pod was stuck in "PodInitializing" for reasons now lost in the sands of time.

So it doesn't SEEM like this is a real problem 🤞 🤞 🤞

@bobcatfish
Collaborator Author

/test pull-tekton-pipeline-integration-tests

@vdemeester vdemeester added kind/bug Categorizes issue or PR as related to a bug. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 8, 2020
@bobcatfish
Collaborator Author

I'm pretty disturbed that a test directly related to the changes I'm making just happened to flake, but it kinda seems like that's what happened.

Thinking about it further, I think if an e2e test is gonna flake, the probability that a timeout-related test would flake might actually be pretty high - there are a couple of them, run for both v1alpha1 and v1beta1.

Member

@imjasonh imjasonh left a comment


I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter to enqueue a reconciliation in the future, when the timeout should have elapsed. But this fix lgtm even without that cleanup.
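For reference, a minimal sketch of what that could look like, assuming knative's (*controller.Impl).EnqueueAfter and a reconciler that holds the Impl; the field and helper names here are hypothetical, and this is not what the PR implements:

    package taskrun

    import (
        "context"
        "time"

        "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
        "knative.dev/pkg/controller"
    )

    // Reconciler is a cut-down stand-in for the real TaskRun reconciler; only
    // the field needed for this sketch is shown.
    type Reconciler struct {
        impl *controller.Impl
    }

    func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1beta1.TaskRun) error {
        if tr.IsDone() || tr.Status.StartTime == nil || tr.Spec.Timeout == nil {
            return nil // nothing for this sketch to do: finished, not started, or no explicit timeout
        }
        elapsed := time.Since(tr.Status.StartTime.Time)
        if remaining := tr.Spec.Timeout.Duration - elapsed; remaining > 0 {
            // Ask the work queue to hand this TaskRun back once the timeout
            // should have elapsed: no per-Run goroutine or "done" map, at the
            // cost of one extra reconcile per Run.
            c.impl.EnqueueAfter(tr, remaining)
            return nil
        }
        return markTaskRunTimedOut(ctx, tr)
    }

    // markTaskRunTimedOut is a hypothetical stand-in for the logic that sets the
    // TaskRun's Succeeded condition to False with reason TaskRunTimeout.
    func markTaskRunTimedOut(ctx context.Context, tr *v1beta1.TaskRun) error {
        return nil
    }

The tradeoff bobcatfish notes further down still applies: this reintroduces exactly one extra reconcile per Run.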

Code under review:

    if pr.GetRunKey() != expectedKey {
        t.Fatalf("Expected taskrun key to be %s but got %s", expectedKey, pr.GetRunKey())
    func TestGetNamespacedName(t *testing.T) {
        pr := tb.PipelineRun("prunname", tb.PipelineRunNamespace("foo"))
Member


pr := &v1beta1.PipelineRun{
  ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "prunname"},
}

Please 🙏

Collaborator Author


kk, I can update as per #3178

thanks @imjasonh !

Code under review:

    expectedKey := fmt.Sprintf("TaskRun/%p", tr)
    if tr.GetRunKey() != expectedKey {
        t.Fatalf("Expected taskrun key to be %s but got %s", expectedKey, tr.GetRunKey())
    tr := tb.TaskRun("trunname", tb.TaskRunNamespace("foo"))
Member


tr := &v1beta1.TaskRun{
  ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "trunname"},
}

🙏

@bobcatfish
Collaborator Author

I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter to enqueue a reconciliation in the future when the timeout should be elapsed.

iiiiinteresting - I can try that out as part of #2905. I guess the only downside is that we get 1 extra reconcile for every Run? (which, ironically, is the current state that this PR is undoing) Seems worth it to me tho!

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/timeout/handler.go 77.4% 76.1% -1.3

Member

@imjasonh imjasonh left a comment


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2020
@imjasonh
Member

imjasonh commented Sep 9, 2020

/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ImJasonH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2020
@tekton-robot tekton-robot merged commit ee41ce3 into tektoncd:master Sep 9, 2020