
Actually stop trying to time out finished Runs ⏰ #3078

Merged
1 commit merged into tektoncd:master on Sep 9, 2020

Conversation

bobcatfish
Collaborator

Changes

In 10b6427 I got really enthusiastic about making sure even our
reads were threadsafe, so I thought I would be clever and,
instead of accessing attributes of a PipelineRun or TaskRun in
a goroutine, use a value that wouldn't change - specifically the address.

But the address will change between reconcile loops, because the
reconcile logic creates a new instance of the Run object every time!
🤦‍♀️

Fortunately this doesn't cause any serious problems, it just makes
things slightly less efficient: for every Run you start, a goroutine
will remain open until the timeout occurs, and when it fires, the Run will be
reconciled an extra time, even if it has already completed. (In fact, keeping
that behavior and dropping the "done" map might be a
reasonable option!)
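Roughly, the wait pattern being described - a simplified sketch, not the actual pkg/timeout/handler.go code: each running Run gets a goroutine that waits on either the timeout firing or that Run's done channel closing.

    package sketch

    import "time"

    // waitForTimeout is a simplified stand-in for the per-Run wait described
    // above; the real handler keeps the done channels in a map and takes a
    // callback for re-enqueueing the Run.
    func waitForTimeout(timeout time.Duration, done <-chan struct{}, onTimeout func()) {
        select {
        case <-time.After(timeout):
            // The timeout fired first: invoke the callback, which re-enqueues
            // the Run so a reconcile can mark it as timed out.
            onTimeout()
        case <-done:
            // The Run finished first: closing its done channel lets this
            // goroutine exit early instead of lingering until the timeout
            // (which is the waste this PR removes).
        }
    }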

With this change, we now return to using the namespace + name as a key
in the map that tracks the done channels; we pass these by value so that
reads will be threadsafe.
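A minimal, self-contained sketch of why the pointer-based key breaks across reconciles while a namespace + name key keeps working (the struct and map here are simplified stand-ins, keyed by a plain "namespace/name" string rather than whatever type the real handler uses):

    package main

    import "fmt"

    // run stands in for a TaskRun/PipelineRun; only the identifying fields matter.
    type run struct {
        namespace, name string
    }

    // Old approach: a key derived from the object's address, stable only for
    // the lifetime of that one in-memory copy.
    func pointerKey(r *run) string { return fmt.Sprintf("TaskRun/%p", r) }

    // New approach: a key derived from values that identify the Run across
    // reconcile loops.
    func namespacedKey(r *run) string { return r.namespace + "/" + r.name }

    func main() {
        done := map[string]chan struct{}{}

        first := &run{namespace: "default", name: "my-run"}
        done[namespacedKey(first)] = make(chan struct{})

        // A later reconcile hands the handler a fresh copy of the same Run...
        second := &run{namespace: "default", name: "my-run"}

        // ...so the pointer keys no longer match, but the namespaced key does.
        fmt.Println(pointerKey(first) == pointerKey(second)) // false: different addresses
        _, ok := done[namespacedKey(second)]
        fmt.Println(ok) // true: the done channel is found and can be closed
    }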

Instead of fixing this separately for the TaskRun and PipelineRun
functions, I've collapsed these and the callback into one. Each handler
instantiates its own timeout handler so there is no reason for the
timeout handler to have special knowledge of one vs the other.

Fixes #3047

Test

I tried several different approaches to add a test case that would
reveal the underlying problem but I now feel like it's more hassle than
it's worth. Approaches:

  1. Instantiate the controller in the reconciler tests with a custom
    timeout handler that has been overridden to use a custom logger,
    so we can check for the log indicating the timeout handler completed
  2. Similar to (1), but instead of checking logs, just pass in a custom
    done channel and wait for it to close

Both 1 + 2 require changing the way that NewController works, i.e. the
way we always instantiate controllers. I tried working around this by
taking the same approach as TestHandlePodCreationError and
instantiating my own Reconciler, but it a) wasn't instantiated properly
no matter what I tried (trying to use it caused panics) and b) had a
confusingly different interface, exposing ReconcileKind instead of
Reconcile.

I tried some other approaches but these went nowhere either; I don't
think it's worth adding a test to cover this, but if folks feel strongly
I don't mind opening an issue at least to continue to explore it? I feel
that this bug is one that is very specific to the implementation and I'm
not sure how valuable a test that covers it would be. If we do pursue
it, we might want to do it at the level of an end to end test that
actually checks the logs from a real running controller.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes tests (if functionality changed/added)
  • [n/a] Includes docs (if user facing)
  • Commit messages follow commit message best practices
  • Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.


Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

When a TaskRun or PipelineRun completes, the goroutine waiting for it to time out will now stop (as it was designed to do!) instead of always re-reconciling the Run once the timeout fires.

@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 7, 2020
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 7, 2020
@bobcatfish bobcatfish added the kind/bug Categorizes issue or PR as related to a bug. label Aug 7, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/timeout/handler.go 77.4% 76.1% -1.3

@bobcatfish
Collaborator Author

[image]

coverage dropped a bit here but note I'm next planning to resurrect #3031 and significantly increase coverage!

@bobcatfish
Collaborator Author

pkg/timeout/handler.go:40:2: `defaultFunc` is unused (deadcode)
	defaultFunc        = func(i interface{}) {}

oooo nice thanks linter 🙏

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/timeout/handler.go 77.4% 76.1% -1.3

@bobcatfish
Collaborator Author

[image]

well that seems apt :D

@bobcatfish
Collaborator Author

Weird that the v1alpha1 test failed and v1beta1 succeeded 🤔

Member

@vdemeester vdemeester left a comment


One comment on the key used, otherwise sgtm

Member

@vdemeester vdemeester left a comment


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 17, 2020
@bobcatfish
Collaborator Author

I don't want to introduce new flakiness, but I do want to see if this happens consistently; I haven't been able to repro yet:

/test pull-tekton-pipeline-integration-tests

@bobcatfish
Collaborator Author

I looked into that test failure more - I'm pretty disturbed that a test directly related to the changes I'm making just happened to flake, but it kinda seems like that's what happened.

The test that failed runs a pipeline with one task that is expected to succeed and one that is expected to time out; the task that failed was actually the one that was supposed to succeed, with:

        ---
        apiVersion: tekton.dev/v1alpha1
        kind: TaskRun
        metadata:
          annotations:
            pipeline.tekton.dev/release: devel
          creationTimestamp: "2020-08-07T21:38:02Z"
          generation: 1
          labels:
            app.kubernetes.io/managed-by: tekton-pipelines
            tekton.dev/pipeline: pipelinetasktimeout
            tekton.dev/pipelineRun: prtasktimeout
            tekton.dev/pipelineTask: pipelinetask1
            tekton.dev/task: success
          name: prtasktimeout-pipelinetask1-ljfbw
          namespace: arendelle-fml4b
          ownerReferences:
          - apiVersion: tekton.dev/v1beta1
            blockOwnerDeletion: true
            controller: true
            kind: PipelineRun
            name: prtasktimeout
            uid: a6f55b45-9020-49aa-8129-cc3ce44fc9e9
          resourceVersion: "4998"
          selfLink: /apis/tekton.dev/v1alpha1/namespaces/arendelle-fml4b/taskruns/prtasktimeout-pipelinetask1-ljfbw
          uid: 237e86b7-d275-4a73-953b-81ada233c195
        spec:
          resources: {}
          serviceAccountName: ""
          taskRef:
            kind: Task
            name: success
          timeout: 1m0s
        status:
          completionTime: "2020-08-07T21:39:22Z"
          conditions:
          - lastTransitionTime: "2020-08-07T21:39:22Z"
            message: TaskRun "prtasktimeout-pipelinetask1-ljfbw" failed to finish within "1m0s"
            reason: TaskRunTimeout
            status: "False"
            type: Succeeded
          podName: prtasktimeout-pipelinetask1-ljfbw-pod-nxbns
          startTime: "2020-08-07T21:38:10Z"
          steps:
          - container: step-unnamed-0
            name: unnamed-0
            waiting:
              reason: PodInitializing
          taskSpec:
            steps:
            - args:
              - 1s
              command:
              - sleep
              image: busybox
              name: ""
              resources: {}

The task is supposed to sleep for 1s (with a 1 min timeout) and then succeed, but the pod was stuck in "PodInitializing" for reasons now lost in the sands of time.

So it doesn't SEEM like this is a real problem 🤞 🤞 🤞

@bobcatfish
Collaborator Author

/test pull-tekton-pipeline-integration-tests

@vdemeester vdemeester added kind/bug Categorizes issue or PR as related to a bug. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 8, 2020
@bobcatfish
Collaborator Author

I'm pretty disturbed that a test directly related to the changes I'm making just happened to flake, but it kinda seems like that's what happened.

Thinking about it further, I think if an e2e test is gonna flake, the probability that a timeout-related test would flake might actually be pretty high - there are a couple of them, run for both v1alpha1 and v1beta1.

Member

@imjasonh imjasonh left a comment


I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter to enqueue a reconciliation in the future, when the timeout should have elapsed. But this fix lgtm even without that cleanup.
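For reference, a minimal sketch of what that could look like, assuming knative's (*controller.Impl).EnqueueAfter and a reconciler that holds the Impl; the field and helper names here are hypothetical, and this is not what the PR implements:

    package taskrun

    import (
        "context"
        "time"

        "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
        "knative.dev/pkg/controller"
    )

    // Reconciler is a cut-down stand-in for the real TaskRun reconciler; only
    // the field needed for this sketch is shown.
    type Reconciler struct {
        impl *controller.Impl
    }

    func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1beta1.TaskRun) error {
        if tr.IsDone() || tr.Status.StartTime == nil || tr.Spec.Timeout == nil {
            return nil // nothing for this sketch to do: finished, not started, or no explicit timeout
        }
        elapsed := time.Since(tr.Status.StartTime.Time)
        if remaining := tr.Spec.Timeout.Duration - elapsed; remaining > 0 {
            // Ask the work queue to hand this TaskRun back once the timeout
            // should have elapsed: no per-Run goroutine or "done" map, at the
            // cost of one extra reconcile per Run.
            c.impl.EnqueueAfter(tr, remaining)
            return nil
        }
        return markTaskRunTimedOut(ctx, tr)
    }

    // markTaskRunTimedOut is a hypothetical stand-in for the logic that sets the
    // TaskRun's Succeeded condition to False with reason TaskRunTimeout.
    func markTaskRunTimedOut(ctx context.Context, tr *v1beta1.TaskRun) error {
        return nil
    }

The tradeoff bobcatfish notes further down still applies: this reintroduces exactly one extra reconcile per Run.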

Code under review:

    if pr.GetRunKey() != expectedKey {
        t.Fatalf("Expected taskrun key to be %s but got %s", expectedKey, pr.GetRunKey())
    func TestGetNamespacedName(t *testing.T) {
        pr := tb.PipelineRun("prunname", tb.PipelineRunNamespace("foo"))
Member


pr := &v1beta1.PipelineRun{
  ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "prunname"},
}

Please 🙏

Collaborator Author


kk, I can update as per #3178

thanks @imjasonh !

Code under review:

    expectedKey := fmt.Sprintf("TaskRun/%p", tr)
    if tr.GetRunKey() != expectedKey {
        t.Fatalf("Expected taskrun key to be %s but got %s", expectedKey, tr.GetRunKey())
    tr := tb.TaskRun("trunname", tb.TaskRunNamespace("foo"))
Member


tr := &v1beta1.TaskRun{
  ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "trunname"},
}

🙏

@bobcatfish
Collaborator Author

I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter to enqueue a reconciliation in the future when the timeout should be elapsed.

iiiiinteresting - I can try that out as part of #2905. I guess the only downside is that we get 1 extra reconcile for every Run? (which, ironically, is the current state that this PR is undoing) Seems worth it to me tho!

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2020
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/timeout/handler.go 77.4% 76.1% -1.3

Member

@imjasonh imjasonh left a comment


/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 9, 2020
@imjasonh
Member

imjasonh commented Sep 9, 2020

/approve

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ImJasonH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 9, 2020
@tekton-robot tekton-robot merged commit ee41ce3 into tektoncd:master Sep 9, 2020