Actually stop trying to time out finished Runs ⏰ #3078
Conversation
The following is the coverage report on the affected files.
coverage dropped a bit here but note I'm next planning to resurrect #3031 and significantly increase coverage!
oooo nice thanks linter 🙏
Force-pushed from 71229f7 to 1f73bca
Weird that the v1alpha1 test failed and v1beta1 succeeded 🤔
One comment on the key used, otherwise sgtm
/lgtm
I don't want to introduce new flakiness, but I do want to see if this happens consistently; I haven't been able to repro yet: /test pull-tekton-pipeline-integration-tests
I looked into that test failure more - I'm pretty disturbed that a test directly related to the changes I'm making just happened to flake, but it kinda seems like that's what happened. The test that fails runs a pipeline with a task that is expected to succeed and a task that is expected to time out; the task that failed was actually the one that is supposed to succeed with:
The task is supposed to sleep for 1s (with a 1 min timeout) and then succeed, but the pod was stuck in "PodInitializing" for reasons now lost in the sands of time. So it doesn't SEEM like this is a real problem 🤞 🤞 🤞
/test pull-tekton-pipeline-integration-tests
Thinking about it further, I think if an e2e test is gonna flake, the probability that a timeout-related test would flake might actually be pretty high - there are a couple of them, run for both v1alpha1 and v1beta1.
I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter
to enqueue a reconciliation in the future when the timeout should be elapsed. But this fix lgtm even without that cleanup.
if pr.GetRunKey() != expectedKey {
	t.Fatalf("Expected taskrun key to be %s but got %s", expectedKey, pr.GetRunKey())
func TestGetNamespacedName(t *testing.T) {
	pr := tb.PipelineRun("prunname", tb.PipelineRunNamespace("foo"))
pr := &v1beta1.PipelineRun{
ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "prunname"},
}
Please 🙏
expectedKey := fmt.Sprintf("TaskRun/%p", tr)
if tr.GetRunKey() != expectedKey {
	t.Fatalf("Expected taskrun key to be %s but got %s", expectedKey, tr.GetRunKey())
	tr := tb.TaskRun("trunname", tb.TaskRunNamespace("foo"))
tr := &v1beta1.TaskRun{
ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "trunname"},
}
🙏
iiiiinteresting - I can try that out as part of #2905. I guess the only downside is that we get one extra reconcile for every Run? (Which, ironically, is the current state that this PR is undoing.) Seems worth it to me though!
Force-pushed from 1f73bca to 2611ebf
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ImJasonH. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Changes
In 10b6427 I got really enthusiastic about making sure even our
reads were threadsafe and so I thought I would be clever and,
instead of accessing attributes of a PipelineRun or TaskRun in
a go routine, use a value that wouldn't change - specifically the address.
But the address will change between reconcile loops, because the
reconcile logic will create a new instance of the Run object every time!
🤦♀️
Fortunately this doesn't cause any serious problems, it just makes
things slightly less efficient: for every Run you start, a goroutine
will remain open until the timeout occurs, and when it fires, the Run will be
reconciled an extra time, even if it has completed. (In fact keeping
that behavior and dropping the "done" map might be a
reasonable option!)
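The pointer-key pitfall can be seen in a few lines (a toy `TaskRun` struct, not the real API type): because each reconcile loop builds a fresh instance, an address-based key like `fmt.Sprintf("TaskRun/%p", tr)` never matches the key recorded on a previous loop.

```go
package main

import "fmt"

// Toy stand-in for the real TaskRun type, just to show the key mismatch.
type TaskRun struct{ Namespace, Name string }

// runKey mimics the buggy address-based key from 10b6427.
func runKey(tr *TaskRun) string { return fmt.Sprintf("TaskRun/%p", tr) }

func main() {
	a := &TaskRun{"foo", "trunname"}
	b := &TaskRun{"foo", "trunname"} // fresh instance, as each reconcile creates
	// Same Run, different addresses: the keys never match across loops.
	fmt.Println(runKey(a) == runKey(b)) // false
}
```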
With this change, we now return to using the namespace + name as a key
in the map that tracks the done channels; we pass these by value so that
reads will be threadsafe.
Instead of fixing this separately for the TaskRun and PipelineRun
functions, I've collapsed these and the callback into one. Each handler
instantiates its own timeout handler so there is no reason for the
timeout handler to have special knowledge of one vs the other.
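A simplified sketch of that scheme (the names `timeoutHandler`, `getOrCreateDone`, and `release` are illustrative, not the actual Tekton implementation): the done channels live in a mutex-guarded map keyed by namespace + name, values that are stable across reconcile loops, so releasing a completed Run actually finds and closes its channel.

```go
package main

import (
	"fmt"
	"sync"
)

// timeoutHandler tracks a done channel per Run, keyed by namespace/name.
type timeoutHandler struct {
	mu   sync.Mutex
	done map[string]chan struct{}
}

func key(namespace, name string) string { return namespace + "/" + name }

// getOrCreateDone returns the done channel for a Run, creating it if needed.
// Unlike a pointer-based key, the same namespace/name always finds the same entry.
func (h *timeoutHandler) getOrCreateDone(namespace, name string) chan struct{} {
	h.mu.Lock()
	defer h.mu.Unlock()
	k := key(namespace, name)
	if ch, ok := h.done[k]; ok {
		return ch
	}
	ch := make(chan struct{})
	h.done[k] = ch
	return ch
}

// release closes and removes the channel when the Run completes, so the
// goroutine waiting on the timeout exits instead of firing an extra reconcile.
func (h *timeoutHandler) release(namespace, name string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	k := key(namespace, name)
	if ch, ok := h.done[k]; ok {
		close(ch)
		delete(h.done, k)
	}
}

func main() {
	h := &timeoutHandler{done: map[string]chan struct{}{}}
	ch := h.getOrCreateDone("foo", "prunname")
	fmt.Println(ch == h.getOrCreateDone("foo", "prunname")) // same channel both times
	h.release("foo", "prunname")
	select {
	case <-ch:
		fmt.Println("released")
	default:
		fmt.Println("still waiting")
	}
}
```

Passing namespace and name by value means the waiting goroutine never reads fields of a shared Run object, which is what makes the reads threadsafe.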
Fixes #3047
Test
I tried several different approaches to add a test case that would
reveal the underlying problem but I now feel like it's more hassle than
it's worth. Approaches:
1. Instantiate the controller in the reconciler tests with a custom
   timeout handler that has been overridden to use a custom logger,
   so we can check for the log indicating the timeout handler completed
2. Similar to (1) but instead of checking logs, just pass in a custom
   done channel and wait for it to close
Both 1 + 2 require changing the way that NewController works, i.e. the
way we always instantiate controllers. I tried working around this by
taking the same approach as TestHandlePodCreationError and
instantiating my own Reconciler, but it a) wasn't instantiated properly
no matter what I tried (trying to use it caused panics) and b) had a
confusingly different interface, exposing ReconcileKind instead of
Reconcile.
I tried some other approaches but these went nowhere either; I don't
think it's worth adding a test to cover this, but if folks feel strongly
I don't mind opening an issue to continue exploring it. I feel that this
bug is very specific to the implementation and I'm not sure how valuable
a test that covers it would be. If we do pursue it, we might want to do
it at the level of an end-to-end test that actually checks the logs from
a real running controller.
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
cmd dir: please update the release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes