Mark TEP 0002 as implementable #143
Conversation
We need to list those questions somewhere though :) to make sure we don't forget to answer those 😉
Document open questions that should be resolved before/during implementation
Good call. Added to the bottom of the TEP: https://github.com/tektoncd/community/pull/143/files#diff-05e21e3acc67d41d85eb757b7c95ec9e (let me know if I missed any!)
I approve of this! Not sure how many of us need to approve - @vdemeester took an action to look into a policy around this - @vdemeester maybe we can start with using the same policy we have in the pipelines repo? https://github.com/tektoncd/pipeline/blob/master/api_compatibility_policy.md#approving-api-changes
(if not, we should update that policy also)
# Open Questions

* Should Tekton's controller be responsible for updating `Run`s' `.status.conditions` in the case of cancellation and timeout (as it does when enforcing [initial update timeout](#initial-update-timeout)), or should these updates be the sole responsibility of Custom Task authors?
ooo ooo ooo i have an opinion here
how about following the same pattern as we do with TaskRuns and PipelineRuns: we require a "cancel" field in the spec of the Run, and Tekton controller is responsible for setting that field
having more than one controller potentially operating on the status feels like a recipe for race conditions and confusing behavior, but this way the responsibilities are clear: the Tekton controller can express things it wants via the spec, and the status expresses what actually happens (as realized by the custom controller)
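To make that split of responsibilities concrete, here's a minimal sketch in Go, with toy types standing in for the real `Run` CRD; the field names (e.g. `Spec.Status`, the `"RunCancelled"` value) are assumptions modeled on TaskRun cancellation, not the final API:

```go
// Sketch: the Tekton controller only ever writes the spec; the custom
// controller only ever writes the status. All names here are illustrative.
package main

import "fmt"

type RunSpec struct {
	// Status is a request from the Tekton controller, e.g. "RunCancelled".
	Status string
}

type Condition struct {
	Type   string // e.g. "Succeeded"
	Status string // "True", "False", or "Unknown"
	Reason string
}

type RunStatus struct {
	Conditions []Condition
}

type Run struct {
	Spec   RunSpec
	Status RunStatus
}

// cancelRun is what the Tekton (PipelineRun) controller would do: express
// intent in the spec and leave the status alone.
func cancelRun(r *Run) {
	r.Spec.Status = "RunCancelled"
}

// reconcile is what the custom controller would do: observe the spec and
// report what actually happened in the status.
func reconcile(r *Run) {
	if r.Spec.Status == "RunCancelled" {
		r.Status.Conditions = []Condition{{
			Type: "Succeeded", Status: "False", Reason: "RunCancelled",
		}}
	}
}

func main() {
	r := &Run{}
	cancelRun(r)
	reconcile(r)
	fmt.Printf("%+v\n", r.Status.Conditions)
}
```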
The real crux of the question is how a Pipeline should react to a badly-behaving Custom Task controller.
In the TEP today, I think we generally agree that the initial update timeout is reasonable -- please let me know if you disagree. This guards against entirely absent controllers (e.g., they're not installed, or they're crash-looping). A PipelineRun that includes a Run that we think is never going to see an update should just force-fail the offending Run.
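For illustration, a sketch of that initial-update guard, assuming "never saw an update" is observable as an empty status; the type, field names, and timeout value are all made up:

```go
// Sketch of the initial-update-timeout guard: force-fail a Run whose
// controller has never reported anything, e.g. because it isn't installed
// or is crash-looping. Names are illustrative, not the TEP's final API.
package main

import (
	"fmt"
	"time"
)

const initialUpdateTimeout = 2 * time.Minute

type Run struct {
	CreationTime time.Time
	HasStatus    bool // true once the custom controller has reported anything
	Failed       bool
	Reason       string
}

func enforceInitialUpdateTimeout(r *Run, now time.Time) {
	if !r.HasStatus && now.Sub(r.CreationTime) > initialUpdateTimeout {
		r.Failed = true
		r.Reason = "RunTimedOutWaitingForUpdate"
	}
}

func main() {
	r := &Run{CreationTime: time.Now().Add(-10 * time.Minute)}
	enforceInitialUpdateTimeout(r, time.Now())
	fmt.Println(r.Failed, r.Reason) // true RunTimedOutWaitingForUpdate
}
```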
The open question is how we should respond when a Run is cancelled (`.spec.status=Cancelled`) but hasn't reported that it's finished (`.status.conditions[Succeeded]=False`). When a PipelineRun is cancelled it should propagate `.spec.status=Cancelled`, but if it doesn't also set the `.status`, does it wait for the external controller to update it? How long should it wait? Having this cancelling-but-not-yet-finished state adds a possibly-confusing new node to the state diagram, and introduces room for bugs.
Same question for timeout -- what should the PipelineRun controller do when it can see one of its Runs has been running longer than its defined `.spec.timeout`, but hasn't reported that it's finished yet? Just wait? How long? What if the external controller isn't listening anymore?
My personal opinion, given the above, is that Tekton should take responsibility for as much as it can, since it removes room for undefined behavior in the face of badly-behaving external controllers. It'll already be responsible for force-failing "abandoned" Runs. It can also take responsibility for reacting to `.spec.status=Cancelled` by also setting `.status.conditions`, and for noticing when `.spec.timeout` has elapsed and setting `.status.conditions`. This prevents the case where a controller never marks a Run as complete -- eventually, the timeout will be enforced by Tekton, and PipelineRun execution can at least proceed.
The downside, as you've noted, is distributed responsibility, and therefore race conditions. I think one solution to this is enforcing (via webhook validation) that a `Run`'s `.status.conditions` cannot transition from finished to ongoing -- this will prevent a laggy external controller from reporting a Run that was previously considered complete to now be ongoing. If we don't prevent that transition, what should a PipelineRun controller do when it observes that nonsense transition anyway? I think we should also prevent state transitions from one flavor of finished to another (e.g., prevent a Run that's now `Successful` from becoming `Failed`) for the same reason -- if a PipelineRun has reported that it's finished successfully, how should it react when its constituent Runs no longer all say they completed successfully?
Ultimately I think it's a judgement call between trusting external controllers to be fairly well-behaved (something we can help along with well-behaved forkable samples), or not completely trusting them and coding defensively around the possibility that they'll do something wonky.
> The open question is how we should respond when a Run is cancelled (`.spec.status=Cancelled`) but hasn't reported that it's finished (`.status.conditions[Succeeded]=False`). When a PipelineRun is cancelled it should propagate `.spec.status=Cancelled`, but if it doesn't also set the `.status`, does it wait for the external controller to update it? How long should it wait?
Couple of other ideas:
- Have a timeout on how long after setting `.spec.status=Cancelled` we expect `.status.conditions` to be updated (ew more timeouts)
- Don't wait at all: mark `.spec.status=Cancelled`, don't wait for the `.status.conditions` to get updated (might be reasonable to do the same in the taskrun case as well) - this has some downsides in that the pipelinerun would be considered done while something is still running, but i think it might be easier to reason about
- Be okay with a poorly behaved controller keeping a PipelineRun running indefinitely: this seems okay to me - if you made a bug in your controller such that you aren't cancelling correctly, forcing the status into cancelled won't necessarily solve the problem (there's no guarantee the custom controller would do what we expect when this happens)
> Have a timeout on how long after setting `.spec.status=Cancelled` we expect `.status.conditions` to be updated (ew more timeouts)
Yeah, I'm currently at "ew more timeouts" but I'd love to hear more feedback. In general having a state between "running" and "fully cancelled" feels like it will be confusing to users, and confusing to our own code.
> Don't wait at all: mark `.spec.status=Cancelled`, don't wait for the `.status.conditions` to get updated (might be reasonable to do the same in the taskrun case as well) - this has some downsides in that the pipelinerun would be considered done while something is still running, but i think it might be easier to reason about
This also adds a new state -- `.spec.status=Cancelled`-but-`.status.conditions`-is-ongoing-and-PipelineRun-ignores-it. Having a state where a PipelineRun willfully ignores a Run's `.status.conditions` feels like it will make things harder to reason about; today we consider `.status.conditions` as the definitive answer about whether something is ongoing/blocking, and that's been nice so far.
> Be okay with a poorly behaved controller keeping a PipelineRun running indefinitely: this seems okay to me - if you made a bug in your controller such that you aren't cancelling correctly, forcing the status into cancelled won't necessarily solve the problem (there's no guarantee the custom controller would do what we expect when this happens)
The problem there is that it's not the controller author that has a bad time, it's the user left wondering why their PipelineRun is spinning forever, and they might think it's Tekton's fault and not their external task implementor's fault. (And they'd kind of be right -- we let the external task author provide them a bad experience!)
Forcing the `Run` to cancelled doesn't necessarily propagate that signal downstream -- we still can't do anything about that -- but it at least keeps the PipelineRun on track doing what it should be doing. And the lack of downstream cancel propagation is easier to blame on the custom task author, since Tekton's done all it can to send that signal and keep the PipelineRun chugging along.
I'm convinced about the first 2 options!
> The problem there is that it's not the controller author that has a bad time, it's the user left wondering why their PipelineRun is spinning forever, and they might think it's Tekton's fault and not their external task implementor's fault. (And they'd kind of be right -- we let the external task author provide them a bad experience!)
We aren't responsible for the experience that the external task author gave - i think that's the crux of this debate: imo by opening up this flexibility, it's by design not Tekton's responsibility. We define the expected behavior, we can even provide tests to help verify this contract, but I think working around a buggy implementation is beyond the scope of what Tekton should do.
If the PipelineRun keeps spinning forever b/c the custom task controller doesn't update the status, i think that's an accurate representation of the state of the world.
That's fair, and I think that's a good distillation of the central issue.
Ultimately the well-behavedness logic is going to go somewhere. It'll either go in Tekton-core, or in the sample-task implementation which we hope/expect most integrators to use to get started. If it's in Tekton-core we can be more sure that all controllers are well-behaved, and if it goes in sample-task we can just say: if you didn't build on our preferred/blessed base, or if you changed it to remove the training wheels, well, that's on you. I'm pretty okay with either, to be honest. 🤷♂️
The question then becomes, do we want any sanity checks enforced by Tekton-core? Under the current TEP, Tekton will do an initial-update check, which guards against totally absent/crashing controllers. That's easy enough to drop, but then how should users diagnose a perma-spinning PipelineRun? Check that the custom-task controller pod isn't in `CrashLoopBackoff`?
We could also enforce valid state transitions, e.g., that a `Run` shouldn't be able to:

- transition from done -> running
- transition from success <-> failure
- transition from cancelled (`.spec.status=Cancelled`) to un-cancelled
- change its `Reason` (or `Message`?) after it's finished

...and if such a thing ever did happen, how should a PipelineRun behave? Preventing these states seems simpler to me than figuring out how to handle them. (These could also be prevented for TaskRuns and PipelineRuns, but today... aren't 🤷♂️ -- this could prevent some bugs and race conditions in Tekton-core today.)
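For example, a validating webhook could reject the first two transitions with a check along these lines. This is only a sketch: the condition model is an assumption standing in for knative's `apis.Condition`, and the function names are made up.

```go
// Sketch of a webhook-style transition check comparing old and new Run
// status. "Succeeded" with status "Unknown" means ongoing; "True"/"False"
// mean finished successfully/unsuccessfully.
package main

import "fmt"

type Condition struct {
	Type   string // "Succeeded"
	Status string // "True", "False", or "Unknown"
}

func isFinished(c Condition) bool {
	return c.Type == "Succeeded" && c.Status != "Unknown"
}

// validateTransition rejects the illegal transitions discussed above:
// done -> running, and flipping between success and failure.
func validateTransition(prev, next Condition) error {
	if isFinished(prev) && !isFinished(next) {
		return fmt.Errorf("cannot transition from finished (%s) back to ongoing", prev.Status)
	}
	if isFinished(prev) && isFinished(next) && prev.Status != next.Status {
		return fmt.Errorf("cannot flip between finished states (%s -> %s)", prev.Status, next.Status)
	}
	return nil
}

func main() {
	err := validateTransition(
		Condition{Type: "Succeeded", Status: "True"},
		Condition{Type: "Succeeded", Status: "Unknown"},
	)
	fmt.Println(err) // finished -> ongoing is rejected
}
```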
* Package name and helper methods included in `tektoncd/pipeline` repo to aid Custom Task authors writing their controllers in Go; and should we expect them to use `knative/pkg` helpers too?

* Versioning and release cadence and ownership of `tektoncd/sample-task` repo; will it be released/versioned alongside `tektoncd/pipeline`?
is sample-task different from the package name and helper methods above?
i suggest we start releasing and versioning it separately to get a feel for when it actually needs updates and then decide based on what we see happening (if there is no dependency b/w the two, coupling them seems arbitrary)
Yeah, there are two things I'd like us to provide:

- A package of helper methods in `tektoncd/pipeline`, versioned with Tekton releases
- A sample-task template repo that's forkable, and can give authors a good starting point to build new custom tasks. This would necessarily use some of the helpers in (1). It probably doesn't need to be versioned (authors can just fork from HEAD), but I think there's an open question about whether it should only depend on tagged/released `tektoncd/pipeline` code, or on HEAD (probably the former?)
> A package of helper methods in `tektoncd/pipeline`, versioned with Tekton releases
when you say helpers, what kinda stuff do you have in mind?
`FilterRunRef`, which is used by a controller to request a filtered list of `Run` updates for only those that match their specified `apiVersion`+`kind`.

And some reflect-based param-placeholder-replacing helper that would likely either reuse large chunks of existing Tekton code, or be so generic as to be directly usable for TaskRun/PipelineRun placeholder resolving.
Beyond that nothing jumps to mind, but we might come up with more in the course of writing more custom tasks.
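To give a flavor of the param-replacement idea: the real helper would walk arbitrary structs via reflection, but this flat string version shows the substitution contract. Every name here is illustrative, not an actual Tekton API.

```go
// Sketch: substitute $(params.<name>) placeholders in a string, the core of
// what a reflect-based helper would apply across a whole spec.
package main

import (
	"fmt"
	"strings"
)

func applyParams(s string, params map[string]string) string {
	for name, value := range params {
		s = strings.ReplaceAll(s, fmt.Sprintf("$(params.%s)", name), value)
	}
	return s
}

func main() {
	out := applyParams("sleep $(params.duration)", map[string]string{"duration": "10s"})
	fmt.Println(out) // sleep 10s
}
```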
Interesting! Do you see the Pipelines controller using FilterRunRef? It seems kinda like it might only be used by the custom controller 🤔
I'm actually starting to wonder if it will make sense in the long run to have the Run type and supporting libs like this together in the separate repo, imported by tekton pipelines
(not something we have to decide right away but just wondering)
I don't see Tekton's own controllers using `FilterRunRef`, since if any part of Tekton cares about `Run`s (tbd, see other thread) then it will care about all of them, and not just those of a given ref-type.

The main reason for putting it in the `tektoncd/pipeline` repo is to avoid repo proliferation and interdependency hell. We could have a `tektoncd/pkg` that only provides helper methods for custom task authors, but then it would basically just have these two methods, and since the param-replacement method would hopefully be used by `tektoncd/pipeline` itself, it would have to import that repo, which means it would have to be released/versioned in cadence with `tektoncd/pipeline`, and I figured ugh, let's just put it in the main repo directly. Also setting up CI and OWNERS for this new repo, ugh.
FWIW the PR to add `FilterRunRef` to `tektoncd/pipeline` is here: tektoncd/pipeline#2915
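For context, a sketch of how a custom controller might use a `FilterRunRef`-style predicate to ignore Runs that reference other custom task types. The signature mirrors the one discussed here but is reproduced as an assumption (check the linked PR for the real location and signature), and the toy `Run` type stands in for the actual CRD.

```go
// Sketch: a predicate a controller would install as its informer
// event-handler filter so it only reconciles its own Runs.
package main

import "fmt"

type RunRef struct{ APIVersion, Kind, Name string }

type Run struct{ Ref *RunRef }

// FilterRunRef returns a predicate matching only Runs whose ref has the
// given apiVersion and kind.
func FilterRunRef(apiVersion, kind string) func(interface{}) bool {
	return func(obj interface{}) bool {
		r, ok := obj.(*Run)
		return ok && r.Ref != nil &&
			r.Ref.APIVersion == apiVersion && r.Ref.Kind == kind
	}
}

func main() {
	onlyCEL := FilterRunRef("custom.tekton.dev/v1alpha1", "CEL")
	fmt.Println(onlyCEL(&Run{Ref: &RunRef{APIVersion: "custom.tekton.dev/v1alpha1", Kind: "CEL"}})) // true
	fmt.Println(onlyCEL(&Run{Ref: &RunRef{Kind: "Wait"}}))                                          // false
}
```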
* Support for "unnamed" Tasks -- i.e., `Run`s that reference an `apiVersion` and `kind`, but not a `name`. A Custom Task author could either use this to provide "default" behavior where a Task CR doesn't need to be defined, or could not define a CRD at all and only support functionality specified by params. Examples of this are `CEL` and `Wait` tasks that just accept a param for `expression` or `duration`, and don't require defining a `CEL` or `Wait` CRD type.
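A sketch of what handling such an unnamed ref might look like in a hypothetical `Wait` controller; all types and semantics here are illustrative, not a real Tekton API:

```go
// Sketch: a Run with apiVersion+kind but no name, where the controller
// derives all behavior from params instead of a CRD instance.
package main

import (
	"fmt"
	"time"
)

type RunRef struct{ APIVersion, Kind, Name string }

type Run struct {
	Ref    RunRef
	Params map[string]string
}

// reconcileWait handles a name-less Wait ref: no Wait CR exists, so the
// duration comes entirely from the Run's params.
func reconcileWait(r Run) (time.Duration, error) {
	if r.Ref.Name != "" {
		return 0, fmt.Errorf("expected an unnamed ref, got name %q", r.Ref.Name)
	}
	return time.ParseDuration(r.Params["duration"])
}

func main() {
	d, err := reconcileWait(Run{
		Ref:    RunRef{APIVersion: "custom.tekton.dev/v1alpha1", Kind: "Wait"},
		Params: map[string]string{"duration": "10s"},
	})
	fmt.Println(d, err) // 10s <nil>
}
```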
i think we should include this - i suppose it doesn't have to be in the first iteration
Yeah in general I agree, but it hasn't been discussed any more widely than between you and me, so I wanted to keep it as an open question in case others had feedback.
We've merged a PR associated with this; we should probably merge this with `implementable` status @vdemeester @afrittoli?
Indeed, it is implementable. We may need to update when we have answers to the open questions, but as is, `implementable` seems good 👌
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
As described in #142
The remaining open questions are relatively small and should not block initial implementation steps, as they can be discussed before implementing, or during review of implementation PRs.
/assign @afrittoli
/assign @vdemeester