Mark TEP 0002 as implementable #143
Conversation
We need to list those questions somewhere though :) to make sure we don't forget to answer those 😉
Document open questions that should be resolved before/during implementation
Good call. Added to the bottom of the TEP: https://github.com/tektoncd/community/pull/143/files#diff-05e21e3acc67d41d85eb757b7c95ec9e (let me know if I missed any!)
I approve of this! Not sure how many of us need to approve - @vdemeester took an action to look into a policy around this - @vdemeester maybe we can start with using the same policy we have in the pipelines repo? https://github.com/tektoncd/pipeline/blob/master/api_compatibility_policy.md#approving-api-changes
(if not, we should update that policy also)
# Open Questions

* Should Tekton's controller be responsible for updating `Run`s' `.status.conditions` in the case of cancellation and timeout (as it does when enforcing [initial update timeout](#initial-update-timeout)), or should these updates be the sole responsibility of Custom Task authors?
ooo ooo ooo i have an opinion here
how about following the same pattern as we do with TaskRuns and PipelineRuns: we require a "cancel" field in the spec of the Run, and Tekton controller is responsible for setting that field
having more than one controller potentially operating on the status feels like a recipe for race conditions and confusing behavior, but this way the responsibilities are clear: the Tekton controller can express things it wants via the spec, and the status expresses what actually happens (as realized by the custom controller)
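To make that split of responsibilities concrete, here's a minimal sketch in Go, with toy types standing in for the real `Run` CRD; the field names (e.g. `Spec.Status`, the `"RunCancelled"` value) are assumptions modeled on TaskRun cancellation, not the final API:

```go
// Sketch: the Tekton controller only ever writes the spec; the custom
// controller only ever writes the status. All names here are illustrative.
package main

import "fmt"

type RunSpec struct {
	// Status is a request from the Tekton controller, e.g. "RunCancelled".
	Status string
}

type Condition struct {
	Type   string // e.g. "Succeeded"
	Status string // "True", "False", or "Unknown"
	Reason string
}

type RunStatus struct {
	Conditions []Condition
}

type Run struct {
	Spec   RunSpec
	Status RunStatus
}

// cancelRun is what the Tekton (PipelineRun) controller would do: express
// intent in the spec and leave the status alone.
func cancelRun(r *Run) {
	r.Spec.Status = "RunCancelled"
}

// reconcile is what the custom controller would do: observe the spec and
// report what actually happened in the status.
func reconcile(r *Run) {
	if r.Spec.Status == "RunCancelled" {
		r.Status.Conditions = []Condition{{
			Type: "Succeeded", Status: "False", Reason: "RunCancelled",
		}}
	}
}

func main() {
	r := &Run{}
	cancelRun(r)
	reconcile(r)
	fmt.Printf("%+v\n", r.Status.Conditions)
}
```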
The real crux of the question is how a Pipeline should react to a badly-behaving Custom Task controller.
In the TEP today, I think we generally agree that the initial update timeout is reasonable -- please let me know if you disagree. This guards against entirely absent controllers (e.g., they're not installed, or they're crash-looping). A PipelineRun that includes a Run that we think is never going to see an update should just force-fail the offending Run.
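For illustration, a sketch of that initial-update guard, assuming "never saw an update" is observable as an empty status; the type, field names, and timeout value are all made up:

```go
// Sketch of the initial-update-timeout guard: force-fail a Run whose
// controller has never reported anything, e.g. because it isn't installed
// or is crash-looping. Names are illustrative, not the TEP's final API.
package main

import (
	"fmt"
	"time"
)

const initialUpdateTimeout = 2 * time.Minute

type Run struct {
	CreationTime time.Time
	HasStatus    bool // true once the custom controller has reported anything
	Failed       bool
	Reason       string
}

func enforceInitialUpdateTimeout(r *Run, now time.Time) {
	if !r.HasStatus && now.Sub(r.CreationTime) > initialUpdateTimeout {
		r.Failed = true
		r.Reason = "RunTimedOutWaitingForUpdate"
	}
}

func main() {
	r := &Run{CreationTime: time.Now().Add(-10 * time.Minute)}
	enforceInitialUpdateTimeout(r, time.Now())
	fmt.Println(r.Failed, r.Reason) // true RunTimedOutWaitingForUpdate
}
```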
The open question is how we should respond when a Run is cancelled (`.spec.status=Cancelled`) but hasn't reported that it's finished (`.status.conditions[Succeeded]=False`). When a PipelineRun is cancelled it should propagate `.spec.status=Cancelled`, but if it doesn't also set the `.status`, does it wait for the external controller to update it? How long should it wait? Having this cancelling-but-not-yet-finished state adds a possibly-confusing new node to the state diagram, and introduces room for bugs.
Same question for timeout -- what should the PipelineRun controller do when it can see one of its Runs has been running longer than its defined `.spec.timeout`, but hasn't reported that it's finished yet? Just wait? How long? What if the external controller isn't listening anymore?
My personal opinion, given the above, is that Tekton should take responsibility for as much as it can, since it removes room for undefined behavior in the face of badly-behaving external controllers. It'll already be responsible for force-failing "abandoned" Runs. It can also take responsibility for reacting to `.spec.status=Cancelled` by also setting `.status.conditions`, and for noticing when `.spec.timeout` has elapsed and setting `.status.conditions`. This prevents the case where a controller never marks a Run as complete -- eventually, the timeout will be enforced by Tekton, and PipelineRun execution can at least proceed.
The downside, as you've noted, is distributed responsibility, and therefore race conditions. I think one solution to this is enforcing (via webhook validation) that a `Run`'s `.status.conditions` cannot transition from finished to ongoing -- this will prevent a laggy external controller from reporting a Run that was previously considered complete to now be ongoing. If we don't prevent that transition, what should a PipelineRun controller do when it observes that nonsense transition anyway? I think we should also prevent state transitions from one flavor of finished to another (e.g., prevent a Run that's now `Successful` from becoming `Failed`) for the same reason -- if a PipelineRun has reported that it's finished successfully, how should it react when its constituent Runs no longer all say they completed successfully?
Ultimately I think it's a judgement call between trusting external controllers to be fairly well-behaved (something we can help along with well-behaved forkable samples), or not completely trusting them and coding defensively around the possibility that they'll do something wonky.
> The open question is how we should respond when a Run is cancelled (`.spec.status=Cancelled`) but hasn't reported that it's finished (`.status.conditions[Succeeded]=False`). When a PipelineRun is cancelled it should propagate `.spec.status=Cancelled`, but if it doesn't also set the `.status`, does it wait for the external controller to update it? How long should it wait?
Couple of other ideas:
- Have a timeout on how long after setting `.spec.status=Cancelled` we expect `.status.conditions` to be updated (ew more timeouts)
- Don't wait at all: mark `.spec.status=Cancelled`, don't wait for the `.status.conditions` to get updated (might be reasonable to do the same in the taskrun case as well) - this has some downsides in that the pipelinerun would be considered done while something is still running, but i think it might be easier to reason about
- Be okay with a poorly behaved controller keeping a PipelineRun running indefinitely: this seems okay to me - if you made a bug in your controller such that you aren't cancelling correctly, forcing the status into cancelled won't necessarily solve the problem (there's no guarantee the custom controller would do what we expect when this happens)
> Have a timeout on how long after setting `.spec.status=Cancelled` we expect `.status.conditions` to be updated (ew more timeouts)
Yeah, I'm currently at "ew more timeouts" but I'd love to hear more feedback. In general having a state between "running" and "fully cancelled" feels like it will be confusing to users, and confusing to our own code.
> Don't wait at all: mark `.spec.status=Cancelled`, don't wait for the `.status.conditions` to get updated (might be reasonable to do the same in the taskrun case as well) - this has some downsides in that the pipelinerun would be considered done while something is still running, but i think it might be easier to reason about
This also adds a new state -- `.spec.status=Cancelled`-but-`.status.conditions`-is-ongoing-and-PipelineRun-ignores-it. Having a state where a PipelineRun willfully ignores a Run's `.status.conditions` feels like it will make things harder to reason about; today we consider `.status.conditions` as the definitive answer about whether something is ongoing/blocking, and that's been nice so far.
> Be okay with a poorly behaved controller keeping a PipelineRun running indefinitely: this seems okay to me - if you made a bug in your controller such that you aren't cancelling correctly, forcing the status into cancelled won't necessarily solve the problem (there's no guarantee the custom controller would do what we expect when this happens)
The problem there is that it's not the controller author that has a bad time, it's the user left wondering why their PipelineRun is spinning forever, and they might think it's Tekton's fault and not their external task implementor's fault. (And they'd kind of be right -- we let the external task author provide them a bad experience!)
Forcing the `Run` to cancelled doesn't necessarily propagate that signal downstream -- we still can't do anything about that -- but it at least keeps the PipelineRun on track doing what it should be doing. And the lack of downstream cancel propagation is easier to blame on the custom task author, since Tekton's done all it can to send that signal and keep the PipelineRun chugging along.
I'm convinced about the first 2 options!
> The problem there is that it's not the controller author that has a bad time, it's the user left wondering why their PipelineRun is spinning forever, and they might think it's Tekton's fault and not their external task implementor's fault. (And they'd kind of be right -- we let the external task author provide them a bad experience!)
We aren't responsible for the experience that the external task author gave - i think that's the crux of this debate: imo by opening up this flexibility, it's by design not Tekton's responsibility. We define the expected behavior, we can even provide tests to help verify this contract, but I think working around a buggy implementation is beyond the scope of what Tekton should do.
If the PipelineRun keeps spinning forever b/c the custom task controller doesn't update the status, i think that's an accurate representation of the state of the world.
That's fair, and I think that's a good distillation of the central issue.
Ultimately the well-behavedness logic is going to go somewhere. It'll either go in Tekton-core, or in the sample-task implementation which we hope/expect most integrators to use to get started. If it's in Tekton-core we can be more sure that all controllers are well-behaved, and if it goes in sample-task we can just say: if you didn't build on our preferred/blessed base, or if you changed it to remove the training wheels, well, that's on you. I'm pretty okay with either, to be honest. 🤷♂️
The question then becomes, do we want any sanity checks enforced by Tekton-core? Under the current TEP, Tekton will do an initial-update check, which guards against totally absent/crashing controllers. That's easy enough to drop, but then how should users diagnose a perma-spinning PipelineRun? Check that the custom-task controller pod isn't in `CrashLoopBackoff`?
We could also enforce valid state transitions, e.g., that a `Run` shouldn't be able to:

- transition from done -> running
- transition from success <-> failure
- transition from cancelled (`.spec.status=Cancelled`) to un-cancelled
- change its `Reason` (or `Message`?) after it's finished

...and if such a thing ever did happen, how should a PipelineRun behave? Preventing these states seems simpler to me than figuring out how to handle them. (These could also be prevented for TaskRuns and PipelineRuns, but today... aren't 🤷♂️ -- this could prevent some bugs and race conditions in Tekton-core today.)
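For example, a validating webhook could reject the first two transitions with a check along these lines. This is only a sketch: the condition model is an assumption standing in for knative's `apis.Condition`, and the function names are made up.

```go
// Sketch of a webhook-style transition check comparing old and new Run
// status. "Succeeded" with status "Unknown" means ongoing; "True"/"False"
// mean finished successfully/unsuccessfully.
package main

import "fmt"

type Condition struct {
	Type   string // "Succeeded"
	Status string // "True", "False", or "Unknown"
}

func isFinished(c Condition) bool {
	return c.Type == "Succeeded" && c.Status != "Unknown"
}

// validateTransition rejects the illegal transitions discussed above:
// done -> running, and flipping between success and failure.
func validateTransition(prev, next Condition) error {
	if isFinished(prev) && !isFinished(next) {
		return fmt.Errorf("cannot transition from finished (%s) back to ongoing", prev.Status)
	}
	if isFinished(prev) && isFinished(next) && prev.Status != next.Status {
		return fmt.Errorf("cannot flip between finished states (%s -> %s)", prev.Status, next.Status)
	}
	return nil
}

func main() {
	err := validateTransition(
		Condition{Type: "Succeeded", Status: "True"},
		Condition{Type: "Succeeded", Status: "Unknown"},
	)
	fmt.Println(err) // finished -> ongoing is rejected
}
```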
* Package name and helper methods included in `tektoncd/pipeline` repo to aid Custom Task authors writing their controllers in Go; and should we expect them to use `knative/pkg` helpers too?

* Versioning and release cadence and ownership of `tektoncd/sample-task` repo; will it be released/versioned alongside `tektoncd/pipeline`?
is sample-task different from the package name and helper methods above?
i suggest we start releasing and versioning it separately to get a feel for when it actually needs updates and then decide based on what we see happening (if there is no dependency b/w the two, coupling them seems arbitrary)
Yeah, there are two things I'd like us to provide:

- A package of helper methods in `tektoncd/pipeline`, versioned with Tekton releases
- A sample-task template repo that's forkable, and can give authors a good starting point to build new custom tasks. This would necessarily use some of the helpers in (1). It probably doesn't need to be versioned (authors can just fork from HEAD), but I think there's an open question about whether it should only depend on tagged/released `tektoncd/pipeline` code, or on HEAD (probably the former?)
> A package of helper methods in `tektoncd/pipeline`, versioned with Tekton releases
when you say helpers, what kinda stuff do you have in mind?
`FilterRunRef`, which is used by a controller to request a filtered list of `Run` updates for only those that match their specified `apiVersion`+`kind`.

And some reflect-based param-placeholder-replacing helper that would likely either reuse large chunks of existing Tekton code, or be so generic as to be directly usable for TaskRun/PipelineRun placeholder resolving.
Beyond that nothing jumps to mind, but we might come up with more in the course of writing more custom tasks.
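To give a flavor of the param-replacement idea: the real helper would walk arbitrary structs via reflection, but this flat string version shows the substitution contract. Every name here is illustrative, not an actual Tekton API.

```go
// Sketch: substitute $(params.<name>) placeholders in a string, the core of
// what a reflect-based helper would apply across a whole spec.
package main

import (
	"fmt"
	"strings"
)

func applyParams(s string, params map[string]string) string {
	for name, value := range params {
		s = strings.ReplaceAll(s, fmt.Sprintf("$(params.%s)", name), value)
	}
	return s
}

func main() {
	out := applyParams("sleep $(params.duration)", map[string]string{"duration": "10s"})
	fmt.Println(out) // sleep 10s
}
```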
Interesting! Do you see the Pipelines controller using FilterRunRef? It seems kinda like it might only be used by the custom controller 🤔
I'm actually starting to wonder if it will make sense in the long run to have the Run type and supporting libs like this together in the separate repo, imported by tekton pipelines
(not something we have to decide right away but just wondering)
I don't see Tekton's own controllers using `FilterRunRef`, since if any part of Tekton cares about `Run`s (tbd, see other thread) then it will care about all of them, and not just those of a given ref-type.

The main reason for putting it in the `tektoncd/pipeline` repo is to avoid repo proliferation and interdependency hell. We could have a `tektoncd/pkg` that only provides helper methods for custom task authors, but then it would basically just have these two methods, and since the param-replacement method would hopefully be used by `tektoncd/pipeline` itself, it would have to import that repo, which means it would have to be released/versioned in cadence with `tektoncd/pipeline`, and I figured ugh, let's just put it in the main repo directly. Also setting up CI and OWNERS for this new repo, ugh.
FWIW the PR to add `FilterRunRef` to `tektoncd/pipeline` is here: tektoncd/pipeline#2915
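For context, a sketch of how a custom controller might use a `FilterRunRef`-style predicate to ignore Runs that reference other custom task types. The signature mirrors the one discussed here but is reproduced as an assumption (check the linked PR for the real location and signature), and the toy `Run` type stands in for the actual CRD.

```go
// Sketch: a predicate a controller would install as its informer
// event-handler filter so it only reconciles its own Runs.
package main

import "fmt"

type RunRef struct{ APIVersion, Kind, Name string }

type Run struct{ Ref *RunRef }

// FilterRunRef returns a predicate matching only Runs whose ref has the
// given apiVersion and kind.
func FilterRunRef(apiVersion, kind string) func(interface{}) bool {
	return func(obj interface{}) bool {
		r, ok := obj.(*Run)
		return ok && r.Ref != nil &&
			r.Ref.APIVersion == apiVersion && r.Ref.Kind == kind
	}
}

func main() {
	onlyCEL := FilterRunRef("custom.tekton.dev/v1alpha1", "CEL")
	fmt.Println(onlyCEL(&Run{Ref: &RunRef{APIVersion: "custom.tekton.dev/v1alpha1", Kind: "CEL"}})) // true
	fmt.Println(onlyCEL(&Run{Ref: &RunRef{Kind: "Wait"}}))                                          // false
}
```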
* Support for "unnamed" Tasks -- i.e., `Run`s that reference an `apiVersion` and `kind`, but not a `name`. A Custom Task author could either use this to provide "default" behavior where a Task CR doesn't need to be defined, or could not define a CRD at all and only support functionality specified by params. Examples of this are `CEL` and `Wait` tasks that just accept a param for `expression` or `duration`, and don't require defining a `CEL` or `Wait` CRD type.
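A sketch of what handling such an unnamed ref might look like in a hypothetical `Wait` controller; all types and semantics here are illustrative, not a real Tekton API:

```go
// Sketch: a Run with apiVersion+kind but no name, where the controller
// derives all behavior from params instead of a CRD instance.
package main

import (
	"fmt"
	"time"
)

type RunRef struct{ APIVersion, Kind, Name string }

type Run struct {
	Ref    RunRef
	Params map[string]string
}

// reconcileWait handles a name-less Wait ref: no Wait CR exists, so the
// duration comes entirely from the Run's params.
func reconcileWait(r Run) (time.Duration, error) {
	if r.Ref.Name != "" {
		return 0, fmt.Errorf("expected an unnamed ref, got name %q", r.Ref.Name)
	}
	return time.ParseDuration(r.Params["duration"])
}

func main() {
	d, err := reconcileWait(Run{
		Ref:    RunRef{APIVersion: "custom.tekton.dev/v1alpha1", Kind: "Wait"},
		Params: map[string]string{"duration": "10s"},
	})
	fmt.Println(d, err) // 10s <nil>
}
```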
i think we should include this - i suppose it doesn't have to be in the first iteration
Yeah in general I agree, but it hasn't been discussed any more widely than between you and me, so I wanted to keep it as an open question in case others had feedback.
We've merged a PR associated with this; we should probably merge this with `implementable` status @vdemeester @afrittoli?
Indeed, it is implementable. We may need to update when we have answers to the open questions, but as is, `implementable` seems good 👌
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
As described in #142
The remaining open questions are relatively small and should not block initial implementation steps, as they can be discussed before implementing, or during review of implementation PRs.
/assign @afrittoli
/assign @vdemeester