
Mark TEP 0002 as implementable #143

Merged · Jul 9, 2020 · 1 commit

Conversation

@imjasonh (Member) commented on Jul 7, 2020

As described in #142

The remaining open questions are relatively small and should not block initial implementation steps, as they can be discussed before implementing, or during review of implementation PRs.

/assign @afrittoli
/assign @vdemeester

@tekton-robot added the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files.) on Jul 7, 2020
@vdemeester (Member):

> The remaining open questions are relatively small and should not block initial implementation steps, as they can be discussed before implementing, or during review of implementation PRs.

We need to list those questions somewhere though :) to make sure we don't forget to answer them 😉

Document open questions that should be resolved before/during implementation
@tekton-robot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) and removed the size/XS label on Jul 7, 2020
@imjasonh (Member Author) commented on Jul 7, 2020

> We need to list those questions somewhere though :) to make sure we don't forget to answer them 😉

Good call. Added to the bottom of the TEP: https://github.com/tektoncd/community/pull/143/files#diff-05e21e3acc67d41d85eb757b7c95ec9e

(let me know if I missed any!)

@bobcatfish (Contributor) left a comment:

I approve of this! Not sure how many of us need to approve - @vdemeester took an action to look into a policy around this - @vdemeester maybe we can start with using the same policy we have in the pipelines repo? https://github.com/tektoncd/pipeline/blob/master/api_compatibility_policy.md#approving-api-changes

(if not, we should update that policy also)


# Open Questions

* Should Tekton's controller be responsible for updating `Run`s' `.status.conditions` in the case of cancellation and timeout (as it does when enforcing [initial update timeout](#initial-update-timeout)), or should these updates be the sole responsibility of Custom Task authors?
Contributor:

ooo ooo ooo i have an opinion here

how about following the same pattern as we do with TaskRuns and PipelineRuns: we require a "cancel" field in the spec of the Run, and Tekton controller is responsible for setting that field

having more than one controller potentially operating on the status feels like a recipe for race conditions and confusing behavior, but this way the responsibilities are clear: the Tekton controller can express things it wants via the spec, and the status expresses what actually happens (as realized by the custom controller)
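
(As a purely illustrative sketch of that split, assuming hypothetical stand-in types rather than the real Run API -- Tekton-core writes only the spec, and the custom controller is the only writer of the status:)

```go
package main

import "fmt"

// Hypothetical shapes loosely mirroring the fields discussed in the TEP;
// these are NOT the real tektoncd/pipeline types.
type Condition struct {
	Type    string // e.g. "Succeeded"
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

type Run struct {
	SpecStatus string      // stands in for .spec.status, e.g. "Cancelled" (written by Tekton's controller)
	Conditions []Condition // stands in for .status.conditions (written only by the custom controller)
}

// reconcile sketches the custom controller's half of the contract: Tekton
// expresses desire via the spec, the custom controller reports reality via status.
func reconcile(r *Run) {
	if r.SpecStatus == "Cancelled" {
		// Stop any in-flight work (omitted), then report the outcome.
		r.Conditions = []Condition{{
			Type:    "Succeeded",
			Status:  "False",
			Reason:  "RunCancelled",
			Message: "Run was cancelled via .spec.status",
		}}
		return
	}
	// ... normal reconciliation of the custom task goes here ...
}

func main() {
	r := &Run{SpecStatus: "Cancelled"}
	reconcile(r)
	fmt.Printf("%+v\n", r.Conditions[0])
}
```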

Member Author:

The real crux of the question is how a Pipeline should react to a badly-behaving Custom Task controller.

In the TEP today, I think we generally agree that the initial update timeout is reasonable -- please let me know if you disagree. This guards against entirely absent controllers (e.g., they're not installed, or they're crash-looping). A PipelineRun that includes a Run that we think is never going to see an update should just force-fail the offending Run.

The open question is how we should respond when a Run is cancelled (.spec.status=Cancelled) but hasn't reported that it's finished (.status.conditions[Succeeded]=False). When a PipelineRun is cancelled it should propagate .spec.status=Cancelled, but if it doesn't also set the .status, does it wait for the external controller to update it? How long should it wait? Having this cancelling-but-not-yet-finished state adds a possibly-confusing new node to the state diagram, and introduces room for bugs.

Same question for timeout -- what should the PipelineRun controller do when it can see one of its Runs has been running longer than its defined .spec.timeout, but hasn't reported that it's finished yet? Just wait? How long? What if the external controller isn't listening anymore?

My personal opinion, given the above, is that Tekton should take responsibility for as much as it can, since it removes room for undefined behavior in the face of badly-behaving external controllers. It'll already be responsible for force-failing "abandoned" Runs. It can also take responsibility for reacting to .spec.status=Cancelled by also setting .status.conditions, and for noticing when .spec.timeout has elapsed and setting .status.conditions accordingly. This prevents the case where a controller never marks a Run as complete -- eventually, the timeout will be enforced by Tekton, and PipelineRun execution can at least proceed.

The downside, as you've noted, is distributed responsibility, and therefore race conditions. I think one solution to this is enforcing (via webhook validation) that a Run's .status.conditions cannot transition from finished to ongoing -- this will prevent a laggy external controller from reporting a Run that was previously considered complete to now be ongoing. If we don't prevent that transition, what should a PipelineRun controller do when it observes that nonsense transition anyway? I think we should also prevent state transitions from one flavor of finished to another (e.g., prevent a Run that's Successful from becoming Failed) for the same reason -- if a PipelineRun has reported that it's finished successfully, how should it react when its constituent Runs no longer all say they completed successfully?

Ultimately I think it's a judgement call between trusting external controllers to be fairly well-behaved (something we can help along with well-behaved forkable samples), or not completely trusting them and coding defensively around the possibility that they'll do something wonky.
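
(For concreteness, a rough sketch of the kind of Tekton-core-side checks being proposed here, with simplified hypothetical fields and an arbitrary placeholder for the initial-update timeout -- the exact behavior and timings are what's under discussion:)

```go
package main

import (
	"fmt"
	"time"
)

// Simplified, hypothetical view of a Run as the PipelineRun reconciler might see it.
type RunView struct {
	CreationTime  time.Time
	Timeout       time.Duration // stands in for .spec.timeout (zero means "none")
	SpecCancelled bool          // stands in for .spec.status == "Cancelled"
	Done          bool          // .status.conditions[Succeeded] is True or False
	StatusUpdated bool          // the custom controller has updated .status at least once
}

// The initial-update timeout guards against absent or crash-looping controllers;
// the duration here is an arbitrary placeholder, not the TEP's value.
const initialUpdateTimeout = 1 * time.Minute

// shouldForceFail sketches the enforcement described above: Tekton-core fails a
// Run whose controller never shows up, whose timeout elapsed, or whose
// cancellation isn't acknowledged, so the PipelineRun can proceed.
func shouldForceFail(r RunView, now time.Time) (bool, string) {
	if r.Done {
		return false, ""
	}
	switch {
	case !r.StatusUpdated && now.Sub(r.CreationTime) > initialUpdateTimeout:
		return true, "no status update within the initial-update timeout"
	case r.Timeout > 0 && now.Sub(r.CreationTime) > r.Timeout:
		return true, ".spec.timeout elapsed without the Run finishing"
	case r.SpecCancelled:
		return true, "cancelled, but the custom controller hasn't reported completion"
	}
	return false, ""
}

func main() {
	r := RunView{CreationTime: time.Now().Add(-2 * time.Minute), Timeout: time.Minute, StatusUpdated: true}
	fmt.Println(shouldForceFail(r, time.Now()))
}
```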

Contributor:

> The open question is how we should respond when a Run is cancelled (.spec.status=Cancelled) but hasn't reported that it's finished (.status.conditions[Succeeded]=False). When a PipelineRun is cancelled it should propagate .spec.status=Cancelled, but if it doesn't also set the .status, does it wait for the external controller to update it? How long should it wait?

Couple of other ideas:

  • Have a timeout on how long after setting .spec.status=Cancelled we expect .status.conditions to be updated (ew more timeouts)
  • Don't wait at all: mark .spec.status=Cancelled, don't wait for the status.condition to get updated (might be reasonable to do the same in the taskrun case as well) - this has some downsides in that the pipelinerun would be considered done while something is still running, but i think it might be easier to reason about
  • Be okay with a poorly behaved controller keeping a PipelineRun running indefinitely: this seems okay to me - if you made a bug in your controller such that you aren't cancelling correctly, forcing the status into cancelled won't necessarily solve the problem (there's no guarantee the custom controller would do what we expect when this happens)

Member Author:

> Have a timeout on how long after setting .spec.status=Cancelled we expect .status.conditions to be updated (ew more timeouts)

Yeah, I'm currently at "ew more timeouts" but I'd love to hear more feedback. In general having a state between "running" and "fully cancelled" feels like it will be confusing to users, and confusing to our own code.

> Don't wait at all: mark .spec.status=Cancelled, don't wait for the status.condition to get updated (might be reasonable to do the same in the taskrun case as well) - this has some downsides in that the pipelinerun would be considered done while something is still running, but i think it might be easier to reason about

This also adds a new state -- .spec.status=Cancelled-but-.status.conditions-is-ongoing-and-PipelineRun-ignores-it. Having a state where a PipelineRun willfully ignores a Run's .status.conditions feels like it will make things harder to reason about; today we consider .status.conditions as the definitive answer about whether something is ongoing/blocking, and that's been nice so far.

> Be okay with a poorly behaved controller keeping a PipelineRun running indefinitely: this seems okay to me - if you made a bug in your controller such that you aren't cancelling correctly, forcing the status into cancelled won't necessarily solve the problem (there's no guarantee the custom controller would do what we expect when this happens)

The problem there is that it's not the controller author that has a bad time, it's the user left wondering why their PipelineRun is spinning forever, and they might think it's Tekton's fault and not their external task implementor's fault. (And they'd kind of be right -- we let the external task author provide them a bad experience!)

Forcing the Run to cancelled doesn't necessarily propagate that signal downstream -- we still can't do anything about that -- but it at least keeps the PipelineRun on track doing what it should be doing. And the lack of downstream cancel propagation is easier to blame on the custom task author, since Tekton's done all it can to send that signal and keep the PipelineRun chugging along.

Contributor:

I'm convinced about the first 2 options!

> The problem there is that it's not the controller author that has a bad time, it's the user left wondering why their PipelineRun is spinning forever, and they might think it's Tekton's fault and not their external task implementor's fault. (And they'd kind of be right -- we let the external task author provide them a bad experience!)

We aren't responsible for the experience that the external task author gave - i think that's the crux of this debate: imo by opening up this flexibility, it's by design not Tekton's responsibility. We define the expected behavior, we can even provide tests to help verify this contract, but I think working around a buggy implementation is beyond the scope of what Tekton should do.

If the PipelineRun keeps spinning forever b/c the custom task controller doesn't update the status, i think that's an accurate representation of the state of the world.

Member Author:

That's fair, and I think that's a good distillation of the central issue.

Ultimately the well-behavedness logic is going to go somewhere. It'll either go in Tekton-core, or in the sample-task implementation, which we hope/expect most integrators to use to get started. If it's in Tekton-core we can be more sure that all controllers are well-behaved, and if it goes in sample-task we can just say, if you didn't build on our preferred/blessed base, or if you changed it to remove the training wheels, well that's on you. I'm pretty okay with either, to be honest. 🤷‍♂️

The question then becomes, do we want any sanity checks enforced by Tekton-core? Under the current TEP, Tekton will do an initial-update check, which guards against totally absent/crashing controllers. That's easy enough to drop, but then how should users diagnose a perma-spinning PipelineRun? Check that the custom-task controller pod isn't in CrashLoopBackOff?

We could also enforce valid state transitions, e.g., that a Run shouldn't be able to:

  • transition from done -> running
  • transition from success <-> failure
  • transition from cancelled (.spec.status=Cancelled) to un-cancelled
  • change its Reason (or Message?) after it's finished

...and if such a transition ever did happen, how should a PipelineRun behave? Preventing these states seems simpler to me than figuring out how to handle them. (These could also be prevented for TaskRuns and PipelineRuns, but today... aren't 🤷‍♂️ -- this could prevent some bugs and race conditions in Tekton-core today)
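
(A rough sketch of what enforcing those rules could look like in a validating webhook, using simplified stand-in types rather than the real API:)

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-in for a Run's observable state; not the real API types.
type RunState struct {
	Succeeded string // "True", "False", or "Unknown" (still running)
	Reason    string
	Cancelled bool // whether .spec.status == "Cancelled"
}

func isDone(s RunState) bool { return s.Succeeded == "True" || s.Succeeded == "False" }

// validateTransition sketches the checks a validating webhook could apply to
// every Run update, rejecting the nonsense transitions listed above.
func validateTransition(oldSt, newSt RunState) error {
	switch {
	case isDone(oldSt) && !isDone(newSt):
		return errors.New("a finished Run cannot go back to running")
	case isDone(oldSt) && oldSt.Succeeded != newSt.Succeeded:
		return errors.New("a Run cannot flip between success and failure")
	case oldSt.Cancelled && !newSt.Cancelled:
		return errors.New("a cancelled Run cannot be un-cancelled")
	case isDone(oldSt) && oldSt.Reason != newSt.Reason:
		return errors.New("a finished Run's reason cannot change")
	}
	return nil
}

func main() {
	err := validateTransition(
		RunState{Succeeded: "True", Reason: "AllDone"},
		RunState{Succeeded: "False", Reason: "AllDone"},
	)
	fmt.Println(err) // rejects the success -> failure flip
}
```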

teps/0002-custom-tasks.md

* Package name and helper methods included in `tektoncd/pipeline` repo to aid Custom Task authors writing their controllers in Go; and should we expect them to use `knative/pkg` helpers too?

* Versioning and release cadence and ownership of `tektoncd/sample-task` repo; will it be released/versioned alongside `tektoncd/pipeline`?
Contributor:

is sample-task different from the package name and helper methods above?

i suggest we start releasing and versioning it separately to get a feel for when it actually needs updates and then decide based on what we see happening (if there is no dependency b/w the two, coupling them seems arbitrary)

Member Author:

Yeah, there are two things I'd like us to provide:

  1. A package of helper methods in tektoncd/pipeline, versioned with Tekton releases
  2. A sample-task template repo that's forkable, and can give authors a good starting point to build new custom tasks. This would necessarily use some of the helpers in (1). It probably doesn't need to be versioned (authors can just fork from HEAD), but I think there's an open question about whether it should only depend on tagged/released tektoncd/pipeline code, or from HEAD (probably the former?)

Contributor:

> A package of helper methods in tektoncd/pipeline, versioned with Tekton releases

when you say helpers, what kinda stuff do you have in mind?

Member Author:

FilterRunRef, which is used by a controller to request a filtered list of Run updates for only those that match their specified apiVersion+kind

And some reflect-based param-placeholder-replacing helper that would likely either reuse large chunks of existing Tekton code, or be generic enough to be usable directly for TaskRun/PipelineRun placeholder resolution as well.

Beyond that nothing jumps to mind, but we might come up with more in the course of writing more custom tasks.
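
(The real helper lives in the tektoncd/pipeline PR mentioned further down this thread; as a self-contained illustration of the idea only, it boils down to building a predicate a custom controller can hand to its informer so it only sees Runs that reference its own type -- the types and signature below are hypothetical stand-ins:)

```go
package main

import "fmt"

// Hypothetical minimal Run shape; the real type lives in tektoncd/pipeline.
type RunRef struct{ APIVersion, Kind string }
type Run struct{ Ref *RunRef }

// filterRunRef sketches the idea behind FilterRunRef: build a predicate a
// custom controller can use as an informer event filter so it is only woken
// up for Runs that reference its own apiVersion and kind.
func filterRunRef(apiVersion, kind string) func(interface{}) bool {
	return func(obj interface{}) bool {
		r, ok := obj.(*Run)
		if !ok || r.Ref == nil {
			return false
		}
		return r.Ref.APIVersion == apiVersion && r.Ref.Kind == kind
	}
}

func main() {
	onlyWaits := filterRunRef("example.dev/v0", "Wait")
	fmt.Println(onlyWaits(&Run{Ref: &RunRef{"example.dev/v0", "Wait"}})) // true
	fmt.Println(onlyWaits(&Run{Ref: &RunRef{"example.dev/v0", "CEL"}}))  // false
}
```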

Contributor:

Interesting! Do you see the Pipelines controller using FilterRunRef? It seems kinda like it might only be used by the custom controller 🤔

I'm actually starting to wonder if it will make sense in the long run to have the Run type and supporting libs like this together in the separate repo, imported by tekton pipelines

(not something we have to decide right away but just wondering)

Member Author:

I don't see Tekton's own controllers using FilterRunRef, since if any part of Tekton cares about Runs (tbd, see other thread) then it will care about all of them, and not just those of a given ref-type.

The main reason for putting it in the tektoncd/pipeline repo is to avoid repo proliferation and interdependency hell. We could have a tektoncd/pkg that only provides helper methods for custom task authors, but then it would basically just have these two methods, and since the param-replacement method would hopefully be used by tektoncd/pipeline itself, tektoncd/pipeline would have to import that repo, which means the new repo would have to be released/versioned in cadence with tektoncd/pipeline -- so I figured, ugh, let's just put it in the main repo directly. Also, setting up CI and OWNERS for a new repo, ugh.

FWIW the PR to add FilterRunRef to tektoncd/pipeline is here: tektoncd/pipeline#2915



* Support for "unnamed" Tasks -- i.e., `Run`s that reference an `apiVersion` and `kind`, but not a `name`. A Custom Task author could either use this to provide "default" behavior where a Task CR doesn't need to be defined, or could not define a CRD at all and only support functionality specified by params. Examples of this are `CEL` and `Wait` tasks that just accept a param for `expression` or `duration`, and don't require defining a `CEL` or `Wait` CRD type.
Contributor:

i think we should include this - i s'pose it doesn't have to be in the first iteration

Member Author:

Yeah, in general I agree, but it hasn't been discussed any more widely than between you and me, so I wanted to keep it as an open question in case others had feedback.
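
(For illustration, a tiny sketch of what an "unnamed" Run could look like, with hypothetical simplified types: the ref carries only an apiVersion and kind, and all input comes from params:)

```go
package main

import "fmt"

// Hypothetical, simplified shapes for illustrating the "unnamed" case; not the real API.
type Param struct{ Name, Value string }
type Ref struct{ APIVersion, Kind, Name string }
type RunSpec struct {
	Ref    Ref
	Params []Param
}

func main() {
	// A Run that references only an apiVersion and kind: there is no named CR
	// to look up, so the custom controller takes all of its input from params.
	// A "Wait" task could then work without any Wait objects existing at all.
	wait := RunSpec{
		Ref:    Ref{APIVersion: "example.dev/v0", Kind: "Wait"}, // Name deliberately left empty
		Params: []Param{{Name: "duration", Value: "10m"}},
	}
	fmt.Printf("%+v\n", wait)
}
```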

@bobcatfish (Contributor):

We've merged a PR associated with this, so we should probably merge this with implementable status. @vdemeester @afrittoli?

@vdemeester (Member) left a comment:

Indeed, it is implementable. We may need to update it once we have answers to the open questions, but as is, implementable seems good 👌

@tekton-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jul 9, 2020
@bobcatfish (Contributor):

/lgtm

@tekton-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Jul 9, 2020
@tekton-robot merged commit c33e054 into tektoncd:master on Jul 9, 2020