-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0040 - ignore step error #302
Conversation
/kind tep |
7141d27
to
91eafc4
Compare
I'd really like to take a look at this before we merge it - I think there's a lot of overlap with the use case listed here and the goals we have for the PipelineResources revamp (specifically around task specialization) /assign bobcatfish |
The code of the PoC is at https://github.com/afrittoli/pipeline/commits/tep_0040 (the last two commits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about the problem(s) we are trying to address here. There are a few related problems and it feels like this is kinda starting to address but it's not totally clear:
- Allow steps to fail (Add a field to Step that allows it to ignore failed prior Steps *within the same Task* pipeline#1559) - progress on this one has been stalled since @imjasonh pointed out it's not clear that a single task is the best solution and a pipeline with multiple Tasks might make more sense: Add a field to Step that allows it to ignore failed prior Steps *within the same Task* pipeline#1559 (comment)
- "off the shelf" images <-- this seems like an interesting problem on it's own and is the motivation behind TEP 11. id really like to understand some examples of this a bit better
- It's interesting to me that we are talking about providing the exit code itself - up until this point, and in proposals like TEP 28 we've discussed failure vs success but never at the level of the error code before. Why is it important to this proposal, or is this just a case of wanting to understand if a step has failed?
- I'm a bit confused about what level we want this at: is it important that steps within a Task be allowed to fail and other steps can see their status, or is it important that Tasks within a Pipeline be able to fail and provide their exit code as part of their failure? (some use cases might help clear this up) <-- if the main motivation here is to allow tasks in a pipeline to fail and for something to act on that failure, id like to understand more about why finally Tasks dont do the trick
- The use case listed in this TEP (reporting the failure of unit tests) doesn't seem well suited to this solution: a pipeline with a finally task seems like it could do the same job? baking everything into one task interferes with the task's resusability (and is one of the motivations of pipelineresources and task specialization
I think the key thing is I'm trying to understand what this provides that we don't get with finally tasks + TEP 28.
teps/0040-capture-step-exit-code.md
Outdated
an option to capture non-zero exit code. The pipeline author can choose to continue execution after capturing | ||
the non-zero exit code and make it available to access it after the execution. | ||
|
||
Issue: [tektoncd/pipeline#2800](https://github.com/tektoncd/pipeline/issues/2800) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might also be covering tektoncd/pipeline#1559 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, even though the approach there is slightly different. The focus on that one is to let a step run even though there was a previous failure. Here is on letting the potentially failing step declare that the failure should be captured, so other steps don't need to be aware about the failure at all.
teps/0040-capture-step-exit-code.md
Outdated
As a pipeline author, I would like to design a task with multiple steps. One of the steps is running an | ||
enterprise image to run unit tests, and the next step needs to report test results even after a previous | ||
step results in failure due to tests failure. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you go into more detail with the use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One example could be the nightly release pipeline. As part of the release pipeline we might want to run a security scan, unit tests and other checks. None of this checks should stop the release pipeline from executing, however we want to run the tests so that we keep a track of which failures we saw against which version.
This can be sometime achieved by wrapping the step that runs the tests with a script, but in other cases it would require to create a new container image that contains a shell. In cases where wrapping with a script is possible, it may still be undesirable.
This use case is strengthened by the case of users migrating to Tekton. They may have scripts / steps from their previous CI system where it was possible to ignore a failure, and they would have to modify them and/or wrap them to trap the exit code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From our discussion the other day: the nightly release pipeline example is kind of different b/c that's about allowing an individual Task to fail; this proposal is about allowing steps within Tasks to fail. If we wanted to apply this proposal to our nightly release pipeline for example, we'd have to modify the golang-test
task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having to modify a Task
is the main downside of this approach, we should capture it in the TEP.
The use case I'm facing is to migrate users to Tekton, they already have tasks written, and they need to be moved to Tekton, so it is not an issue in this case.
A possible workaround for existing Tasks, they could expose the capture flag through a task parameter, to allow the Task to use in both modes, with and without capture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"we'd have to modify the golang-test task" ... wow I somehow hadn't considered that. All of our legacy cases are with teams providing an "image" that when run might "fail", but in Tekton where we're promoting a Task Catalog teams might provide a Task that we should equivalently let a Pipeline decide if it should fail. I think maybe we should consider support ignoring Task failure in this TEP or at least immediately create another TEP to look at adding support.
teps/0040-capture-step-exit-code.md
Outdated
## Motivation | ||
|
||
It should be possible to easily use off-the-shelves (OTS) images as steps in Tekton tasks. A task author has no | ||
control on the image but may desire to capture an error and continue executing the pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you provide some examples of images that cannot be used today? or what it would look like to try to use them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the main issue with these off-the-shelf images that they sometimes don't have a shell? this was the motivation for TEP-0011 as well
(if the images have a shell, you can write your own script to do whatever you want with the exit code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed, the most difficult case is that of images that do not include a shell.
The wrapping script feels to me like a workaround though, I think it would be a nicer user experience if we allowed natively to trap the exit code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another detail we could include in the TEP: many projects have to create "debug" versions of their images to support users that want shells; in CD automation especially there are so many scenarios where you do need shells that often folks end up just having to always use the debug version
The wrapping script feels to me like a workaround though
could you explain a bit more? It feels to me like being able to run a script that runs the command that needs to be run gives the user the most flexibility, e.g. they can grab the exit code, they can grab stdout + stderr, they can do simple processing like grep and awk, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Container images are not necessarily written with a wrapping script in mind.
If an image sets an entrypoint, it is usually the recommended way to use the image. To write a script that wraps the container, we need to inspect the container image, find out what the entry point is, and run the command in our script the same way the entry point would have, which is not something we should ask Task author to be doing. If the entrypoint changes, the task will have to be updated too.
We could have a mechanism that fetches the entrypoint from the docker image automatically and runs it... oh, wait, we already do, it's Tekton's entrypoint. This TEP proposes to add a flag in the entrypoint to capture the exit code - I don't think we should do this instead in a shell script that will be executed by entrypoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk - maybe we can discuss this more when we start looking at possible designs
teps/0040-capture-step-exit-code.md
Outdated
Steps are executed in order in which they are specified. One single step failure results in a task failure | ||
in turn results in a pipeline failure. Once a step results in failure, rest of the steps are not executed. | ||
|
||
Many common pipelines have requirement where a step failure must not halt the entire execution. In order to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im a bit confused about whether this problem exists at the pipeline level or at the step level, is it possible to dig into it a bit more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failure originates in a step and propagates to the entire pipeline.
In terms of granularity, the proposal of this TEP is to trap the exit code for specific steps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we have two separate scenarios, maybe we want to handle both, but I'm not convinced both apply to the use cases we're discussing:
- I want to stop the failure of a Task in a Pipeline from stopping the execution of a Pipeline (and maybe stop it from failing as well)
- I want to stop the failure of a step in a Task from failing the Task
The failure originates in a step and propagates to the entire pipeline.
Right, there is a relationship between the two - you could potentially use (2) to give you (1) as well, but (2) doesn't let you have any control over this at the pipeline level, it has to be baked into the Task. I'd argue that to truly satisfy (1) we'd want that control at the pipeline level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might eventually want support for ignoring both "step" and "task" failure.
Thanks @bobcatfish for thorough review 😻
My understanding is taskrun and pipelinerun execution status are of type conditions (implementing duck type) vs step execution states are limited to container states with TaskRun status:
TaskRun status with a list of steps:
The duck typed conditions are not available at the step level (we could implement them if needed), and therefore exit codes are so important to this proposal. A non zero exit code of a container means a step failed. Failed step with |
It is important that steps within a task be allowed to fail and other steps in the same task can access the exit code. This proposal is limited to step failure and does not propose allowing a task to fail. A step capturing exit code ignores step failure and declares that step as a success (exit with zero) which results in successful task and continues executing rest of the tasks in a pipeline. With this proposal, a task failure cannot be ignored without capturing a step exit code and a single task failure still results in pipeline failure. |
Maybe we can dig into this a bit more. This feels to me like a few related but different features:
( @afrittoli 's demo seemed to go a bit further than 3 by also exposing the exit code as a result of the Task but it sounds like that might be beyond the scope of this TEP? in the issue tektoncd/pipeline#2800 @afrittoli mentions "It may be desirable for the pipeline to continue even if there was a failure." so I feel like there might be a bit of a mismatch between this TEP and that issue 🤔 ) Before we get into (2) and (3) I'd like to understand a bit better why we need (1). At a glance it makes sense, however once we start to dig into it (e.g. @imjasonh 's comment tektoncd/pipeline#1559 (comment)) it seems like the use cases motivating this are much better suited to a Pipeline with multiple Tasks. If we can identify some use cases that need (1) and why a Pipeline with multiple Tasks doesn't work for them I think that will really help. |
...and ultimately, of the whole pipeline, since today a step failure implies failure of the task and of the overall pipeline
In fact what this TEP provides is a way to trap or capture the failure, so the step does not actually fail from a Tekton POV.
The reason I added the original exit code as a result in the PoC is because normally the exit code would be available in the step status. I wanted to provide a way to still make it available even when it's captured. In fact the bit that is strictly in scope for this TEP is to expose whether the step would have failed or not, to allow for external systems to identify this cases and bring them to the attention of users.
A pipeline with multiple tasks would solve the issue because a task failure would still cause the pipeline to fail. If I want to:
and I can only capture errors at Task level, I need to run the tests in a task and push the logs in a different ones, which means I need to copy logs and results to some pipeline storage, download them from there and push them to the final storage, which means three times as much I/O. |
Thanks for the extra info @afrittoli !
Ah okay interesting! So my questions about this use case are:
Another way we could tackle this is via task specialization, i.e. if we provide a way to specify "wrapping" tasks (like in this example where there is a "finally" clause for an individual Task) - this way you could write one Task that runs the test, and another Task that pushes the results and logs, and (theoretically) execute all of that within the same pod. From the requirements you described, the only missing bit is whether the failure of the Task fails the entire Pipeline or not. (Totally understand if you don't agree with that route to tackle this, but I want to make sure we really understand what problem we're solving exactly - it kinda sounds like the main problem is: once any step in a Task fails, if you want to act on that failure, you have to do it in a separate pod?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to discuss this with me yesterday @afrittoli and @pritidesai ! I've added some comments with updates we could make the TEP to flesh it out based on the discussion.
My biggest overall request is that we try to frame this problem clearly without referring to the solution (yet!)
teps/0040-capture-step-exit-code.md
Outdated
- '@afrittoli' | ||
--- | ||
|
||
# TEP-0040: Capture Step Exit Code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This title feels to me like it jumps to an implementation but doesn't capture what's really motivating this.
Some ideas for alternative names: (sorry they are so wordy XD)
- Support mulitstep Tasks when some steps have no shell
- Allow Tasks to continue to act after the failure of a step
- Allow steps to fail without failing or halting Tasks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup I agree, I think "Allow steps to fail without failing Tasks" fits well as the biggest motivation here is to continue executing subsequent steps and preventing the task from failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
teps/0040-capture-step-exit-code.md
Outdated
As a pipeline author, I would like to design a task with multiple steps. One of the steps is running an | ||
enterprise image to run unit tests, and the next step needs to report test results even after a previous | ||
step results in failure due to tests failure. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From our discussion the other day: the nightly release pipeline example is kind of different b/c that's about allowing an individual Task to fail; this proposal is about allowing steps within Tasks to fail. If we wanted to apply this proposal to our nightly release pipeline for example, we'd have to modify the golang-test
task
teps/0040-capture-step-exit-code.md
Outdated
## Motivation | ||
|
||
It should be possible to easily use off-the-shelves (OTS) images as steps in Tekton tasks. A task author has no | ||
control on the image but may desire to capture an error and continue executing the pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another detail we could include in the TEP: many projects have to create "debug" versions of their images to support users that want shells; in CD automation especially there are so many scenarios where you do need shells that often folks end up just having to always use the debug version
The wrapping script feels to me like a workaround though
could you explain a bit more? It feels to me like being able to run a script that runs the command that needs to be run gives the user the most flexibility, e.g. they can grab the exit code, they can grab stdout + stderr, they can do simple processing like grep and awk, etc.
teps/0040-capture-step-exit-code.md
Outdated
enterprise image to run unit tests, and the next step needs to report test results even after a previous | ||
step results in failure due to tests failure. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a requirements section as well, e.g.:
Requirements
- Users should be able to use prebuilt images as-is without having to do one or more of the following (see also TEP-0011:
- Investigating how they are built to understand if they contain a shell and possibly overriding the entrypoint
- Build and maintain their own images (i.e. add in required shell or other binaries) from those images
- It should be possible to know that a step was allowed to fail by observing the status of the TaskRun (and PipelineRun if applicable) (e.g. to show a "warning" / display as "yellow" status in a UI)
- When a step is allowed to fail, the exit code of the process that failed should not be lost and should at a minimum be available in the status of the TaskRun (and PipelineRun if applicable)
- When a step is allowed to fail, the Task should not be marked as failed. <-- im a bit unsure about this one @afrittoli - if we want to do this, I think we need to make it possible for other steps to know that this happened and determine if the overall status should be failed, what do you think? (e.g. in the case of uploading test results, you want to keep running steps so you can do the upload, but still the overall Task has failed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bumping this
@pritidesai I slightly disagree on this last point specifically. I agree it could be interesting for steps to have access to other step outcomes, but that would be a different TEP with different use cases. For this TEP specifically, we should not plan on exposing outcome of steps within a running pipeline. |
/assign vdemeester |
/assign |
My sticking point @afrittoli is that if we don't allow other steps to see the status of the step that was allowed to fail, the Task will never be able to fail (or it would require significant workarounds to make it happen). For example in the use case where I want to run tests, then upload results, I think there is a good chance that a user will still want the overall Task to fail, they just want to be able to upload results before that happens. |
That's a good point. This TEP will work well in combination with that about capturing stdout, since with both it will be possible for the next step to always run and have access to the output, even when the original docker image did not support sending the output to a file via args. |
91eafc4
to
bc9ad7b
Compare
thanks @bobcatfish and @afrittoli for all the reviews. I have addressed most of the comments except two: Also, I will add drawback section after adding design proposal 🙃 |
I have expanded the scope of this TEP to include both (1 - task level) and (2 - pipeline level) given that not everything has to be part of the same PR. I can certainly drop (2) from this proposal if you would like. My rationale behind adding (2) is to make the API changes for (1) compatible with the changes we propose for (2). I have added some examples and renamed the proposal to |
025386b
to
2729f73
Compare
I do see value in the proposal and those are valid use case we need to take care of. So I have a probably naive and general question, but given the following proposal (#318 and #316), and the potential of the pipeline-in-pipeline approach (with CustomTask support), what would it mean to support those use case at the Task + Pipeline level only ? Having an option on both might be a bit confusing to users. When to fail, when to not ? I also wonder if this is a concern at the Similar to my comment on #316, I think we need to find where to put the cursor in our composability story and where each option fit. On this proposal, I feel the use case can be both "at authoring time" and "at runtime". Those might be two different problems to tackle but on the authoring part, the question remain, if we were to have a way to say "don't fail the pipeline if this task fail", would we need the same at the task level (with steps) — and my initial thought is we would not.
I wonder thus if we should mention |
2729f73
to
9c3ae5f
Compare
I think that expanding the scope of this TEP to include (2) is going to make it harder to merge - b/c of the potential overlap with the other TEPs that @vdemeester pointed out - i have been convinced that there is a legitimate gap that would address in isolation that specifically applies to: Allowing a step to fail without stopping execution of a Task at Task authoring time e.g. the use case about a platform team which wants to create a reusable Task that runs tests, and transforms their output, whether they fail or not I suggest we tackle (2) separately |
9c3ae5f
to
86f9490
Compare
Sounds good, thanks @bobcatfish @vdemeester 🙏 I have reduced its scope to only address (1): Allowing a step to fail without stopping execution of a Task at Task authoring time |
86f9490
to
dbeda27
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty much ready to approve this - I do want to add one more requirement, not sure if @afrittoli has strong feelings about it.
Also not sure where you stand @vdemeester , with the reduced scope are you happy to go ahead with this, or you want to see the other TEPs through first?
fi | ||
``` | ||
|
||
This workaround does not apply to off-the-shelf container images. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the detailed overview!
in an image and then altering the entrypoint to allow capturing errors. | ||
|
||
* It should be possible to know that a step failed and subsequent steps allowed to continue by observing the status of | ||
the `TaskRun`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also like to include a requirement that other steps within the Task can access this information, I believe @afrittoli disagrees - I'm hoping we can include this as a requirement but also could deliver the feature incrementally such that the initial implementation doesn't need to support this.
I feel strongly that we haven't sufficiently addressed the use cases and problem statement without it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added that (other steps within the task can access the process exit code) in the requirements. @afrittoli please let me know if you still disagree or have any other ideas.
Proposing a tep to ignore step errors and provide an option to continue after capturing the non zero exit code. Also document the container termination state to access it after the pipeline exectution finishes.
dbeda27
to
0b75788
Compare
This is the PoC: https://github.com/afrittoli/pipeline/tree/tep_0040 |
Thanks for all the updates! I'm happy to go ahead with the problem statement!! Will be great to take a look at our options here. /approve We'll need to get @vdemeester to review before we merge tho - he had some thoughts as well /hold |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bobcatfish The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I have one question : if we would go ahead with Injecting a shell into steps, then a user could wrap on any image, would this be necessary ?
I feel it could still be useful but on the other hand, the "shell" injection could also bring some "tool" to wrap the exit code and make it available later on. I am mainly asking this to discuss if the use case needs to be handle by an API field or by some tooling.
I think this would still be useful. Having a shell still requires writing a script which points to the correct entrypoint for the container image, which is more work, more error prone, and it's more brittle as the entrypoint might change in the image. Using a script on the other hand gives more flexibility, as it allows doing output redirection, reading the error code and more. So I think they're both useful features to have. |
Agreed 😉 /lgtm |
Thanks @vdemeester @afrittoli @bobcatfish we have review from @vdemeester, can I remove the hold now? Thanks 🙏 |
/hold cancel |
Proposing a TEP to ignore step error and provide an option to continue after capturing the non zero exit code.
Related issues: