-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0013 for adding a limit to pipeline concurrency #228
Conversation
|
/area tep |
/kind tep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some typos I noticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for putting this together @NikeNano -- suggested some additions and changes to the TEP :)
nit: the tep filename has a typo
Thanks for getting started on this @NikeNano ! I want to also draw your attention to #203 It's not exactly solving the same problem, but in solving the problem you are describing here, we have 2 choices:
The proposal in #203 unlocks the ability to do (2) by making it possible for some external service to decide when a PipelineRun or TaskRun is no longer "pending". I am personally more inclined toward a solution like (2) because that would prevent the Pipelines controller from taking on more responsibilities and also would make it possible for folks to implement their own custom concurrency limitations however they wanted. |
Thanks for all the feedback, I will try to look at it later today or tomorrow and comeback with the updates. |
Thanks for all the comments @jerop! I will look over it a bit more and fixe some last things before done. |
Yeah, I was inclined to favour 1, but after reading #203 is might give more possibilities if we go with 2.
Both TEPS seems to share some things in common, thanks for the pointer.
Would we in that case let the controller reach an external service to see if |
@NikeNano > Would we in that case let the controller reach an external service to see if TaskRuns should be scheduled? As I understand it this new/seperate component has to interact with the controller to effect the way things are scheduled, is it wise to add this dependence to the controller of a separate service @bobcatfish? I'm thinking of it kinda the other way around! Looking at tektoncd/pipeline#3112 you were proposing adding
This is definitely more complicated but it also means the logic to determine the limits can be entirely pluggable <-- and how important this is probably depends on the use cases, e.g. if "MaxParallel" at a PipelineRun level meets 99% of ppl's use cases, maybe that's enough. Looking at tektoncd/pipeline#1305 as an example tho @jstrachan mentioned some cases where maybe you want limits on the number of Runs against a specific repository even |
Thanks for the clarification! I like the idea with the flexibility and possibility for users to extend it further. I will update the suggested solution. |
FYI: have not forgot, just have had to much to do. Will try to get this done before the weekend. |
Thanks for the update @NikeNano , no rush! |
Update the TEP now @bobcatfish, @jerop, will have more time to respond faster now as well! |
@NikeNano could you squash your 31 commits into one please? Before we can proceed on this you'd need to sign the CLA too. |
Thank you for the work on this @NikeNano -- I like the new approach much more! @tektoncd/core-maintainers please take a look :) |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
d06bc5e
to
a73fc49
Compare
@afrittoli I have signed it now, but seems to be some latency before it goes through. Will check tomorrow again. |
This TEP describes how to add pipeline concurrency limits. Author: Niklas Hansson <niklas.sven.hansson@gmail.com> Co-authored-by: Jerop Kipruto <jerop@google.com>
@afrittoli I have squashed and signed it. How do I run the linting locally to debug? Thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a lot of questions throughout. I'm still a little hazy on how exactly this would all work, particularly wrt RBAC.
|
||
As suggested [here](https://github.com/tektoncd/pipeline/issues/2591#issuecomment-647754800), we can add a field - `MaxParallelTasks` - to `PipelineRunSpec` which is an integer that represents the maximum number of `Tasks` that can run concurrently in the `Pipeline`. | ||
|
||
type PipelineRunSpec struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth wrapping this in backticks to make it into a code block.
|
||
Enable users to limit the number of tasks that can run simultaneously in a pipeline, which could help with: | ||
|
||
- Tracking and limiting how much resources a Pipeline is consuming, and thus how much it costs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to get more specific here. Reading tektoncd/pipeline#2591 it doesn't sound like this feature is supposed to solve resource or cost problems in the cluster - I think ResourceQuotas would be the right tool for that? I think this is supposed to solve the problem of flooding a service that the Pipeline relies on. In tektoncd/pipeline#2591 (comment) it's a Database server that is hit with 100 parallel reloads.
Maybe something more along the lines of:
Allow a user to limit the number of time-consuming or heavy operations being sent to a service at once in order to avoid accidentally causing a denial-of-service on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, will update.
--> | ||
|
||
#### Story 1 | ||
User has a Pipeline with 100 independent Tasks but they don't want all 100 tasks to run at once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? It would be great to get a more concrete example, like the database server example from issue tektoncd/pipeline#2591
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I agree on the need of being more specific.
#### Story 1 | ||
User has a Pipeline with 100 independent Tasks but they don't want all 100 tasks to run at once. | ||
#### Story 2 | ||
User wants to limit amount of resources used by a Pipeline at a given time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be solved by a ResourceQuota.
|
||
Separating the logic if a `TaskRun` is allowed to run from the `Task` controller allows for extensibility for adding custom logic to the `Limit Service`. | ||
|
||
As suggested [here](https://github.com/tektoncd/pipeline/issues/2591#issuecomment-647754800), we can add a field - `MaxParallelTasks` - to `PipelineRunSpec` which is an integer that represents the maximum number of `Tasks` that can run concurrently in the `Pipeline`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Up until here the document describes limiting concurrent Tasks per-Pipeline. Are we intending to limit per Pipeline or per PipelineRun?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I guess PipelineRun would allow for a larger flexibility. What do you think would be best @sbwsg ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone's trying to limit the number of concurrent uses of an external resource like a DB then Pipeline-level maximum makes sense (or even Task-level tbh). If it's PipelineRun-level then a user can just spawn more PipelineRuns and ultimately overwhelm the external thing anyway I think, right? What info does the LimitService have access to in order to decide what can / cannot run? Does it even make sense to include a configurable MaxParallelTasks
as part of Tekton's CRDs? Why not make this part of the LimitService's configuration?
So I think this depends on the use-cases we're trying to solve. If we're going to pursue this we really have to nail these down and clearly decide what we're trying to achieve.
|
||
### Goals | ||
|
||
- Limit how many tasks can run concurrently in a Pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, are we limiting concurrent TaskRuns per Pipeline or per PipelineRun? I can imagine both features being desirable for different scenarios but I'm curious which one we're specifically aiming for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above.
|
||
`MaxParallelTasks` has to be >= 0 in. If `MaxParallelTasks` is not specified there should be no limit to how many `TaskRun` that can run in parallel and thus `spec.status.Pending` should be removed from all `TaskRuns`. | ||
|
||
In order to not end up with a deadlock the order of the `Tasks` in a `Pipeline` has to be respected and accounted for by the `Limit service`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you go into a bit more detail on how the deadlock could happen in the first place? That sounds like a possible Risk/Mitigation we need to be aware of.
|
||
The `Limit Service` could run similar to a control loop checking `TaskRuns` and the restrictions of `MaxParallelTasks` for the related `Pipeline`. If the count of running `TaskRuns` is less than `MaxParallelTasks`, a `TaskRun` would be update and `spec.status.Pending` removed. If the count of running `TaskRuns` equals `MaxParallelTasks`, no `TaskRun` would be updated until later when another `TaskRun` is completed. | ||
|
||
`MaxParallelTasks` has to be >= 0 in. If `MaxParallelTasks` is not specified there should be no limit to how many `TaskRun` that can run in parallel and thus `spec.status.Pending` should be removed from all `TaskRuns`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if MaxParallelTasks
is specified but there's no Limit Service
running in the cluster? Is there a default Limit Service
that Tekton Pipelines comes with? Or are we asking all users to implement their own?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is to have one by default, but allow users to implement there own.
MaxParallelTasks int `json:"maxParallelTasks,omitempty"` | ||
} | ||
|
||
The `Limit Service` could run similar to a control loop checking `TaskRuns` and the restrictions of `MaxParallelTasks` for the related `Pipeline`. If the count of running `TaskRuns` is less than `MaxParallelTasks`, a `TaskRun` would be update and `spec.status.Pending` removed. If the count of running `TaskRuns` equals `MaxParallelTasks`, no `TaskRun` would be updated until later when another `TaskRun` is completed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would the proposed Limit Service work in multi-tenant scenarios? Would there be one Limit Service per tenant? If not are you expecting the Limit Service to have global read and write access to all TaskRuns?
It might be worth documenting a bit about how RBAC requirements might be affected with this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm good point, I dont know. Maybe you have some better thought on this @bobcatfish ?
@bobcatfish: GitHub didn't allow me to assign the following users: chhsia0. Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
In the experimental repo there's now discussion of building a controller that provides support for concurrency limits and demonstrates to Operators how to build their own. Operators could then take this controller and base their own implementation on it according to their org's specific concurrency rules. I'm not sure a TEP is necessary any longer if we go that route. |
I think it still would be relevant to have some sort of concurrency limit in the default controller for Tekton, maintaining a controller on your own requires some work. I think it sounds great to allow for the flexibility to allow operators to write there own though. |
I want to make sure this point isn't missed: the proposed controller in tektoncd/experimental#699 is intended to be functional and support options for configuration. Operators won't have to build their own if their use-cases are supported by the implementation there. This doesn't seem that different to the LimitService proposed here, where we'd provide some kind default implementation / configuration out the box I guess? |
|
||
## Requirements | ||
|
||
- Users can specify the maximum number of Tasks that can run concurrently in a Pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @sbwsg that it would be good to see some concrete use case to better understand if this is the right approach to solve the problem.
If a pipeline or task requires a specific service, which we do not want to overload, limiting the number of concurrent tasks in a pipeline would not allow to control that. If the problem is about resource consumption on the cluster, again that might not provide a solution since:
- we allow for execution of custom tasks, and we do not have control on the amount of resource used by a custom task (could be zero pods, or 10 pods)
- the runtime model might change, today a task == a pod, a pipeline == N pods, but we are investigating alternative runtime approaches
Experiment with this in experimental repo to begin with. |
/assign @imjasonh |
Discussed again in API Working Group today. We're going to close this TEP for now on the basis that we have a design ongoing in experimental. In combination with the Pending TEP I think this gives us enough room to start developing solutions to this problem without changes to Tekton Pipelines' Controller. This doesn't preclude bringing this functionality into the controller in future though, if needed. Reopening this TEP or starting a new one at that time makes total sense. /close |
@sbwsg: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
After suggestion from @jerop, I have made a rough draft of a TEP for how to limit work related to Pipeline concurrency. This has been discussed in the following two issues:tektoncd/pipeline#2591, tektoncd/pipeline#1305.
Related PR: tektoncd/pipeline#3112
Would be great if you could help out with the TEP @jerop, thank you!
As a starting point I understand it as the two issues discuss similar but different things, tektoncd/pipeline#2591 relates to concurrency of tasks within a pipeline while tektoncd/pipeline#1305 the concurrency of several pipelines, but maybe I am misunderstanding the discussions?