Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Node Affinity for TaskRuns that share PVC workspace #2630

Conversation

jlpettersson
Copy link
Member

@jlpettersson jlpettersson commented May 16, 2020

Changes

TaskRuns within a PipelineRun may share files using a workspace volume.
The typical case is files from a git-clone operation. Tasks in a CI-pipeline often
perform operations on the filesystem, e.g. generate files or analyze files,
so the workspace abstraction is very useful.

The Kubernetes way of using file volumes is by using PersistentVolumeClaims.
PersistentVolumeClaims use PersistentVolumes with different access modes.
The most commonly available PV access mode is ReadWriteOnce, volumes with this
access mode can only be mounted on one Node at a time.

When using parallel Tasks in a Pipeline, the pods for the TaskRuns is
scheduled to any Node, most likely not to the same Node in a cluster.
Since volumes with the commonly available ReadWriteOnce access mode cannot
be use by multiple nodes at a time, these "parallel" pods is forced to
execute sequentially, since the volume only is available on one node at a time.
This may make that your TaskRuns time out.

Clusters are often regional, e.g. they are deployed across 3 Availability
Zones, but Persistent Volumes are often zonal, e.g. they are only available
for the Nodes within a single zone. Some cloud providers offer regional PVs,
but sometimes regional PVs is only replicated to one additional zone, e.g. not
all 3 zones within a region. This works fine for most typical stateful application,
but Tekton uses storage in a different way - it is designed so that multiple pods
access the same volume, in a sequece or parallel.

This makes it difficult to design a Pipeline that starts with parallel tasks using
its own PVC and then have a common tasks that mount the volume from the earlier
tasks - since - what happens if those tasks were scheduled to different zones -
the common task can not mount the PVCs that now is located in different zones, so
the PipelineRun is deadlocked.

There are a few technical solutions that offer parallel executions of Tasks
even when sharing PVC workspace:

  • Using PVC access mode ReadWriteMany. But this access mode is not widely available,
    and is typically a NFS server or another not so "cloud native" solution.

  • An alternative is to use a storage that is tied to a specific node, e.g. local volume
    and then configure so pods are scheduled to this node, but this is not commonly
    available and it has drawbacks, e.g. the pod may need to consume and mount a whole
    disk e.g. several hundreds GB.

Consequently, it would be good to find a way so that TaskRun pods that share
workspace are scheduled to the same Node - and thereby make it easy to use parallel
tasks with workspace - while executing concurrently - on widely available Kubernetes
cluster and storage configurations.

A few alternative solutions have been considered, as documented in #2586.
However, they all have major drawbacks, e.g. major API and contract changes.

This commit introduces an "Affinity Assistant" - a minimal placeholder-pod,
so that it is possible to use Kubernetes inter-pod affinity for TaskRun pods that need to be scheduled to the same Node.

This solution has several benefits: it does not introduce any API changes,
it does not break or change any existing Tekton concepts and it is
implemented with very few changes. Additionally it can be disabled with a feature-flag.

How it works: When a PipelineRun is initiated, an "Affinity Assistant" is
created for each PVC workspace volume. TaskRun pods that share workspace
volume is configured with podAffinity to the "Affinity Assisant" pod that
was created for the volume. The "Affinity Assistant" lives until the
PipelineRun is completed, or deleted. "Affinity Assistant" pods are
configured with podAntiAffinity to repel other "Affinity Assistants" -
in a Best Effort fashion.

The Affinity Assistant is singleton workload, since it acts as a
placeholder pod and TaskRun pods with affinity must be scheduled to the
same Node. It is implemented with QoS class Guaranteed but with minimal resource requests -
since it does not provide any work other than beeing a placeholder.

Singleton workloads can be implemented in multiple ways, and they differ
in behavior when the Node becomes unreachable:

  • as a Pod - the Pod is not managed, so it will not be recreated.
  • as a Deployment - the Pod will be recreated and puts Availability before
    the singleton property
  • as a StatefulSet - the Pod will be recreated but puds the singleton
    property before Availability

Therefor the Affinity Assistant is implemented as a StatefulSet.

Essentialy this commit provides an effortless way to use a functional
task parallelism with any Kubernetes cluster that has any PVC based
storage.

Example output from example:
First a Task graph:

#             -- (upper) -- (reporter)
#           /                         \
#  (starter)                           (validator)
#           \                         /
#             -- (lower) ------------

Logs:

$ tkn pr logs parallel-pipelinerun-2wf8d -n pe
Pipeline still running ...
[starter : write] Hello Tekton
[starter : write] + tee /workspace/task-ws/message /tekton/results/message
[starter : write] + echo Hello Tekton

[upper : to-upper] + tee /workspace/w/upper /tekton/results/message
[upper : to-upper] + tr [:lower:] [:upper:]
[upper : to-upper] + cat /workspace/w/init/message
[upper : to-upper] HELLO TEKTON

[lower : to-lower] + tee /workspace/w/lower /tekton/results/message
[lower : to-lower] + tr [:upper:] [:lower:]
[lower : to-lower] + cat /workspace/w/init/message
[lower : to-lower] hello tekton

[reporter : report-result] + echo HELLO TEKTON
[reporter : report-result] HELLO TEKTON

[validator : validate-upper] + grep HELLO TEKTON
[validator : validate-upper] + cat /workspace/files/upper
[validator : validate-upper] HELLO TEKTON

[validator : validate-lower] + cat /workspace/files/lower
[validator : validate-lower] + grep hello tekton
[validator : validate-lower] hello tekton

A listing of the pods, shows that all pods that share PVC workspace, is scheduled to the same Node. The "reporter" task does not use a workspace and hence can be scheduled to any Node (this was a 4 Node cluster).

$ kubectl get po -n pe -o wide
NAME                                                   READY              NODE                          
parallel-pipelinerun-2wf8d-lower-8fv5x-pod-kc9pw       0/1 gke-c4-pool-2-6ce2d103-m28z
parallel-pipelinerun-2wf8d-reporter-qwhc2-pod-572qc    0/1 gke-c4-pool-2-6ce2d103-nwpt
parallel-pipelinerun-2wf8d-starter-456l4-pod-rb4rs     0/1 gke-c4-pool-2-6ce2d103-m28z
parallel-pipelinerun-2wf8d-upper-h24gm-pod-7q67h       0/1 gke-c4-pool-2-6ce2d103-m28z
parallel-pipelinerun-2wf8d-validator-8qwws-pod-nxjhz   0/2 gke-c4-pool-2-6ce2d103-m28z

Fixes #2586
/kind feature

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

- Add Node Affinity for TaskRuns that share PVC workspace

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 16, 2020
@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 16, 2020
@tekton-robot
Copy link
Collaborator

Hi @jlpettersson. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@jlpettersson
Copy link
Member Author

A follow up task is to implement an option to provide a custom PodTemplate for the "Affinity Assistant".

/cc @vdemeester @sbwsg
/hold

@tekton-robot tekton-robot requested review from a user and vdemeester May 16, 2020 00:20
@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 16, 2020
@jlpettersson jlpettersson force-pushed the add_node-affinity_for_pods_that_share_pvc branch from dbd333f to 0104371 Compare May 16, 2020 16:09
Copy link
Member

@imjasonh imjasonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is cool! Is there data about how much faster this makes a PipelineRun that requires parallel access to a workspace? That would be a great thing to put in release notes :)

pkg/workspace/affinity_assistant_names.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/affinity_assistant.go Outdated Show resolved Hide resolved
@afrittoli
Copy link
Member

Really nice, thank you!

I'm not sure this is a real use case, but I'm just wondering if it would be possible to detect and prevent scenario like this:

  • task1 can run in parallel to task2
  • task1 and task2 both use the same workspace
  • task1 has an affinity setting that puts it on an expensive node with special features
  • task2 can run on a normal node

I think it would be nice in this case to either emit a warning and run task1 and 2 serially, or perhaps stop the pipelinerun from running, and communicate that differences in task pod affinity will require different PVCs.

@jlpettersson
Copy link
Member Author

jlpettersson commented May 18, 2020

I think it would be nice in this case to either emit a warning and run task1 and 2 serially, or perhaps stop the pipelinerun from running, and communicate that differences in task pod affinity will require different PVCs.

@afrittoli Good suggestion to generate a warning event. Currently, the affinity specified on a TaskRun-pod is overwritten with an affinity-rule specified to "attract" the Affinity Assistant (if using PVC workspace). But as you suggest, we could generated an event or notify the use about this. This feature may also be disabled by a feature-gate as implemented now.

Can we do the warning-event in it's own PR?

@jlpettersson
Copy link
Member Author

This PR was discussed in the API WG today, and can probably be "unhold" now.

@vdemeester
Copy link
Member

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 18, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/affinity_assistant.go Do not exist 90.0%
pkg/reconciler/pipelinerun/pipelinerun.go 73.5% 73.1% -0.4

@jlpettersson jlpettersson force-pushed the add_node-affinity_for_pods_that_share_pvc branch from 27c3e84 to b010df4 Compare May 18, 2020 18:05
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/affinity_assistant.go Do not exist 90.0%
pkg/reconciler/pipelinerun/pipelinerun.go 73.5% 73.1% -0.4

@afrittoli
Copy link
Member

I think it would be nice in this case to either emit a warning and run task1 and 2 serially, or perhaps stop the pipelinerun from running, and communicate that differences in task pod affinity will require different PVCs.

@afrittoli Good suggestion to generate a warning event. Currently, the affinity specified on a TaskRun-pod is overwritten with an affinity-rule specified to "attract" the Affinity Assistant (if using PVC workspace). But as you suggest, we could generated an event or notify the use about this. This feature may also be disabled by a feature-gate as implemented now.

Can we do the warning-event in it's own PR?

Yes, sounds good

@jlpettersson
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@jlpettersson
Copy link
Member Author

This one is interesting:

parallel-pipelinerun-nlxcd-starter-rqwzh:
   pipelineTaskName: starter
   status:
     conditions:
     - lastTransitionTime: "2020-05-18T20:49:31Z"
       message: 'pod status "PodScheduled":"False"; message: "0/3 nodes are available:
         1 node(s) didn''t match pod affinity rules, 1 node(s) didn''t match
         pod affinity/anti-affinity, 2 node(s) had volume node affinity conflict."'

volume node affinity conflict is related. But this in combination with 1 node(s) didn''t match pod affinity rules sounds like a deadlock? I need to investigate this further.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/affinity_assistant.go Do not exist 88.9%
pkg/reconciler/pipelinerun/pipelinerun.go 73.5% 73.1% -0.4

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2020
@jlpettersson jlpettersson force-pushed the add_node-affinity_for_pods_that_share_pvc branch from 0ecca10 to b1348bc Compare May 22, 2020 20:36
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/affinity_assistant.go Do not exist 90.7%
pkg/reconciler/pipelinerun/pipelinerun.go 75.3% 75.9% 0.6

TaskRuns within a PipelineRun may share files using a workspace volume.
The typical case is files from a git-clone operation. Tasks in a CI-pipeline often
perform operations on the filesystem, e.g. generate files or analyze files,
so the workspace abstraction is very useful.

The Kubernetes way of using file volumes is by using [PersistentVolumeClaims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims).
PersistentVolumeClaims use PersistentVolumes with different [access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes).
The most commonly available PV access mode is ReadWriteOnce, volumes with this
access mode can only be mounted on one Node at a time.

When using parallel Tasks in a Pipeline, the pods for the TaskRuns is
scheduled to any Node, most likely not to the same Node in a cluster.
Since volumes with the commonly available ReadWriteOnce access mode cannot
be use by multiple nodes at a time, these "parallel" pods is forced to
execute sequentially, since the volume only is available on one node at a time.
This may make that your TaskRuns time out.

Clusters are often _regional_, e.g. they are deployed across 3 Availability
Zones, but Persistent Volumes are often _zonal_, e.g. they are only available
for the Nodes within a single zone. Some cloud providers offer regional PVs,
but sometimes regional PVs is only replicated to one additional zone, e.g. not
all 3 zones within a region. This works fine for most typical stateful application,
but Tekton uses storage in a different way - it is designed so that multiple pods
access the same volume, in a sequece or parallel.

This makes it difficult to design a Pipeline that starts with parallel tasks using
its own PVC and then have a common tasks that mount the volume from the earlier
tasks - since - what happens if those tasks were scheduled to different zones -
the common task can not mount the PVCs that now is located in different zones, so
the PipelineRun is deadlocked.

There are a few technical solutions that offer parallel executions of Tasks
even when sharing PVC workspace:

- Using PVC access mode ReadWriteMany. But this access mode is not widely available,
  and is typically a NFS server or another not so "cloud native" solution.

- An alternative is to use a storage that is tied to a specific node, e.g. local volume
  and then configure so pods are scheduled to this node, but this is not commonly
  available and it has drawbacks, e.g. the pod may need to consume and mount a whole
  disk e.g. several hundreds GB.

Consequently, it would be good to find a way so that TaskRun pods that share
workspace are scheduled to the same Node - and thereby make it easy to use parallel
tasks with workspace - while executing concurrently - on widely available Kubernetes
cluster and storage configurations.

A few alternative solutions have been considered, as documented in tektoncd#2586.
However, they all have major drawbacks, e.g. major API and contract changes.

This commit introduces an "Affinity Assistant" - a minimal placeholder-pod,
so that it is possible to use [Kubernetes inter-pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) for TaskRun pods that need to be scheduled to the same Node.

This solution has several benefits: it does not introduce any API changes,
it does not break or change any existing Tekton concepts and it is
implemented with very few changes. Additionally it can be disabled with a feature-flag.

**How it works:** When a PipelineRun is initiated, an "Affinity Assistant" is
created for each PVC workspace volume. TaskRun pods that share workspace
volume is configured with podAffinity to the "Affinity Assisant" pod that
was created for the volume. The "Affinity Assistant" lives until the
PipelineRun is completed, or deleted. "Affinity Assistant" pods are
configured with podAntiAffinity to repel other "Affinity Assistants" -
in a Best Effort fashion.

The Affinity Assistant is _singleton_ workload, since it acts as a
placeholder pod and TaskRun pods with affinity must be scheduled to the
same Node. It is implemented with [QoS class Guaranteed](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed) but with minimal resource requests -
since it does not provide any work other than beeing a placeholder.

Singleton workloads can be implemented in multiple ways, and they differ
in behavior when the Node becomes unreachable:

- as a Pod - the Pod is not managed, so it will not be recreated.
- as a Deployment - the Pod will be recreated and puts Availability before
  the singleton property
- as a StatefulSet - the Pod will be recreated but puds the singleton
  property before Availability

Therefor the Affinity Assistant is implemented as a StatefulSet.

Essentialy this commit provides an effortless way to use a functional
task parallelism with any Kubernetes cluster that has any PVC based
storage.

Solves tektoncd#2586
/kind feature
@jlpettersson jlpettersson force-pushed the add_node-affinity_for_pods_that_share_pvc branch from b1348bc to 77db014 Compare May 22, 2020 20:52
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/affinity_assistant.go Do not exist 90.7%
pkg/reconciler/pipelinerun/pipelinerun.go 75.3% 75.9% 0.6

@jlpettersson
Copy link
Member Author

jlpettersson commented May 22, 2020

Thanks a lot! I finally managed to get through the review.
It looks really good - I have a few comments on the affinity_assistance about labels and reading config flags - but nothing really blocking.
I think it would be nice to increase the test coverage in a few spots, I left a few comments about that.
/approve

Thanks a lot @afrittoli for a thorough review! I have updated with many of your suggestions and improved test coverage as you suggested. I also responded to your feedback. Thanks!

@afrittoli afrittoli added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 1, 2020
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

1 similar comment
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

Copy link
Member

@imjasonh imjasonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 1, 2020
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, ImJasonH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Difficult to use parallel Tasks that share files using workspace
5 participants