Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip sidecar deletion logic for canceled or timed-out TaskRun #3877

Conversation

SaschaSchwarze0
Copy link
Contributor

@SaschaSchwarze0 SaschaSchwarze0 commented Apr 13, 2021

Changes

While a TaskRun timed out, I noticed some errors in Tekton's controller log:

{"level":"info","ts":"2021-04-13T09:37:55.166Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"abf578f1-be99-4b31-88dd-92394dfafa90","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8","duration":0.0070604}
{"level":"warn","ts":"2021-04-13T09:38:07.000Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:563","msg":"stopping task run \"sascha-dev-zsmk6-kppc8\" because of \"TaskRunTimeout\"","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"40a1100e-ecab-49f7-8ac9-4a783851d790","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8"}
{"level":"info","ts":"2021-04-13T09:38:07.009Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"build-test\", Name:\"sascha-dev-zsmk6-kppc8\", UID:\"922dfc0c-770f-4b91-b5d6-6f917fce7110\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"4090\", FieldPath:\"\"}): type: 'Warning' reason: 'Failed' TaskRun \"sascha-dev-zsmk6-kppc8\" failed to finish within \"15s\"","commit":"5dc24d2"}
{"level":"info","ts":"2021-04-13T09:38:07.061Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"40a1100e-ecab-49f7-8ac9-4a783851d790","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8","duration":0.0611236}
{"level":"warn","ts":"2021-04-13T09:38:07.061Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:563","msg":"stopping task run \"sascha-dev-zsmk6-kppc8\" because of \"TaskRunTimeout\"","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"28550396-ba40-4108-bcf0-00da57dd2e59","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8"}
{"level":"info","ts":"2021-04-13T09:38:07.068Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"69852f31-53cb-4756-b64d-d5597b6205f4","knative.dev/key":"build-test/sascha-dev-zsmk6","duration":0.0000779}
{"level":"info","ts":"2021-04-13T09:38:07.238Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"build-test\", Name:\"sascha-dev-zsmk6-kppc8\", UID:\"922dfc0c-770f-4b91-b5d6-6f917fce7110\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"4090\", FieldPath:\"\"}): type: 'Warning' reason: 'Failed' TaskRun \"sascha-dev-zsmk6-kppc8\" failed to finish within \"15s\"","commit":"5dc24d2"}
{"level":"info","ts":"2021-04-13T09:38:07.347Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"28550396-ba40-4108-bcf0-00da57dd2e59","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8","duration":0.2856071}
{"level":"info","ts":"2021-04-13T09:38:07.347Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:114","msg":"taskrun done : sascha-dev-zsmk6-kppc8 \n","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"844e896c-fd6f-4538-9451-7169ed78d2f0","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8"}
{"level":"info","ts":"2021-04-13T09:38:07.354Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"844e896c-fd6f-4538-9451-7169ed78d2f0","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8","duration":0.0065659}
{"level":"info","ts":"2021-04-13T09:38:10.395Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:114","msg":"taskrun done : sascha-dev-zsmk6-kppc8 \n","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"7b58afe1-4af6-4e22-a358-f3bc6ee4734f","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8"}
{"level":"error","ts":"2021-04-13T09:38:10.418Z","logger":"tekton-pipelines-controller","caller":"taskrun/reconciler.go:291","msg":"Returned an error","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"7b58afe1-4af6-4e22-a358-f3bc6ee4734f","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8","targetMethod":"ReconcileKind","error":"pods \"sascha-dev-zsmk6-kppc8-pod-xhmhr\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgit.luolix.top/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:291\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:522\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:460"}
{"level":"error","ts":"2021-04-13T09:38:10.418Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:540","msg":"Reconcile error","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","duration":0.0234892,"error":"pods \"sascha-dev-zsmk6-kppc8-pod-xhmhr\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:540\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:523\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:460"}
{"level":"info","ts":"2021-04-13T09:38:10.418Z","logger":"tekton-pipelines-controller","caller":"taskrun/taskrun.go:114","msg":"taskrun done : sascha-dev-zsmk6-kppc8 \n","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"b0558064-a16c-4279-9e45-091e91d81ffe","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8"}
{"level":"info","ts":"2021-04-13T09:38:10.419Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"build-test\", Name:\"sascha-dev-zsmk6-kppc8\", UID:\"922dfc0c-770f-4b91-b5d6-6f917fce7110\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"4175\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' pods \"sascha-dev-zsmk6-kppc8-pod-xhmhr\" not found","commit":"5dc24d2"}
{"level":"error","ts":"2021-04-13T09:38:10.430Z","logger":"tekton-pipelines-controller","caller":"taskrun/reconciler.go:291","msg":"Returned an error","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","knative.dev/traceid":"b0558064-a16c-4279-9e45-091e91d81ffe","knative.dev/key":"build-test/sascha-dev-zsmk6-kppc8","targetMethod":"ReconcileKind","error":"pods \"sascha-dev-zsmk6-kppc8-pod-xhmhr\" not found","stacktrace":"github.com/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun.(*reconcilerImpl).Reconcile\n\tgit.luolix.top/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/taskrun/reconciler.go:291\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:522\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:460"}
{"level":"error","ts":"2021-04-13T09:38:10.430Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:540","msg":"Reconcile error","commit":"5dc24d2","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.taskrun.Reconciler","knative.dev/kind":"tekton.dev.TaskRun","duration":0.0120727,"error":"pods \"sascha-dev-zsmk6-kppc8-pod-xhmhr\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:540\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:523\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20210127163530-0d31134d5f4e/controller/controller.go:460"}
{"level":"info","ts":"2021-04-13T09:38:10.430Z","logger":"tekton-pipelines-controller.event-broadcaster","caller":"record/event.go:282","msg":"Event(v1.ObjectReference{Kind:\"TaskRun\", Namespace:\"build-test\", Name:\"sascha-dev-zsmk6-kppc8\", UID:\"922dfc0c-770f-4b91-b5d6-6f917fce7110\", APIVersion:\"tekton.dev/v1beta1\", ResourceVersion:\"4175\", FieldPath:\"\"}): type: 'Warning' reason: 'InternalError' pods \"sascha-dev-zsmk6-kppc8-pod-xhmhr\" not found","commit":"5dc24d2"} 

They complain a non-existing pod which is reasonable as a timed out TaskRun deletes the pod. The code is inside the stopSidecars function that imo should not try to stop a sidecar from a pod that was deleted. I am therefore extending the logic inside that function to not do something if the TaskRun is canceled or timed out.

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Do not run sidecar deletion logic for canceled or timed-out taskrun where pod was deleted

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Apr 13, 2021
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 13, 2021
@SaschaSchwarze0 SaschaSchwarze0 changed the title Skip sidecar deletion logic for canceled TaskRun Skip sidecar deletion logic for canceled or timed-out TaskRun Apr 13, 2021
@tekton-robot tekton-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 13, 2021
@tekton-robot
Copy link
Collaborator

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ghost
Copy link

ghost commented Apr 13, 2021

/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 13, 2021
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 76.6% 76.9% 0.4

@ghost
Copy link

ghost commented Apr 16, 2021

Interesting! Thanks for this. I guess this must be racing because the sidecar stopping code does check if the Pod is NotFound before continuing.

/approve

Edit: oh wait, it doesn't guard the stop on NotFound, it returns a permanent error. Either way I can see that this process could still race.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 16, 2021
@pritidesai
Copy link
Member

The code is inside the stopSidecars function that imo should not try to stop a sidecar from a pod that was deleted. I am therefore extending the logic inside that function to not do something if the TaskRun is canceled or timed out.

Does the sidecar exist at this point? If we extend the logic to not attempt to delete, how will the sidecar be deleted before ending the run?

@SaschaSchwarze0
Copy link
Contributor Author

The code is inside the stopSidecars function that imo should not try to stop a sidecar from a pod that was deleted. I am therefore extending the logic inside that function to not do something if the TaskRun is canceled or timed out.

Does the sidecar exist at this point? If we extend the logic to not attempt to delete, how will the sidecar be deleted before ending the run?

Sidecars are containers that run in the same pod. The pod is deleted, so is the sidecar.

@pritidesai
Copy link
Member

Sidecars are containers that run in the same pod. The pod is deleted, so is the sidecar.

Yup, my bad, for some reason I interpreted side car running as a separate Pod 🙃 , thanks for the clarification @SaschaSchwarze0 👍

so in case of cancellation and timeout, we do delete the pod so adding these checks here are reasonable and help avoid unwanted error message.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2021
@ghost
Copy link

ghost commented Apr 20, 2021

You'll need to rebase on top of main to get the integration tests passing.

  1. git fetch upstream
  2. git rebase upstream/main
  3. git push --force <your remote> <branch name>

Also applies for timed-out TaskRun. In both cases the pod has already
been deleted previously.
@SaschaSchwarze0 SaschaSchwarze0 force-pushed the sascha-do-not-stop-sidecars-of-deleted-pods branch from 958c120 to 4bb4b72 Compare April 21, 2021 06:37
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2021
@SaschaSchwarze0
Copy link
Contributor Author

You'll need to rebase on top of main to get the integration tests passing.

1. `git fetch upstream`

2. `git rebase upstream/main`

3. `git push --force <your remote> <branch name>`

Done @sbwsg

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/taskrun/taskrun.go 76.6% 76.9% 0.4

@ghost
Copy link

ghost commented Apr 21, 2021

Awesome, thank you! Re-adding Priti's lgtm as there haven't been any other code changes:

/lgtm

@tekton-robot tekton-robot assigned ghost Apr 21, 2021
@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 21, 2021
@tekton-robot tekton-robot merged commit 1475d7f into tektoncd:main Apr 21, 2021
@SaschaSchwarze0 SaschaSchwarze0 deleted the sascha-do-not-stop-sidecars-of-deleted-pods branch April 7, 2022 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants