Pipelines v0.43.1 installation nondeterministically fails to update PipelineRun status with TaskRun statuses #5964

Closed
bendory opened this issue Jan 6, 2023 · 11 comments · Fixed by #5968
Assignees
lbernick
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flakey test priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@bendory
Contributor

bendory commented Jan 6, 2023

Expected Behavior

(Edited: moved error message from tkn to tektoncd/cli#1847)

PipelineRun status reflects status of the child TaskRuns

Actual Behavior

tkn pr list shows the PipelineRun as Running even though tkn tr list shows all related TaskRuns as Succeeded

Steps to Reproduce the Problem

  1. Use a v1beta1 PipelineRun that ran cleanly in v0.42.0
  2. Upgrade to v0.43.1
  3. Run the pipeline via kubectl create|apply --filename ...
  4. tkn pr logs --last -f
  5. wait
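
Concretely, the loop looks roughly like this (a sketch; the filenames are placeholders for the clone-build-push example manifests discussed below):

$ kubectl apply --filename clone-build-push.yaml        # v1beta1 Pipeline
$ kubectl create --filename clone-build-push-run.yaml   # v1beta1 PipelineRun
$ tkn pr logs --last -f
$ tkn pr list   # sometimes stays Running...
$ tkn tr list   # ...even after all TaskRuns show Succeeded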

Additional Info

  • Kubernetes version:
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.8", GitCommit:"fdc77503e954d1ee641c0e350481f7528e8d068b", GitTreeState:"clean", BuildDate:"2022-11-09T13:38:19Z", GoVersion:"go1.18.8", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.7-gke.900", GitCommit:"e35c4457f66187eff006dda6d2c0fe12144ef2ec", GitTreeState:"clean", BuildDate:"2022-10-26T09:25:34Z", GoVersion:"go1.18.7b7", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
$ tkn version
Client version: 0.28.0
Chains version: v0.13.1
Pipeline version: v0.43.1

I expected this to be resolved in v0.43.1 by #5945 and #5948, but no joy. 😿

@bendory bendory added the kind/bug Categorizes issue or PR as related to a bug. label Jan 6, 2023
@lbernick
Member

lbernick commented Jan 6, 2023

Looking at the PipelineRun in question:

$ k get pr clone-build-push-run-q4mbg -o yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"tekton.dev/v1beta1","kind":"Pipeline","metadata":{"annotations":{},"name":"clone-build-push","namespace":"default"},"spec":{"description":"This pipeline clones a git repo, builds a Docker image with Kaniko and\npushes it to a registry\n","params":[{"name":"repo-url","type":"string"},{"name":"image-reference","type":"string"}],"tasks":[{"name":"fetch-source","params":[{"name":"url","value":"$(params.repo-url)"}],"taskRef":{"name":"git-clone"},"workspaces":[{"name":"output","workspace":"shared-data"}]},{"name":"build-push","params":[{"name":"IMAGE","value":"$(params.image-reference)"}],"runAfter":["fetch-source"],"taskRef":{"name":"kaniko"},"workspaces":[{"name":"source","workspace":"shared-data"}]}],"workspaces":[{"name":"shared-data"}]}}
    tekton.dev/v1beta1TaskRuns: '{"clone-build-push-run-q4mbg-build-push":{"pipelineTaskName":"build-push","status":{"conditions":[{"type":"Succeeded","status":"Unknown","lastTransitionTime":"2023-01-06T15:02:41Z","reason":"Running","message":"Not
      all Steps in the Task have finished executing"}],"podName":"clone-build-push-run-q4mbg-build-push-pod",
    ...
status:
  childReferences:
  - apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    name: clone-build-push-run-q4mbg-build-push
    pipelineTaskName: build-push
  - apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    name: clone-build-push-run-q4mbg-fetch-source
    pipelineTaskName: fetch-source
  conditions:
  - lastTransitionTime: "2023-01-06T15:02:30Z"
    message: 'Tasks Completed: 1 (Failed: 0, Cancelled 0), Incomplete: 1, Skipped:
      0'
    reason: Running
    status: Unknown
    type: Succeeded
$ k get tr clone-build-push-run-q4mbg-fetch-source -o yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"tekton.dev/v1beta1","kind":"Pipeline","metadata":{"annotations":{},"name":"clone-build-push","namespace":"default"},"spec":{"description":"This pipeline clones a git repo, builds a Docker image with Kaniko and\npushes it to a registry\n","params":[{"name":"repo-url","type":"string"},{"name":"image-reference","type":"string"}],"tasks":[{"name":"fetch-source","params":[{"name":"url","value":"$(params.repo-url)"}],"taskRef":{"name":"git-clone"},"workspaces":[{"name":"output","workspace":"shared-data"}]},{"name":"build-push","params":[{"name":"IMAGE","value":"$(params.image-reference)"}],"runAfter":["fetch-source"],"taskRef":{"name":"kaniko"},"workspaces":[{"name":"source","workspace":"shared-data"}]}],"workspaces":[{"name":"shared-data"}]}}
  ownerReferences:
  - apiVersion: tekton.dev/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: PipelineRun
    name: clone-build-push-run-q4mbg
    uid: b9ce5106-7741-4311-81d7-b55be2e447f9
status:
  conditions:
  - lastTransitionTime: "2023-01-06T15:02:30Z"
    message: All Steps have completed executing
    reason: Succeeded
    status: "True"
    type: Succeeded
$ k get tr clone-build-push-run-q4mbg-build-push -o yaml
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  annotations:
    chains.tekton.dev/signed: "true"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"tekton.dev/v1beta1","kind":"Pipeline","metadata":{"annotations":{},"name":"clone-build-push","namespace":"default"},"spec":{"description":"This pipeline clones a git repo, builds a Docker image with Kaniko and\npushes it to a registry\n","params":[{"name":"repo-url","type":"string"},{"name":"image-reference","type":"string"}],"tasks":[{"name":"fetch-source","params":[{"name":"url","value":"$(params.repo-url)"}],"taskRef":{"name":"git-clone"},"workspaces":[{"name":"output","workspace":"shared-data"}]},{"name":"build-push","params":[{"name":"IMAGE","value":"$(params.image-reference)"}],"runAfter":["fetch-source"],"taskRef":{"name":"kaniko"},"workspaces":[{"name":"source","workspace":"shared-data"}]}],"workspaces":[{"name":"shared-data"}]}}
  generation: 1
  name: clone-build-push-run-q4mbg-build-push
  ownerReferences:
  - apiVersion: tekton.dev/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: PipelineRun
    name: clone-build-push-run-q4mbg
status:
  conditions:
  - lastTransitionTime: "2023-01-06T15:03:16Z"
    message: All Steps have completed executing
    reason: Succeeded
    status: "True"
    type: Succeeded

The reason the PipelineRun isn't being updated with the statuses of the TaskRuns is that the TaskRuns' ownerReferences point at a v1beta1 PipelineRun, but the PipelineRun is v1. It seems like you originally applied a v1beta1 PipelineRun to the cluster and it got converted to v1 (and the same with the TaskRuns), but the child TaskRuns' ownerReferences weren't changed, so the controller doesn't know how to update the PipelineRun.
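
To see the mismatch directly with plain kubectl (resource names taken from the output above):

$ kubectl get tr clone-build-push-run-q4mbg-fetch-source -o jsonpath='{.metadata.ownerReferences[0].apiVersion}'
tekton.dev/v1beta1
$ kubectl get pr clone-build-push-run-q4mbg -o jsonpath='{.apiVersion}'
tekton.dev/v1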

/priority critical-urgent

@tekton-robot tekton-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jan 6, 2023
@bendory
Contributor Author

bendory commented Jan 6, 2023

Another curiosity: the "stuck in Running" piece of this bug report is non-deterministic. I have two identical PipelineRuns in the cluster, one Succeeded and one stuck in Running. In this case, the first one is stuck and the second succeeded.

I tried the identical setup in a second cluster and there the first one Succeeded -- so I assume there is some non-deterministic behavior at play.

@lbernick
Member

lbernick commented Jan 6, 2023

@bendory if you don't mind I'm going to split this into two different bugs; I'll keep this one to track the PipelineRun status updates and open a new one for the issue observed with the CLI.

@lbernick lbernick changed the title Pipelines v0.43.0 installation results in errors watching PipelineRun Pipelines v0.43.0 installation nondeterministically fails to update PipelineRun status with TaskRun statuses Jan 6, 2023
@lbernick lbernick added the kind/flake Categorizes issue or PR as related to a flakey test label Jan 6, 2023
@bendory
Contributor Author

bendory commented Jan 6, 2023

@bendory if you don't mind I'm going to split this into two different bugs; I'll keep this one to track the PipelineRun status updates and open a new one for the issue observed with the CLI.

I certainly don't mind! Thank you for your attention on this!

@lbernick lbernick changed the title Pipelines v0.43.0 installation nondeterministically fails to update PipelineRun status with TaskRun statuses Pipelines v0.43.1 installation nondeterministically fails to update PipelineRun status with TaskRun statuses Jan 6, 2023
@lbernick
Member

lbernick commented Jan 6, 2023

This PipelineRun was originally created as v1beta1, but it is being converted to v1 and I'm not sure why. The k8s docs on custom resource versioning say: "If you update an existing object, it is rewritten at the version that is currently the storage version. This is the only way that objects can change from one version to another." Since our storage version is v1beta1, I'm surprised anything is converting the PipelineRun to v1.

One possibility is that this is due to some interaction with kubectl. kubectl get taskruns.v1beta1.tekton.dev clone-build-push-run-q4mbg-fetch-source returns a v1beta1 TaskRun, but kubectl get taskrun clone-build-push-run-q4mbg-fetch-source returns a v1 TaskRun. This leads me to believe that the default version kubectl uses for the Tekton APIs is v1 rather than v1beta1, and that when the PipelineRun is created with kubectl create, kubectl is somehow specifying that a v1 PipelineRun should be used. There's a bit more detail in kubernetes/kubernetes#58131: it says that "kubectl get deployment foo is ambiguous, since the server has deployments in multiple api groups. When a resource exists in multiple api groups, kubectl uses the first group listed in discovery docs published by the server which contains the resource." (although I am not sure what "discovery docs" refers to).

After thinking about this a bit more, it's not clearly problematic that the PipelineRun's childReferences are v1beta1 TaskRuns and the TaskRun's ownerReferences are a v1beta1 PipelineRun even though the version of these resources returned by kubectl get is v1. All of these objects should still be stored as v1beta1 versions, and the apiserver can convert into whatever version is requested. I would expect that when we update PipelineRun status from child TaskRuns, we should be able to find the child TaskRuns regardless of what version we use in the request. Therefore, I may have been wrong that this was caused by a version mismatch, and the flake may live elsewhere.
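
For anyone who wants to check the version resolution themselves, stock kubectl is enough (the jsonpath just extracts the returned apiVersion):

$ kubectl get taskruns.v1beta1.tekton.dev clone-build-push-run-q4mbg-fetch-source -o jsonpath='{.apiVersion}'
tekton.dev/v1beta1
$ kubectl get taskrun clone-build-push-run-q4mbg-fetch-source -o jsonpath='{.apiVersion}'
tekton.dev/v1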

@bendory have you observed this flake more than once?

@JeromeJu
Member

JeromeJu commented Jan 6, 2023

This leads me to believe the default version of tekton api versions used by kubectl is v1 rather than v1beta1, and when the pipelinerun is created with kubectl create, kubectl is specifying somehow that a v1 PipelineRun should be used.

I think this corresponds with
https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#version-priority. So whenever we run kubectl get <tr/pr-name>, it shows the v1 version, since v1 has the higher version priority.
k get pipelineruns.v1beta1.tekton.dev/ or k get pipelineruns.v1.tekton.dev/ might be helpful for accessing CRDs of a specific version.
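
A quick way to confirm the served versions and the preferred (priority-ordered) version on a given cluster, using only standard kubectl:

$ kubectl get crd pipelineruns.tekton.dev -o jsonpath='{.spec.versions[*].name}'   # all served versions
$ kubectl get --raw /apis/tekton.dev   # the group's discovery doc, including preferredVersion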

@bendory
Contributor Author

bendory commented Jan 6, 2023

@bendory have you observed this flake more than once?

Yes, when playing earlier today, I observed it in two different clusters.

@lbernick
Member

lbernick commented Jan 6, 2023

aha! In the output of kubectl describe I see:

  Warning  InternalError  8m10s (x34 over 4h37m)  PipelineRun  1 error occurred:
           * expected no ChildReferences with embedded-status=full, but found 2

coming from this line of code.

I think the problem here is with our conversion logic. With embedded status = full, when we convert v1beta1 to v1, we have a v1 pipelinerun w/ taskruns in annotations, and childreferences in status. When converting back to v1beta1, we have a pipelinerun with both childreferences and taskruns in the status, which is invalid (see this test case).
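
Concretely, the round-tripped v1beta1 status would end up looking something like this (hand-written sketch, most fields elided), which trips the "expected no ChildReferences with embedded-status=full" check:

status:
  taskRuns:          # full embedded status, restored from the v1 annotation
    clone-build-push-run-q4mbg-fetch-source: {...}
  childReferences:   # carried over from the v1 status
  - apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    name: clone-build-push-run-q4mbg-fetch-source
    pipelineTaskName: fetch-source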

I'm still not sure why conversion is going on, or what's causing the nondeterministic behavior, but the error from this event should be fixed by changing our conversion logic.

@lbernick lbernick self-assigned this Jan 6, 2023
@JeromeJu
Member

JeromeJu commented Jan 6, 2023

aha! In the output of kubectl describe I see:

  Warning  InternalError  8m10s (x34 over 4h37m)  PipelineRun  1 error occurred:
           * expected no ChildReferences with embedded-status=full, but found 2

coming from this line of code.

I think the problem here is with our conversion logic. With embedded status = full, when we convert v1beta1 to v1, we have a v1 pipelinerun w/ taskruns in annotations, and childreferences in status. When converting back to v1beta1, we have a pipelinerun with both childreferences and taskruns in the status, which is invalid (see this test case).

Thanks Lee, that makes sense! I think this test case is only valid with 'both' embedded-status. It was not considered 'invalid' under 'full' embedded-status at the time this case was added, because there shouldn't be a case where a v1 PipelineRun is created with those annotations?

@JeromeJu
Member

JeromeJu commented Jan 6, 2023

I'm still not sure why conversion is going on, or what's causing the nondeterministic behavior, but the error from this event should be fixed by changing our conversion logic.

A guess could be that a v1beta1 PipelineRun got converted to v1, so the TaskRuns got converted into annotations in v1, and then it was converted back to v1beta1 to be stored in etcd. (This is also the only case I can think of in which there could be both childReferences and taskRuns; otherwise the v1beta1 PipelineRunStatus itself would be invalid under full embedded-status?)

@lbernick
Member

lbernick commented Jan 9, 2023

I also see this in the controller logs, which makes me more confident that this is causing the problem:

{"severity":"error","timestamp":"2023-01-08T10:25:50.442Z","logger":"tekton-pipelines-controller","caller":"pipelinerun/pipelinerun.go:233","message":"Error while syncing the pipelinerun status: expected no ChildReferences with embedded-status=full, but found 2","commit":"be1ce9d","knative.dev/controller":"git.luolix.top.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"26111e12-2260-47ec-a391-a49c32c8d5ad","knative.dev/key":"default/clone-build-push-run-q4mbg","stacktrace":"github.com/tektoncd/pipeline/pkg/reconciler/pipelinerun.(*Reconciler).ReconcileKind\n\tgit.luolix.top/tektoncd/pipeline/pkg/reconciler/pipelinerun/pipelinerun.go:233\ngit.luolix.top/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/pipelinerun.(*reconcilerImpl).Reconcile\n\tgit.luolix.top/tektoncd/pipeline/pkg/client/injection/reconciler/pipeline/v1beta1/pipelinerun/reconciler.go:235\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:542\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20221011175852-714b7630a836/controller/controller.go:491"}

Unfortunately, it doesn't seem like we can reliably reproduce the error, but I'm going to run a few PipelineRuns with #5968 applied to my cluster and see if it recurs.

Thanks Lee, that makes sense! I think this test case is only valid with 'both' embedded-status.

I think that's exactly the problem-- our test cases don't depend on the value of this flag and they need to. Maybe better to move this discussion to #5968 if the test cases need more discussion?

It was not considered 'invalid' under 'full' embedded-status at the time this case was added, because there shouldn't be a case where a v1 PipelineRun is created with those annotations?

I'm not sure what you mean; you're correct that nobody is going to create a v1 PipelineRun with the relevant annotations, but we need to be able to round-trip the conversion, and with the existing logic I don't think we can.

I'm still not sure why conversion is going on, or what's causing the nondeterministic behavior, but the error from this event should be fixed by changing our conversion logic.

A guess could be a v1beta1 PipelineRun got converted to v1, so the TaskRuns got converted as Annotations in v1 and then converted back to v1beta1 to be stored in etcd.

Sorry, I should have been more clear. I'm not sure why the conversion webhook is being called at all. According to the docs this should only happen if a resource is created in a version other than the stored version, so why are we observing a PipelineRun being converted from v1beta1 to v1?
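
For whoever digs further, the storage version and conversion setup can be confirmed from the CRD itself (standard apiextensions fields, nothing Tekton-specific):

$ kubectl get crd pipelineruns.tekton.dev -o jsonpath='{.spec.versions[?(@.storage==true)].name}'   # storage version (expected: v1beta1)
$ kubectl get crd pipelineruns.tekton.dev -o jsonpath='{.spec.conversion.strategy}'   # "Webhook" if a conversion webhook is configured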
