Fix for PipelineRuns getting stuck in the running state in the cluster #6095

Merged: tekton-robot merged 1 commit into tektoncd:main on Apr 17, 2023

@RafaeLeal (Contributor) commented Feb 1, 2023

Changes

Closes: #6076
This PR addresses the scenario where, within a single reconciliation cycle, the PipelineRun status grows too big to be updated and the status is also changed to timed out. The UpdateStatus call then fails, so the PipelineRun can never leave the running state, even after the timeout has passed.
The PR introduces an arbitrary threshold (2*timeout). If reconciliation notices that this threshold has been reached, it skips every other update and updates only the status, keeping the request as small as possible to avoid etcd request size errors.
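
As a rough illustration of the threshold check described above, here is a minimal Go sketch. It is not the code from this PR: the function name, its signature, the use of `>=`, and the handling of a zero timeout are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// hasTimedOutForALongTime is a hypothetical stand-in for the check described
// in the PR: it reports whether the run has been going for at least twice its
// configured timeout. A zero or negative timeout is treated here as
// "no timeout" (an assumption of this sketch).
func hasTimedOutForALongTime(elapsed, timeout time.Duration) bool {
	if timeout <= 0 {
		return false
	}
	return elapsed >= 2*timeout
}

func main() {
	timeout := time.Hour
	for _, elapsed := range []time.Duration{30 * time.Minute, 90 * time.Minute, 2 * time.Hour} {
		fmt.Printf("elapsed=%v longTimedOut=%v\n", elapsed, hasTimedOutForALongTime(elapsed, timeout))
	}
}
```

Only a run that has passed the 2*timeout mark would take the "status only" path; a run that has merely timed out would still go through the normal reconciliation.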

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix a bug that made big PipelineRuns get stuck in the running state in the cluster

@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label Feb 1, 2023
linux-foundation-easycla bot commented Feb 1, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: RafaeLeal / name: Rafael Leal (bcfc8b2)

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 1, 2023
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 86.6% 86.5% -0.0
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@RafaeLeal
Contributor Author

/kind bug

@tekton-robot tekton-robot added kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesnt merit a release note. labels Feb 3, 2023
@RafaeLeal RafaeLeal force-pushed the CICD-1018/long-timed-out branch from 9360336 to 7feb9d2 Compare February 6, 2023 12:33
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.2% -0.1
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.2% -0.1
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@afrittoli (Member) left a comment

Thanks for this, I think it looks good and the tests are really nice.
I have a couple questions / comments on the logic.

pkg/reconciler/pipelinerun/pipelinerun.go (outdated, resolved)
pkg/reconciler/pipelinerun/pipelinerun_test.go (outdated, resolved)
pkg/reconciler/pipelinerun/pipelinerun_test.go (outdated, resolved)
pkg/reconciler/pipelinerun/pipelinerun.go (outdated, resolved)
return err
}

return nil
Member

Rather than returning nil, I think we could return a "PermanentError" instead. That way the key won't be re-queued to the controller, but we still track this as an error situation, and the error can carry context about what is going on, i.e. that the pipeline run condition is still not updated after 2*timeout, so we apply different logic here.

Contributor Author

👍 I'll change it, but should we consider returning a controller.NewRequeueImmediately()? Since we are testing for !isPipelineRunTimeoutConditionSet(pr), it would not be in an infinite loop.
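
To make the three options in this thread concrete (returning nil, a permanent error, or an immediate requeue), here is a simplified Go sketch. The types and the `decide` function are local stand-ins written for illustration; they only imitate the behaviour the comments attribute to `controller.NewPermanentError` and `controller.NewRequeueImmediately()` and are not the real knative.dev/pkg implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// permanentError is a local stand-in for a "permanent" reconcile error.
// It is not the real knative type.
type permanentError struct{ err error }

func (e permanentError) Error() string { return e.err.Error() }
func (e permanentError) Unwrap() error { return e.err }

// errRequeueImmediately is a local stand-in for an "immediate requeue" signal.
var errRequeueImmediately = errors.New("requeue immediately")

// errStillRunning is a hypothetical error describing the stuck run.
var errStillRunning = errors.New("pipeline run condition not updated after 2*timeout")

// decide models, in simplified form, what a work queue might do with the
// value a reconciler returns.
func decide(err error) string {
	var perm permanentError
	switch {
	case err == nil:
		return "done"
	case errors.Is(err, errRequeueImmediately):
		return "requeue-now"
	case errors.As(err, &perm):
		return "record-error-no-requeue"
	default:
		return "requeue-with-backoff"
	}
}

func main() {
	for _, err := range []error{
		nil,
		errRequeueImmediately,
		permanentError{err: errStillRunning},
		errStillRunning,
	} {
		fmt.Printf("%v -> %s\n", err, decide(err))
	}
}
```

The difference the reviewer points at is visible in the first and third cases: both stop the key from being retried, but only the permanent error leaves a trace explaining why.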

pkg/reconciler/pipelinerun/timeout.go (outdated, resolved)
pkg/reconciler/pipelinerun/timeout.go (resolved)
@afrittoli
Member

Thanks, @RafaeLeal for this PR!

Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved.

Thank you!

@RafaeLeal RafaeLeal changed the title Implement fix for PipelineRuns getting stuck in the cluster Fix for PipelineRuns getting stuck in the running state in the cluster Feb 6, 2023
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 83.5% -4.9
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@RafaeLeal RafaeLeal force-pushed the CICD-1018/long-timed-out branch from 45f2c27 to b86e0b0 Compare February 6, 2023 23:02
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@RafaeLeal
Contributor Author

> Could you provide a bit more context about the issue in the commit message / PR description, and fill in the PR checklist from the template? Also, tide does not automatically squash commits before merging, so I would ask you to please squash them at least once the PR has been approved.

Sure, I was not done yet actually. I added another test for the HasTimedOutForALongTime func. I also had some ideas on how to avoid arbitrary thresholds, which I shared in the issue and quote here.

While implementing it, I considered whether we could always time out in two steps, which would let us avoid arbitrary thresholds. The first reconciliation would check pr.HasTimedOut(), mark the status, and return controller.NewRequeueImmediately(). This would trigger an UpdateStatus with only the condition change. Then, in the second reconciliation, we could try to update the rest of the status (the childReferences, for example).

This could work, I think. The problem I ran into is that we still depend a lot on the order of execution to make everything work properly. In the second reconciliation we already have the timed-out condition, which means pr.IsDone() is true, and this changes the whole reconciliation process.
https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/pipelinerun/pipelinerun.go#L205-L225
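
The two-step idea described above could be sketched roughly as follows. This is only an illustration: the `run` type, the `outcome` values, and the `reconcile` function are hypothetical and far simpler than the real reconciler, and the sketch deliberately ignores the ordering problem raised in the previous paragraph.

```go
package main

import "fmt"

// run is a hypothetical, heavily reduced stand-in for a PipelineRun.
type run struct {
	timedOut        bool     // true once the deadline has passed
	conditionSet    bool     // true once the timed-out condition is stored
	childReferences []string // the potentially large part of the status
}

type outcome string

const (
	requeueImmediately outcome = "requeue-immediately"
	done               outcome = "done"
	keepRunning        outcome = "keep-running"
)

// reconcile sketches the two-step timeout: the first pass stores only the
// condition, the second pass attempts the larger status update.
func reconcile(r *run, children []string) outcome {
	if r.timedOut && !r.conditionSet {
		// Step 1: write only the condition so the update stays small.
		r.conditionSet = true
		return requeueImmediately
	}
	if r.conditionSet {
		// Step 2: the run now counts as done; try the larger status update.
		r.childReferences = children
		return done
	}
	return keepRunning
}

func main() {
	r := &run{timedOut: true}
	children := []string{"task-a", "task-b"}
	fmt.Println(reconcile(r, children), len(r.childReferences))
	fmt.Println(reconcile(r, children), len(r.childReferences))
}
```

Because the condition is written on its own in the first pass, a failure of the larger update in the second pass would no longer keep the run in the running state.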

Looking at this, does it make sense to "fail fast" in the reconciliation process?
For example, even inside this pr.IsDone() branch, it runs several cleanup processes. If one fails, should we really prevent the next one from executing? 🤔

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 88.3% 89.0% 0.6
pkg/reconciler/pipelinerun/pipelinerun.go 87.2% 87.0% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@RafaeLeal RafaeLeal force-pushed the CICD-1018/long-timed-out branch from 3e69fb8 to 4215107 Compare March 31, 2023 16:03
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 88.7% 88.5% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 88.7% 88.5% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@RafaeLeal
Contributor Author

/retest

pkg/reconciler/pipelinerun/pipelinerun.go (outdated, resolved)
pkg/reconciler/pipelinerun/pipelinerun.go (resolved)
pkg/reconciler/pipelinerun/pipelinerun_test.go (outdated, resolved)

// this limit is just enough to set the timeout condition, but not enough for extra metadata.
etcdRequestSizeLimit := 650
prt.TestAssets.Clients.Pipeline.PrependReactor("update", "pipelineruns", withEtcdRequestSizeLimit(t, etcdRequestSizeLimit))
Member

nice!
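
The body of withEtcdRequestSizeLimit is not shown in this conversation. As a guess at the general idea only, a size check of this kind could serialize the object and reject it when the encoding exceeds the limit. Everything in the sketch below (the `status` type, `checkRequestSize`, the JSON encoding, and the limit of 64 bytes) is an assumption made for illustration, not the helper from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// status is a hypothetical, reduced status payload.
type status struct {
	Condition       string   `json:"condition"`
	ChildReferences []string `json:"childReferences,omitempty"`
}

// checkRequestSize rejects payloads whose JSON encoding exceeds limit bytes,
// imitating a storage backend that refuses oversized requests.
func checkRequestSize(obj interface{}, limit int) error {
	data, err := json.Marshal(obj)
	if err != nil {
		return err
	}
	if len(data) > limit {
		return fmt.Errorf("request too large: %d bytes exceeds limit of %d", len(data), limit)
	}
	return nil
}

func main() {
	small := status{Condition: "TimedOut"}
	big := status{Condition: "TimedOut", ChildReferences: make([]string, 100)}

	fmt.Println("small:", checkRequestSize(small, 64))
	fmt.Println("big:  ", checkRequestSize(big, 64))
}
```

With a limit chosen this way, an update that carries only the condition fits, while one that also carries the child references is rejected, which is the situation the test in the PR sets up.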

@RafaeLeal RafaeLeal force-pushed the CICD-1018/long-timed-out branch from 4cf4627 to 0f3b776 Compare April 4, 2023 22:56
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 88.8% 88.6% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 88.6% 0.7
pkg/reconciler/pipelinerun/pipelinerun.go 88.8% 88.6% -0.2
pkg/reconciler/pipelinerun/timeout.go 84.1% 84.8% 0.7

@lbernick (Member) left a comment

@RafaeLeal do you think there is any work remaining to address #6076 once this is merged? (If not I'll link the issue directly to the PR)

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lbernick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 5, 2023
@RafaeLeal
Contributor Author

> @RafaeLeal do you think there is any work remaining to address #6076 once this is merged? (If not I'll link the issue directly to the PR)

I don't think so, you can link the issue.

@RafaeLeal
Contributor Author

/retest

@lbernick lbernick added this to the Pipelines v0.47 milestone Apr 6, 2023
@jerop (Member) left a comment

Thanks for this fix @RafaeLeal, left some suggestions to make the code cleaner

pkg/reconciler/pipelinerun/timeout.go (outdated, resolved)
pkg/reconciler/pipelinerun/timeout.go (outdated, resolved)
pkg/reconciler/pipelinerun/timeout.go (outdated, resolved)
pkg/reconciler/pipelinerun/pipelinerun.go (outdated, resolved)
@RafaeLeal RafaeLeal force-pushed the CICD-1018/long-timed-out branch from 0f3b776 to d38d349 Compare April 15, 2023 00:10
@RafaeLeal
Contributor Author

@jerop @lbernick accepted all suggestions and rebased 👍

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 89.8% 1.9
pkg/reconciler/pipelinerun/pipelinerun.go 89.3% 89.1% -0.2
pkg/reconciler/pipelinerun/timeout.go 88.2% 87.9% -0.4

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipelinerun_types.go 87.9% 89.8% 1.9
pkg/reconciler/pipelinerun/pipelinerun.go 89.3% 89.1% -0.2
pkg/reconciler/pipelinerun/timeout.go 88.2% 87.9% -0.4

@RafaeLeal RafaeLeal force-pushed the CICD-1018/long-timed-out branch from d38d349 to 34e8778 Compare April 17, 2023 14:44
@lbernick lbernick self-assigned this Apr 17, 2023
@jerop (Member) left a comment

/lgtm

Thanks @RafaeLeal!

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2023
@tekton-robot tekton-robot merged commit e376334 into tektoncd:main Apr 17, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Tekton unable to handle PipelineRuns too big
5 participants