Assume task not skipped if the run is associated #4583
Conversation
Hi @devholic. Thanks for your PR. I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Thank you @devholic!
/test pull-tekton-pipeline-alpha-integration-tests
```go
if t.IsCustomTask() {
	return t.Run != nil
}
return t.TaskRun != nil
```
@afrittoli @jerop @vdemeester my understanding was that this is not possible 🙃 A `taskRun` or `run` that is associated must have a `succeeded` condition with `unknown` (i.e. running) status. We rely on the same check for the non-final tasks:

```go
case facts.isFinalTask(t.PipelineTask.Name) || t.IsStarted():
```
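For context, a condition-based started-check boils down to something like the following - a minimal sketch rather than Tekton's actual source, assuming knative.dev/pkg condition types (`isStartedByCondition` is a hypothetical name):

```go
import (
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
	duckv1 "knative.dev/pkg/apis/duck/v1"
)

// isStartedByCondition is a hypothetical helper: it reports "started" only
// if a Succeeded-type condition is present, whatever its status.
func isStartedByCondition(conditions duckv1.Conditions) bool {
	for _, c := range conditions {
		if c.Type == apis.ConditionSucceeded {
			// Unknown conventionally means "still running"; True/False mean done.
			return c.Status == corev1.ConditionUnknown ||
				c.Status == corev1.ConditionTrue ||
				c.Status == corev1.ConditionFalse
		}
	}
	// No Succeeded condition at all is treated as "not started" - exactly
	// the window this PR is about, between Create and the first reconcile.
	return false
}
```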
I'm still not sure about the root cause (my current assumption is a timing issue), but I think it makes sense to check only whether the `Run` exists, and to change `GetFinalTasks()` to use that instead of `IsSuccessful()`:
```go
// GetFinalTasks returns a list of final tasks without any taskRun associated with it
// GetFinalTasks returns final tasks only when all DAG tasks have finished executing successfully or skipped or
// any one DAG task resulted in failure
func (facts *PipelineRunFacts) GetFinalTasks() PipelineRunState {
	tasks := PipelineRunState{}
	finalCandidates := sets.NewString()
	// check either pipeline has finished executing all DAG pipelineTasks
	// or any one of the DAG pipelineTask has failed
	if facts.checkDAGTasksDone() {
		// return list of tasks with all final tasks
		for _, t := range facts.State {
			// if facts.isFinalTask(t.PipelineTask.Name) && !t.IsSuccessful() {
			if facts.isFinalTask(t.PipelineTask.Name) && !t.IsScheduled() {
				finalCandidates.Insert(t.PipelineTask.Name)
			}
		}
		tasks = facts.State.getNextTasks(finalCandidates)
	}
	return tasks
}
```
Whether it has a `Succeeded`-type condition or not, I think we can assume the task is scheduled, since the reconciler found that it has an associated `Run`. What do you think?
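For reference, the proposed existence check (the diff at the top of this thread) would make the helper look roughly like this - a sketch assuming the `ResolvedPipelineRunTask` fields used in that diff:

```go
// IsScheduled reports whether the reconciler has already created the
// underlying Run (custom task) or TaskRun for this pipeline task.
func (t ResolvedPipelineRunTask) IsScheduled() bool {
	if t.IsCustomTask() {
		return t.Run != nil
	}
	return t.TaskRun != nil
}
```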
@pritidesai also thought that a `TaskRun` or `Run` must have a `ConditionSucceeded` - but it looks like we don't initialize it when we create a `TaskRun` in the `PipelineRun` reconciler:

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go, lines 722 to 777 in a343e75:
```go
func (c *Reconciler) createTaskRun(ctx context.Context, rprt *resources.ResolvedPipelineRunTask, pr *v1beta1.PipelineRun, storageBasePath string, getTimeoutFunc getTimeoutFunc) (*v1beta1.TaskRun, error) {
	logger := logging.FromContext(ctx)
	tr, _ := c.taskRunLister.TaskRuns(pr.Namespace).Get(rprt.TaskRunName)
	if tr != nil {
		// Don't modify the lister cache's copy.
		tr = tr.DeepCopy()
		// is a retry
		addRetryHistory(tr)
		clearStatus(tr)
		tr.Status.MarkResourceOngoing("", "")
		logger.Infof("Updating taskrun %s with cleared status and retry history (length: %d).", tr.GetName(), len(tr.Status.RetriesStatus))
		return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).UpdateStatus(ctx, tr, metav1.UpdateOptions{})
	}
	rprt.PipelineTask = resources.ApplyPipelineTaskContexts(rprt.PipelineTask)
	taskRunSpec := pr.GetTaskRunSpec(rprt.PipelineTask.Name)
	tr = &v1beta1.TaskRun{
		ObjectMeta: metav1.ObjectMeta{
			Name:            rprt.TaskRunName,
			Namespace:       pr.Namespace,
			OwnerReferences: []metav1.OwnerReference{*kmeta.NewControllerRef(pr)},
			Labels:          combineTaskRunAndTaskSpecLabels(pr, rprt.PipelineTask),
			Annotations:     combineTaskRunAndTaskSpecAnnotations(pr, rprt.PipelineTask),
		},
		Spec: v1beta1.TaskRunSpec{
			Params:             rprt.PipelineTask.Params,
			ServiceAccountName: taskRunSpec.TaskServiceAccountName,
			Timeout:            getTimeoutFunc(ctx, pr, rprt, c.Clock),
			PodTemplate:        taskRunSpec.TaskPodTemplate,
			StepOverrides:      taskRunSpec.StepOverrides,
			SidecarOverrides:   taskRunSpec.SidecarOverrides,
		}}
	if rprt.ResolvedTaskResources.TaskName != "" {
		// We pass the entire, original task ref because it may contain additional references like a Bundle url.
		tr.Spec.TaskRef = rprt.PipelineTask.TaskRef
	} else if rprt.ResolvedTaskResources.TaskSpec != nil {
		tr.Spec.TaskSpec = rprt.ResolvedTaskResources.TaskSpec
	}
	var pipelinePVCWorkspaceName string
	var err error
	tr.Spec.Workspaces, pipelinePVCWorkspaceName, err = getTaskrunWorkspaces(pr, rprt)
	if err != nil {
		return nil, err
	}
	if !c.isAffinityAssistantDisabled(ctx) && pipelinePVCWorkspaceName != "" {
		tr.Annotations[workspace.AnnotationAffinityAssistantName] = getAffinityAssistantName(pipelinePVCWorkspaceName, pr.Name)
	}
	resources.WrapSteps(&tr.Spec, rprt.PipelineTask, rprt.ResolvedTaskResources.Inputs, rprt.ResolvedTaskResources.Outputs, storageBasePath)
	logger.Infof("Creating a new TaskRun object %s for pipeline task %s", rprt.TaskRunName, rprt.PipelineTask.Name)
	return c.PipelineClientSet.TektonV1beta1().TaskRuns(pr.Namespace).Create(ctx, tr, metav1.CreateOptions{})
}
```
Then we initialize `ConditionSucceeded` in the `TaskRun` reconciler:

pipeline/pkg/reconciler/taskrun/taskrun.go, lines 99 to 100 in a343e75:

```go
if !tr.HasStarted() {
	tr.Status.InitializeConditions()
```
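Roughly, that initialization amounts to setting a Succeeded condition with Unknown status - a simplified sketch (the real logic lives in knative's condition manager; `initializeSucceeded` is an illustrative name):

```go
import (
	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

// initializeSucceeded is an illustrative stand-in for InitializeConditions:
// it marks the TaskRun as running by giving Succeeded an Unknown status.
func initializeSucceeded(trs *v1beta1.TaskRunStatus) {
	trs.SetCondition(&apis.Condition{
		Type:   apis.ConditionSucceeded,
		Status: corev1.ConditionUnknown,
		Reason: v1beta1.TaskRunReasonStarted.String(),
	})
}
```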
If the above is correct - that a `TaskRun` or `Run` can exist without `ConditionSucceeded` - then it makes sense to either initialize `ConditionSucceeded` to `Unknown` in the `createTaskRun`/`createRun` functions, or to check for the existence of the `TaskRun` or `Run` instead of `ConditionSucceeded`, which could be missing.
@jerop thanks for the review! 🙂

I prefer checking the existence of the `TaskRun` or `Run` over initializing `Conditions`, since it is not possible to create resources with their status already set (as far as I know - please let me know if this is wrong 🙏), which means the issue could occur again, since reconciliation time is not guaranteed for the resources.
@devholic makes sense to go with that option 👍🏾
cc @tektoncd/core-maintainers
Yeah, the `TaskRun` condition is set by the `TaskRun` reconciler on the first reconciliation loop.

In the `TaskRun` controller, at least within the `ReconcileKind` function, the condition will always be set, since initialising the condition is the first thing we do in `ReconcileKind`.

In the `PipelineRun` controller, instead, there is no guarantee. It may not be very likely to catch a `TaskRun` with no condition set, but it is certainly possible, and I guess it may happen more under high load.
Could someone give comments on this PR or issue #4582? 🙏
/test pull-tekton-pipeline-alpha-integration-tests
thank you for the contribution @devholic!
There are two separate entities involved here: the PipelineRun controller creates a new `TaskRun`, and the taskRun controller relies on the fact that it initializes the `TaskRun`'s condition on its first reconciliation. If you are running into this kind of issue:
Checking if the `TaskRun` exists sounds reasonable. We have a test to confirm this does not happen in our CI/CD, but this could be cluster/configuration specific and might not be feasible to replicate and/or reproduce. I would highly recommend adding such a multi-threaded e2e test to our CI/CD if possible. I would also recommend adding a check that the startTime is set, very similar to the existing check.
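The elided reference above presumably points at something like the existing `HasStarted` helper on `TaskRun`; a sketch of such a startTime-based check:

```go
// hasStarted mirrors the shape of TaskRun.HasStarted: a TaskRun counts as
// started once the reconciler has recorded a non-zero startTime.
func hasStarted(tr *v1beta1.TaskRun) bool {
	return tr.Status.StartTime != nil && !tr.Status.StartTime.IsZero()
}
```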
We highly appreciate any additional best practices or troubleshooting documentation from your experience here.
@pritidesai Thanks for the review and comment 🙇♂️

I'm not sure if I understand this correctly, but I think the `TaskRun` (object/CRD) is created in the `PipelineRun` reconciler's `createTaskRun`. As far as I know, a race condition will not happen even with a multi-threaded controller. Please let me know if my understanding is wrong.

I'll try to figure out a method to reproduce this behavior as much as I can 🙏 I'll add test results to this thread 🙂
For what it's worth, I was able to reproduce this, with […]. The original […]. I did tweak the test script slightly to just run 100 […]. With Pipeline 0.36.0, I generally saw the first 80 or so […].
/test pull-tekton-pipeline-alpha-integration-tests
I rebased the changes :)
tektoncd#4582

Currently, the status of a PipelineRun with Finally is flaky. For example, the PipelineRun's completionTime is earlier than the last TaskRun's completionTime, and running Finally tasks are marked as Skipped.

This issue occurs when a TaskRun has no conditions, because the task state context is not applied to the running Finally tasks and IsFinallySkipped assumes a TaskRun with no conditions is not started.

This commit resolves the issue by assuming the Finally task is not skipped if the run is associated, instead of checking the `Succeeded` condition.

Signed-off-by: Sunghoon Kang <hoon@linecorp.com>
```diff
- if facts.isFinalTask(t.PipelineTask.Name) && !t.isSuccessful() {
+ if facts.isFinalTask(t.PipelineTask.Name) && !t.isScheduled() {
```
I think this would break retries in finally tasks - the current code purposely includes failed tasks, because `getNextTasks` will do the filtering again, and it needs failed tasks to be included so that `getRetryableTasks` can pick up tasks which failed but may be retried. The comment on L405 should be updated to reflect that.

I find the code a bit confusing here, because the filter logic is shared between `GetFinalTasks` and `getNextTasks`. I think that L415 could be:

```go
if facts.isFinalTask(t.PipelineTask.Name) {
```

so that we get all final tasks and the filtering is done only by `getNextTasks`.
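Applying that suggestion to the function quoted earlier would give roughly the following - a sketch of the suggested shape, with all filtering (including retry candidates) left to `getNextTasks`:

```go
func (facts *PipelineRunFacts) GetFinalTasks() PipelineRunState {
	tasks := PipelineRunState{}
	finalCandidates := sets.NewString()
	// Final tasks only run once every DAG task is done (or one has failed).
	if facts.checkDAGTasksDone() {
		for _, t := range facts.State {
			// No extra filtering here: getNextTasks decides what actually runs.
			if facts.isFinalTask(t.PipelineTask.Name) {
				finalCandidates.Insert(t.PipelineTask.Name)
			}
		}
		tasks = facts.State.getNextTasks(finalCandidates)
	}
	return tasks
}
```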
Thanks for this PR - I've only reviewed the code, not the tests yet.

I think there is one issue with the change in `pipelinerunstate.go` - if I'm correct, it also means we don't have a test case for that. wdyt?

The rest seems good. I'll release tomorrow, so perhaps we can still get this fixed in time?
tektoncd#4583 (comment)

We don't have to filter final task candidates, since `getNextTasks` returns only the tasks which need to be executed next.

Signed-off-by: Sunghoon Kang <hoon@linecorp.com>
Thanks for the review @afrittoli 🙂 As you pointed out, delegating the task-filtering logic to `getNextTasks` makes more sense. I updated the code with a test case. Could you take a look?
Thank you for discovering and fixing this issue! The new test coverage looks good to me - it catches the issues I highlighted in the previous revision.

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@pritidesai @abayer if you manage to review this today it can still make the release :)
/lgtm
/retest
Changes
Resolves #4582
Currently, the status of a PipelineRun with Finally is flaky. For example, the PipelineRun's completionTime is earlier than the last TaskRun's completionTime, and running Finally tasks are marked as Skipped.
This issue occurs when a TaskRun has no conditions, because the task state context is not applied to the running Finally tasks and IsFinallySkipped assumes a TaskRun with no conditions is not started.
This commit resolves the issue by assuming the Finally task is not skipped if the run is associated.
/kind bug
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
Release Notes