Avoid occasional failures when using remote resolution #6424

Merged

l-qing
Contributor

@l-qing l-qing commented Mar 22, 2023

fix #6408

When the owner of a ResolutionRequest (a TaskRun or PipelineRun) is
reconciled twice within a short interval, the second reconciliation
can fail when it calls Submit. The informer cache may not have been
updated yet, so the lister does not find the ResolutionRequest created
by the first reconciliation, and the second attempt to create it fails
with an AlreadyExists error.

In this case, we can assume that the request is in progress; the next
reconciliation will handle it based on the actual state.
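
A minimal sketch of the idea described above, using only the Go standard library. The names `submit`, `errAlreadyExists`, and `errRequestInProgress` are hypothetical stand-ins chosen for illustration; they are not the Tekton implementation. In a real controller the check would more likely go through `apierrors.IsAlreadyExists` from `k8s.io/apimachinery/pkg/api/errors` rather than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the Kubernetes AlreadyExists
// API error and for an "in progress" signal returned to the caller.
var (
	errAlreadyExists     = errors.New("resolution request already exists")
	errRequestInProgress = errors.New("resolution request in progress")
)

// submit looks the request up in a (possibly stale) cache and creates it if
// it is missing. An AlreadyExists error from create means an earlier
// reconciliation already created the object, so it is reported as in
// progress instead of as a failure.
func submit(
	get func(name string) (data string, found bool),
	create func(name string) error,
	name string,
) (string, error) {
	if data, found := get(name); found {
		return data, nil
	}
	if err := create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return "", err
	}
	return "", errRequestInProgress
}

func main() {
	// The cache has not caught up, so every lookup misses.
	staleGet := func(string) (string, bool) { return "", false }

	// create simulates the API server, which rejects a second create
	// for the same name.
	created := map[string]bool{}
	create := func(name string) error {
		if created[name] {
			return errAlreadyExists
		}
		created[name] = true
		return nil
	}

	for i := 1; i <= 2; i++ {
		_, err := submit(staleGet, create, "request-a")
		fmt.Printf("reconcile %d: %v\n", i, err)
	}
}
```

Both reconciliations report the request as in progress; without the `errAlreadyExists` check the second one would surface a hard error to the owner.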

Changes

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Avoid occasional failures of TaskRun/PipelineRun execution using remote resolution.

@tekton-robot tekton-robot added the release-note, size/S, and needs-ok-to-test labels Mar 22, 2023
@tekton-robot
Collaborator

Hi @l-qing. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@l-qing
Contributor Author

l-qing commented Mar 22, 2023

/kind bug

@tekton-robot tekton-robot added the kind/bug label Mar 22, 2023
@chuangw6
Member

/assign

@QuanZhang-William
Member

/assign

@l-qing l-qing force-pushed the fix/resolution-request-submit-failed branch from bccf89e to b457654 Compare March 22, 2023 22:36
@dibyom
Member

dibyom commented Mar 23, 2023

/ok-to-test

@tekton-robot tekton-robot added the ok-to-test label and removed the needs-ok-to-test label Mar 23, 2023
@l-qing l-qing force-pushed the fix/resolution-request-submit-failed branch from b457654 to 4edb527 Compare March 23, 2023 14:47
@tekton-robot tekton-robot added the size/XS label and removed the size/S label Mar 23, 2023
@l-qing
Contributor Author

l-qing commented Mar 23, 2023

/hold

I need to check my account settings before the final merge.

@tekton-robot tekton-robot added the do-not-merge/hold label Mar 23, 2023
Member

@QuanZhang-William QuanZhang-William left a comment

Thanks @l-qing! Could you please also add a unit test for it?

@@ -54,7 +55,10 @@ var _ Requester = &CRDRequester{}
 func (r *CRDRequester) Submit(ctx context.Context, resolver ResolverName, req Request) (ResolvedResource, error) {
 	rr, _ := r.lister.ResolutionRequests(req.Namespace()).Get(req.Name())
 	if rr == nil {
-		if err := r.createResolutionRequest(ctx, resolver, req); err != nil {
+		if err := r.createResolutionRequest(ctx, resolver, req); err != nil &&
+			// If the request already exists then we can assume that is in progress.
Member

Can we add the reason why it may already exist to the comment? I personally didn't get the idea until I read the commit message 😄

Contributor Author

OK, I've added a simple description.

Member

Thanks!

@l-qing
Contributor Author

l-qing commented Mar 24, 2023

/remove-hold

@tekton-robot tekton-robot removed the do-not-merge/hold label Mar 24, 2023
@l-qing l-qing force-pushed the fix/resolution-request-submit-failed branch from 4edb527 to 5060886 Compare March 24, 2023 04:00
@l-qing
Contributor Author

l-qing commented Mar 24, 2023

Could you please also add a unit test for it?

Can I create a separate pull request to add this unit test?
I anticipate that writing it may be a bit complex.

Previously, there were no unit tests in this package, and this feature relies on some interfaces.
I usually use gomock to mock interfaces, but I'm not sure whether that approach is acceptable in Tekton.
I need to see how your community typically writes unit tests before I can write one in a similar style.

My computer needs to be repaired this weekend, so I may not have time to work on it, but I should have time next weekend. 😁
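
On the question of mocking interfaces, one common alternative to gomock in Go is a small hand-written fake. The sketch below shows the pattern with a table of cases; `requestCreator`, `fakeCreator`, `ensureCreated`, and the error values are all hypothetical names invented for illustration, and whether this matches Tekton's testing conventions is not established in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors used only by this sketch.
var (
	errAlreadyExists = errors.New("already exists")
	errUnavailable   = errors.New("server unavailable")
)

// requestCreator is a hypothetical interface for the dependency under test.
type requestCreator interface {
	Create(name string) error
}

// fakeCreator is a hand-written test double: it records every call and
// returns whatever error the test configured.
type fakeCreator struct {
	err   error
	calls []string
}

func (f *fakeCreator) Create(name string) error {
	f.calls = append(f.calls, name)
	return f.err
}

// ensureCreated is a hypothetical function under test. It treats
// "already exists" as success and passes any other error through.
func ensureCreated(c requestCreator, name string) error {
	if err := c.Create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	cases := []struct {
		name    string
		fakeErr error
		want    error
	}{
		{"created", nil, nil},
		{"already exists is tolerated", errAlreadyExists, nil},
		{"other errors surface", errUnavailable, errUnavailable},
	}
	for _, tc := range cases {
		fake := &fakeCreator{err: tc.fakeErr}
		got := ensureCreated(fake, "request-a")
		// Each case expects exactly one Create call and the listed result.
		ok := got == tc.want && len(fake.calls) == 1
		fmt.Printf("%s: ok=%v\n", tc.name, ok)
	}
}
```

In a real test file the loop body would live in a `testing.T` subtest; `main` is used here only so the sketch runs on its own.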

@l-qing
Contributor Author

l-qing commented Mar 24, 2023

pinging @lbernick @imjasonh for approval

@lbernick
Member

Thank you for the fix @l-qing! I have a few questions:

  • When you say "submitting resolution requests quickly", do you mean creating many PipelineRuns/TaskRuns using remote resolution in a short window of time?
  • What cache are you referring to here?
  • Do you have a sense of why duplicate ResolutionRequests are being created?
  • I saw on the issue you mentioned "In my environment, fixing it this way can avoid that error." I'm curious if you were able to reproduce the error, and how you know that this fix resolved it?

I agree with @QuanZhang-William that we should have unit tests for this, but I see your point that this package is not really tested at all; I've created #6429 to track testing.

Can you please reword the release note to mention changes visible to users? Users don't typically interact with ResolutionRequests (or shouldn't have to); they mainly interact with TaskRuns/PipelineRuns using remote resolution.

@l-qing l-qing force-pushed the fix/resolution-request-submit-failed branch from 5060886 to c5c2a0f Compare March 24, 2023 14:32
@l-qing
Contributor Author

l-qing commented Mar 24, 2023

Hi, @lbernick I have revised my comments and release note based on your suggestions.

Replying to your questions:

1. "submitting resolution requests quickly"

I mean that the owner of the request (a TaskRun or PipelineRun) is reconciled several times in quick succession.
Each reconciliation submits the same request, and the same request always has the same name.

Call chain:

func (c *Reconciler) prepare(ctx context.Context, tr *v1beta1.TaskRun) (*v1beta1.TaskSpec, *resources.ResolvedTask, error) {
	ctx, span := c.tracerProvider.Tracer(TracerName).Start(ctx, "prepare")
	defer span.End()
	logger := logging.FromContext(ctx)
	tr.SetDefaults(ctx)
	// list VerificationPolicies for trusted resources
	vp, err := c.verificationPolicyLister.VerificationPolicies(tr.Namespace).List(labels.Everything())
	if err != nil {
		return nil, nil, fmt.Errorf("failed to list VerificationPolicies from namespace %s with error %v", tr.Namespace, err)
	}
	getTaskfunc := resources.GetTaskFuncFromTaskRun(ctx, c.KubeClientSet, c.PipelineClientSet, c.resolutionRequester, tr, vp)

// resolveTask accepts an impl of remote.Resolver and attempts to
// fetch a task with given name. An error is returned if the
// remoteresource doesn't work or the returned data isn't a valid
// v1beta1.TaskObject.
func resolveTask(ctx context.Context, resolver remote.Resolver, name string, kind v1beta1.TaskKind, k8s kubernetes.Interface) (v1beta1.TaskObject, *v1beta1.ConfigSource, error) {
	// Because the resolver will only return references with the same kind (eg ClusterTask), this will ensure we
	// don't accidentally return a Task with the same name but different kind.
	obj, configSource, err := resolver.Get(ctx, strings.TrimSuffix(strings.ToLower(string(kind)), "s"), name)

// Get implements remote.Resolver.
func (resolver *Resolver) Get(ctx context.Context, _, _ string) (runtime.Object, *v1beta1.ConfigSource, error) {
	resolverName := remoteresource.ResolverName(resolver.resolverName)
	req, err := buildRequest(resolver.resolverName, resolver.owner, resolver.targetName, resolver.targetNamespace, resolver.params)
	if err != nil {
		return nil, nil, fmt.Errorf("error building request for remote resource: %w", err)
	}
	resolved, err := resolver.requester.Submit(ctx, resolverName, req)

// Submit constructs a ResolutionRequest object and submits it to the
// kubernetes cluster, returning any errors experienced while doing so.
// If ResolutionRequest is succeeded then it returns the resolved data.
func (r *CRDRequester) Submit(ctx context.Context, resolver ResolverName, req Request) (ResolvedResource, error) {
	rr, _ := r.lister.ResolutionRequests(req.Namespace()).Get(req.Name())
	if rr == nil {
		if err := r.createResolutionRequest(ctx, resolver, req); err != nil {

2. "What cache are you referring to here?"

I mean the informer cache.

func (f *resolutionRequestInformer) defaultInformer(client versioned.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	return NewFilteredResolutionRequestInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)
}

func (f *resolutionRequestInformer) Informer() cache.SharedIndexInformer {
	return f.factory.InformerFor(&resolutionv1alpha1.ResolutionRequest{}, f.defaultInformer)
}

func (f *resolutionRequestInformer) Lister() v1alpha1.ResolutionRequestLister {
	return v1alpha1.NewResolutionRequestLister(f.Informer().GetIndexer())
}
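
To illustrate why a lister backed by an informer can miss an object that was just created, here is a toy model using only the standard library. It is a simplification under stated assumptions: `server`, `cachedLister`, `sync`, and `errAlreadyExists` are hypothetical names, and the explicit `sync` call stands in for the asynchronous watch events that update a real informer cache.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists") // hypothetical sentinel

// server is a toy API server: the source of truth, which also emits an
// event for every object it stores.
type server struct {
	objects map[string]bool
	events  chan string
}

func newServer() *server {
	return &server{objects: map[string]bool{}, events: make(chan string, 16)}
}

// Create stores the object, or rejects it if the name is already taken.
func (s *server) Create(name string) error {
	if s.objects[name] {
		return errAlreadyExists
	}
	s.objects[name] = true
	s.events <- name // reaches the cache only when events are processed
	return nil
}

// cachedLister answers reads from a local copy that is only updated when
// pending events are processed, much like a lister backed by an informer.
type cachedLister struct {
	cache map[string]bool
}

func (l *cachedLister) Get(name string) bool { return l.cache[name] }

// sync drains pending events into the cache without blocking.
func (l *cachedLister) sync(events <-chan string) {
	for {
		select {
		case name := <-events:
			l.cache[name] = true
		default:
			return
		}
	}
}

func main() {
	srv := newServer()
	lister := &cachedLister{cache: map[string]bool{}}

	fmt.Println("create #1:", srv.Create("request-a"))
	fmt.Println("lister sees it before sync:", lister.Get("request-a"))
	fmt.Println("create #2:", srv.Create("request-a"))

	lister.sync(srv.events)
	fmt.Println("lister sees it after sync:", lister.Get("request-a"))
}
```

The second create fails even though the lister still reports the object as missing, which is the window in which the error discussed in this thread appears.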

3. "Do you have a sense of why duplicate ResolutionRequests are being created?"

This is bound to happen whenever a TaskRun that uses a remote task is reconciled twice in quick succession during its initial creation phase.

4. "how you know that this fix resolved it?"

Previously, this occurred very frequently in my environment: about 8 times out of 10, it failed because of this issue.
I added some logs and confirmed that an AlreadyExists error did occur.

@l-qing l-qing changed the title Avoid failures when submitting resolution requests quickly Avoid occasional failures when using remote resolution Mar 24, 2023
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lbernick, QuanZhang-William

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved label Mar 24, 2023
@lbernick
Member

Thank you for the detailed explanation, and for assigning yourself to the issue I created!

@l-qing
Contributor Author

l-qing commented Mar 24, 2023

@chitrangpatel Hi, could you please review again and give an LGTM label?

@chitrangpatel
Contributor

/lgtm

@tekton-robot tekton-robot added the lgtm label Mar 27, 2023
@tekton-robot tekton-robot merged commit af30ab5 into tektoncd:main Mar 27, 2023
@l-qing
Contributor Author

l-qing commented Mar 27, 2023

@chitrangpatel Thank you!

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

it contains Tasks that don't exist: Couldn't retrieve Task ""
7 participants