Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Reconcile target groups and listeners as their own entities #5004

Merged
merged 6 commits into from
Jun 11, 2024

Conversation

nrb
Copy link
Contributor

@nrb nrb commented Jun 5, 2024

Signed-off-by: Nolan Brubaker nolan@nbrubaker.com

What type of PR is this?

/kind bug

What this PR does / why we need it:

If a load balancer is created but the calls to create a target group or
listener fails (for example, due to a rate limit), creating the group or
listener is never retried. This results in a load balancer that's
created, but there is nothing attached to it, making it useless.

This change reconciles the target groups and listeners as their own entities, so that even if they fail to create initially, we'll keep retrying.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5002

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

    Reconcile target groups and listeners independently of load balancers to handle failures more gracefully.

If a load balancer is created but the calls to create a target group or
listener fails (for example, due to a rate limit), creating the group or
listener is never retried. This results in a load balancer that's
created, but there is nothing attached to it, making it useless.

Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 5, 2024
nrb added 2 commits June 6, 2024 13:46
Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 6, 2024
@nrb
Copy link
Contributor Author

nrb commented Jun 6, 2024

/test ?

@k8s-ci-robot
Copy link
Contributor

@nrb: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nrb
Copy link
Contributor Author

nrb commented Jun 6, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass

@nrb nrb marked this pull request as ready for review June 6, 2024 18:48
@nrb
Copy link
Contributor Author

nrb commented Jun 6, 2024

/retitle 🐛 Reconcile target groups and listeners as their own entities

@k8s-ci-robot k8s-ci-robot changed the title WIP: 🐛 Reconcile target groups and listeners as their own entities 🐛 Reconcile target groups and listeners as their own entities Jun 6, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2024
@nrb
Copy link
Contributor Author

nrb commented Jun 6, 2024

/assign @damdo

Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@nrb
Copy link
Contributor Author

nrb commented Jun 6, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass

* reconcileTargetGroupsAndListeners needs both the desired specification
  as well as the ARN assigned by the AWS API calls.
* Rename `spec` in reconcileV2LB to `desiredLB` to clarify what data is
  actually held in it.

Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
@nrb
Copy link
Contributor Author

nrb commented Jun 7, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass

@r4f4 I clarified some of the variables and updated the arguments for reconcileTargetGroupsAndListeners. 🤞 that e2es are happy again.

@r4f4
Copy link
Contributor

r4f4 commented Jun 7, 2024

LGTM. I'll give it a try again.

@r4f4
Copy link
Contributor

r4f4 commented Jun 7, 2024

Got a successful install this time.
/lgtm

@k8s-ci-robot
Copy link
Contributor

@r4f4: changing LGTM is restricted to collaborators

In response to this:

Got a successful install this time.
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, just a question/comment.

Comment on lines 1553 to 1555
if len(existingTargetGroups.TargetGroups) == len(existingListeners.Listeners) && len(existingListeners.Listeners) == len(spec.ELBListeners) {
return existingTargetGroups.TargetGroups, existingListeners.Listeners, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment on why this is necessary? Is this an optimization?

Also is this condition always a guarantee there is nothing to do?
Especially looking at, could there be a case where the count is off and doing g.TargetGroupName == ln.TargetGroup.Name naming matching would be better?

len(existingTargetGroups.TargetGroups) == len(existingListeners.Listeners)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it was an attempt to optimize so we don't go through the full loop every time.

Matching on the name might be better. I suppose it's possible that a user could create more target groups in AWS than CAPA knows about, though I'm not sure why they'd do that.

In any case, if we check that the target group names line up between AWS and our spec, we're still repeating the loop every time. The main benefit there would be that we could potentially skip doing more API calls within a single iteration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason we saw the LB creation fail was because of API call rate limiting in very busy envs, so avoiding extra calls when possible would be nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking this through some more, I think the only numeric comparison we can do that's valid all the time is if len(existingTargetGroups) < len(desiredTargetGroups) { reconcile }. Even if the lists are equal in length, it doesn't guarantee that they're the exact same names.

I'll look through this and see if I can do any further optimizations, but I'll at least remove this length check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the productive back-and-forth here.

Please consider a future PR for optimizations, because this PR is already complex (although the new tests are very helpful, they themselves are complex). 🙏

@dlipovetsky
Copy link
Contributor

/milestone v2.5.1

@k8s-ci-robot k8s-ci-robot added this to the v2.5.1 milestone Jun 10, 2024
spec func(spec infrav1.LoadBalancer) infrav1.LoadBalancer
}{
{
name: "main create flow",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this more descriptive

@dlipovetsky
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dlipovetsky

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2024
@dlipovetsky
Copy link
Contributor

/milestone v2.6.0
/milestone v2.5.1

@k8s-ci-robot k8s-ci-robot modified the milestones: v2.5.1, v2.6.0 Jun 10, 2024
When the optimization was enabled, we were missing a bunch of creation
calls that need to get defined.

Add some partial matcher support in order to allow for generated names.

Signed-off-by: Nolan Brubaker <nolan@nbrubaker.com>
Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

/hold
@nrb feel free to unhold once happy with it merging

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jun 11, 2024
@nrb
Copy link
Contributor Author

nrb commented Jun 11, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass

@nrb
Copy link
Contributor Author

nrb commented Jun 11, 2024

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2024
@k8s-ci-robot k8s-ci-robot merged commit daedbe0 into kubernetes-sigs:main Jun 11, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Load Balancer creation is not re-entrant, ELB Listeners (+ target group) failure is never retried
5 participants