
MachinePool remains in WaitingForReplicasReady because CAPA does not reconcile node references after instance refresh #4618

Open
AndiDog opened this issue Nov 6, 2023 · 10 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@AndiDog
Contributor

AndiDog commented Nov 6, 2023

/kind bug

What steps did you take and what happened:

Related to kubernetes-sigs/cluster-api#8858, #4071

CAPA's AWSMachinePool reconciler unconditionally returns `ctrl.Result{}, r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope)`, i.e. it never requeues, so the ASG's EC2 instances are not reconciled into `.Status.Instances` at regular intervals (sketched below).
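
For illustration, a minimal sketch of that control flow (scope setup and the real parameter list omitted; this is not the literal CAPA code):

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// AWSMachinePoolReconciler is a stand-in for the real CAPA reconciler,
// reduced to the return pattern discussed in this issue.
type AWSMachinePoolReconciler struct{}

func (r *AWSMachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the AWSMachinePool, build machinePoolScope / infraScope ...

	// The empty ctrl.Result means controller-runtime only reconciles this object
	// again when a watch event arrives for it. There is no periodic requeue, so
	// instances rolled by an ASG instance refresh are not synced into
	// .Status.Instances / .Spec.ProviderIDList until some unrelated event
	// triggers another reconcile.
	return ctrl.Result{}, r.reconcileNormal(ctx)
}

func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context) error {
	// ... reconcile launch template, ASG, provider ID list, instance status ...
	return nil
}
```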

I made a change where CAPA triggers an instance refresh (e.g. on change of AMI IDs), rolling out new EC2 instances. The parent MachinePool object remained in a non-ready state with reason WaitingForReplicasReady, with CAPI continuously logging NodeRefs != ReadyReplicas messages. Only the next, essentially random reconciliation of my AWSMachinePool object resolved this by checking which instances exist in the ASG.

What did you expect to happen:

CAPA should reconcile regularly in order to check the ASG for a changed set of instances, particularly when a change is expected because CAPA itself triggered an instance refresh.

Environment:

@k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on Nov 6, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cnmcavoy
Contributor

cnmcavoy commented Nov 9, 2023

In your opinion, is this resolved by the MachinePools Machines implementation in CAPA? If not, what is missing that would need to be added?

#4527

@AndiDog
Contributor Author

AndiDog commented Nov 13, 2023

@cnmcavoy I think your PR is separate from this issue. CAPI reconciles based on AWSMachinePool.Spec.ProviderIDList (unless you tell me that will change once infra providers create <Infra>Machine objects for machine pools?). That field is already correctly updated in awsmachinepool_controller.go, but CAPA does not regularly update it when the ASG (or an explicit instance refresh) creates/rolls instances.
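
For context, the provider ID list is conceptually derived from the ASG's instances roughly like this (a hedged sketch; type, field, and function names are assumptions, not the exact code in awsmachinepool_controller.go):

```go
package controllers

import "fmt"

// Instance is a minimal stand-in for CAPA's ASG instance type.
type Instance struct {
	ID               string
	AvailabilityZone string
}

// providerIDsFromASG rebuilds the provider ID list from the instances CAPA sees
// in the ASG. AWS provider IDs have the form aws:///<availability-zone>/<instance-id>.
// The point of this issue: nothing re-runs this on a timer, so the list goes
// stale after an instance refresh until the next (unrelated) reconcile.
func providerIDsFromASG(instances []Instance) []string {
	ids := make([]string, 0, len(instances))
	for _, inst := range instances {
		ids = append(ids, fmt.Sprintf("aws:///%s/%s", inst.AvailabilityZone, inst.ID))
	}
	return ids
}
```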

@cnmcavoy
Contributor

> @cnmcavoy I think your PR is separate from this issue. CAPI reconciles based on AWSMachinePool.Spec.ProviderIDList (unless you tell me that will change once infra providers create <Infra>Machine objects for machine pools?). That field is already correctly updated in awsmachinepool_controller.go, but CAPA does not regularly update it when the ASG (or an explicit instance refresh) creates/rolls instances.

Correct... I agree that this isn't solved by #4527.

My understanding is that the solution requires a way to detect any change in the status of an ASG's instances and trigger a new reconcile of the AWSMachinePool. One approach would be to implement this on top of the work in #4527 and have the AWSMachines enqueue their AWSMachinePool when their status changes (sketched below).
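
A rough sketch of that first approach, assuming the AWSMachine objects carry an owner reference to their AWSMachinePool (which may differ from what #4527 actually sets); the mapping function and the wiring shown in the comment are illustrative, using the controller-runtime v0.15+ handler signature:

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// awsMachineToOwningMachinePool is an illustrative map function: whenever an
// AWSMachine changes, enqueue a reconcile request for the AWSMachinePool named
// in its owner references.
func awsMachineToOwningMachinePool(ctx context.Context, obj client.Object) []reconcile.Request {
	for _, ref := range obj.GetOwnerReferences() {
		if ref.Kind == "AWSMachinePool" {
			return []reconcile.Request{{
				NamespacedName: client.ObjectKey{Namespace: obj.GetNamespace(), Name: ref.Name},
			}}
		}
	}
	return nil
}

// Wiring (inside SetupWithManager, sketched):
//
//	ctrl.NewControllerManagedBy(mgr).
//	    For(&expinfrav1.AWSMachinePool{}).
//	    Watches(&infrav1.AWSMachine{}, handler.EnqueueRequestsFromMapFunc(awsMachineToOwningMachinePool)).
//	    Complete(r)
```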

Alternatively, another approach would be to use AWS events and set up the resources to receive them. I believe there is a way to have AWS send a notification when the ASG changes.

@AndiDog
Contributor Author

AndiDog commented Nov 15, 2023

A bulletproof solution would be to reconcile AWSMachinePool every 1-5 minutes (configurable?!). That holds whether or not events are used, since events may not arrive reliably if the controller or network is misconfigured (assuming such an event feature were implemented).

There's Amazon EventBridge, but it mainly performs actions in other AWS services, so I'm not sure whether it could trigger a call to a controller webhook in order to make it reconcile.

I like the idea of observing the AWSMachine state and bubbling that up to the AWSMachinePool. An ASG instance refresh includes node termination after some minutes, so there's an event on the Node (which we don't watch or finalize, I assume?), or a Kubernetes Event (which we don't watch). An ASG scale-up (instance added) might, in the success case, produce a "node added" event in Kubernetes. Did you have something in mind for how the event observation could technically work?

If we don't have a clear idea, should we first fix the low-hanging fruit and use a regular reconciliation interval (`RequeueAfter`, see the sketch below)?
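
For completeness, that low-hanging fruit is essentially this change to the earlier sketch, where only the returned `ctrl.Result` differs (the interval is an arbitrary example value and would presumably be made configurable, not something CAPA currently uses):

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Reconcile continues the earlier sketch of the stand-in AWSMachinePoolReconciler.
func (r *AWSMachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the AWSMachinePool, build scopes ...

	if err := r.reconcileNormal(ctx); err != nil {
		return ctrl.Result{}, err
	}
	// Requeue on a timer so ASG-side changes (instance refresh, scale in/out)
	// are picked up even without a Kubernetes watch event.
	return ctrl.Result{RequeueAfter: 3 * time.Minute}, nil
}
```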

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 14, 2024
@AndiDog
Contributor Author

AndiDog commented Apr 10, 2024

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Apr 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@AndiDog
Contributor Author

AndiDog commented Jul 17, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jul 17, 2024