
MachinePool workers support in ClusterClass #5991

Closed
19 of 39 tasks
dlipovetsky opened this issue Jan 25, 2022 · 50 comments
Assignees
Labels
area/clusterclass: Issues or PRs related to clusterclass
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@dlipovetsky
Contributor

dlipovetsky commented Jan 25, 2022

User Story

As a user, I would like to create a ClusterClass-based Cluster where workers are managed using MachinePools instead of MachineDeployments. This allows me, for example, to use ClusterClass with an infrastructure provider "flavor" such as Cluster API Provider Azure's managed AKS, which supports only MachinePools.

Detailed Description

I want to be able to define a ClusterClass where workers are managed using MachinePools instead of MachineDeployments.

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  controlPlane:
...
  workers:
    machinePools:
      - class: example
        template:
          spec:
            clusterName: example
            replicas: 10
            template:
              spec:
                bootstrap:
                  ref:
                    apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
                    kind: KubeadmConfigTemplate
                    name: example-pool-0
                clusterName: example
                infrastructureRef:
                  apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
                  kind: AzureManagedMachinePool
                  name: example-pool-0
                version: v1.22.4
  infrastructure:
...

Anything else you would like to add:

This issue follows up on a Slack conversation.

Tasks

[EDIT: by @sbueringer] After further discussion, we will implement this via the following sub-tasks:

APIs #8820

  • Extend APIs (api/v1beta1)
    • Cluster
      • WorkersTopology: Add MachinePoolTopology slice
    • ClusterClass
      • WorkersClass: Add MachinePoolClass slice
      • PatchSelectorMatch: Add MachinePool
    • Note: excluding MachineHealthCheck fields

Controllers, webhooks, ... #9016

  • Reconcile MachinePools in Cluster topology (internal/controllers/topology/cluster)
    • Basic reconcile
      • blueprint.go: also get MachinePool blueprint
      • current_state.go: also get MachinePool current state
      • desired_state.go: calculate desired MachinePool like MachineDeployment
      • reconcile_state.go: reconcile MachinePool like MachineDeployment
      • cluster_controller.go: also watch MachinePools
      • Surface MachinePool state like MachineDeployment state in conditions (conditions.go)
    • Support patching MachinePools (a sketch of such a patch follows this task list):
      • Extend patch engine to also handle MachinePools
      • Extend json patch generator to handle MachinePools
      • Add MachinePool builtin variables
  • Add corresponding validation in internal/webhooks/
    • Includes additions in internal/topology/check/ & internal/topology/variables/{cluster_variable_defaulting.go, cluster_variable_validation.go}
  • Extend internal/controllers/ClusterClass
    • Add the same handling that we already have for MachineDeployments
  • Extend CAPD
    • Add DockerMachinePoolTemplate
  • Extend test/extension
    • Also patch DockerMachinePoolTemplate
  • Extend quickstart test to also cover MachinePools
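
For the patching work, here is a minimal sketch of what a MachinePool-aware inline patch could look like. It assumes the new selector and builtin variables mirror the existing MachineDeployment ones: machinePoolClass and builtin.machinePool.* are the expected analogues of machineDeploymentClass and builtin.machineDeployment.*, and the patch name and paths are purely illustrative.

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  patches:
  - name: machine-pool-commands   # illustrative name
    definitions:
    - selector:
        apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
        kind: KubeadmConfigTemplate
        matchResources:
          # Expected analogue of machineDeploymentClass: selects which
          # MachinePool classes this patch applies to.
          machinePoolClass:
            names:
            - example
      jsonPatches:
      - op: add
        path: /spec/template/spec/preKubeadmCommands
        valueFrom:
          # Expected analogue of the builtin.machineDeployment variables,
          # resolved per MachinePool topology in the Cluster.
          template: |
            - echo "machine pool {{ .builtin.machinePool.name }} with {{ .builtin.machinePool.replicas }} replicas"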

unit & integration tests

Review the existing tests and the code added for MachinePools, and add corresponding test coverage for MachinePools.

Cluster topology controller:

Inline patching:

Webhooks:

  • internal/topology/check/compatibility_test.go
  • internal/webhooks/cluster_test.go
  • internal/webhooks/clusterclass_test.go
  • internal/webhooks/patch_validation_test.go

ClusterClass controller:

  • internal/controllers/clusterclass/clusterclass_controller_test.go

External patching: test extension:

  • test/extension/handlers/topologymutation/handler_test.go

clusterctl alpha topology dry-run:

  • cmd/clusterctl/client/cluster/topology_test.go

Verification:

  • Run unit & integration tests with coverage and check if we missed something

e2e tests

  • First wave:
  • Afterwards:
    • ClusterClass changes & rollout (clusterclass_changes.go & clusterclass_rollout, flavor: "topology")
      • Ensure the assertions also verify the labels/annotations (including their in-place rollout)
    • clusterctl upgrade (clusterctl_upgrade.go, workload flavor: "topology")
    • Runtime SDK (cluster_upgrade_runtimesdk.go, flavor: "upgrades-runtimesdk")
    • Self-hosted (self_hosted.go, flavor: "topology")
    • Autoscaler (autoscaler.go, flavor: "topology-autoscaler")
    • Dualstack e2e tests
  • Audit whether any Cluster YAMLs are left in the test folder that don't have a machinePool in Cluster.spec.topology

Misc

  • TBD: internal/controllers/topology/machinepool might not be needed, as MachinePool.reconcileDelete deletes templates

Documentation

  • Update proposal
  • Update docs (probably mostly book)
    • lifecycle hooks
    • builtin variables (see the list sketched below)
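
As a rough reference for the builtin-variables docs, the MachinePool builtins are expected to mirror the existing builtin.machineDeployment variables; the exact names below are assumptions to be confirmed against the implementation:

builtin.machinePool.name
builtin.machinePool.class
builtin.machinePool.replicas
builtin.machinePool.version
builtin.machinePool.bootstrap.configRef.name
builtin.machinePool.infrastructureRef.name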

Later

  • TBD: consider de-duplicating MD/MP code
    • [Stefan]: Did some deduplication. I would personally keep everything else separate, but I'm open to discussion.
  • TBD: MachinePool upgrade should use a Node list instead of a Node get
  • MHC for MachinePools: depends on MachinePool Machines

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 25, 2022
@sbueringer
Member

sbueringer commented Jan 26, 2022

Thx for filing this issue.
As mentioned in the linked thread, +1 if we don't have concerns about embedding support for an experimental feature in another experimental feature (which then leaks into the non-experimental Cluster resource :)).

Nit regarding the YAML: I would imagine it to look roughly like this (essentially s/machineDeployments/machinePools/, plus or minus some additional fields to configure the MachinePool where they make sense):

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  controlPlane:
...
  workers:
    machinePools:
    - class: example
      template:
        bootstrap:
          ref:
            apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
            kind: KubeadmConfigTemplate
            name: example-pool-0
        infrastructure:
          ref:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
            kind: AzureManagedMachinePool
            name: example-pool-0
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example
spec:
  topology:
    class: example
    version: v1.22.4
    ...
    workers:
      machinePools:
      - class: "example"
        name: "mp-0"
        replicas: 10

@fabriziopandini
Member

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@pydctw

pydctw commented Apr 26, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted

@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2022
@fabriziopandini
Member

/lifecycle frozen
/help

This is something to complete before moving ClusterClass or MachinePools to the next phase of graduation.

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/lifecycle frozen
/help

This is something to complete before moving ClusterClass or MachinePools to the next phase of graduation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 2, 2022
@richardcase
Member

richardcase commented Feb 3, 2023

We'd love this feature for CAPA & CAPG, so I can work on this:

/assign

@abdurrehman107
Contributor

abdurrehman107 commented Sep 7, 2023

Hey everyone
I'd love to work on the following files for unit tests:

  • internal/controllers/topology/cluster/scope_test.go
  • internal/controllers/topology/cluster/state_test.go

I'll get started working on this and will follow up if I have any questions.

@sbueringer
Member

sbueringer commented Sep 7, 2023

I assigned you above. Thank you for helping!

@abdurrehman107
Contributor

Thank you @sbueringer

@sbueringer
Member

sbueringer commented Sep 8, 2023

@willie-yao Finally found the time to take a look at our e2e tests. I identified all the e2e tests which use ClusterClass and should therefore eventually also cover MachinePools. I suggest we first start with a PR to cover quickstart. I'm not sure if that will already make it necessary to modify other tests (as a lot of them use the same "topology" cluster template). For now I would try not to parallelize and see how the first few PRs go. WDYT?

@willie-yao
Contributor

@sbueringer That sounds perfect! Do you mean we shouldn't parallelize the E2E tests with unit tests, or are we good to start working on E2E now?

@sbueringer
Member

@willie-yao We're good to work on it now. I just wouldn't parallelize the e2e test PRs. I suspect that once we touch one of them we might have to fix a bunch more (not sure, but it could happen because we have to change the "topology" flavor, which is used in a lot of e2e tests). So parallelizing could lead to duplicate work.

@willie-yao
Contributor

willie-yao commented Sep 11, 2023

@sbueringer I'll be working on Quickstart (quick_start.go, flavor: "topology,topology-dualstack-.*") (skip additional ownerref validation)

@abdurrehman107
Contributor

Hey @sbueringer. So I was going through scope_test.go to add unit tests and I do not think there is anything to be added in it. I might be mistaken. Can you please verify?

@chrischdi
Member

chrischdi commented Sep 13, 2023

Hey @sbueringer. So I was going through scope_test.go to add unit tests and I do not think there is anything to be added in it. I might be mistaken. Can you please verify?

Agree, there's nothing left for scope_test.go with regard to adding test coverage 👍 we can mark that as resolved / skip that one. (Besides that, it already has 100% coverage :-) )

@sbueringer
Member

@willie-yao Updated the e2e test list. #9393 also already fixed the ownerref part, which is great 🎉

@killianmuldoon
Contributor

Note: It looks like a lot of flakiness has been introduced since MachinePools were added to the ClusterClass self-hosted tests. Those flakes currently look to be blocking the release.

I'm wondering if someone with more expertise in MachinePools is able to take a look: #9522.

Otherwise we might want to remove MPs from the impacted tests ahead of the 1.6 release.

@sbueringer
Member

We could consider disabling them on the release branch to have a clear signal for the release and continue debugging on main?

@fabriziopandini
Member

@sbueringer @willie-yao Can we close this issue (possibly moving pending tasks to separate issues if there are only a few, or to a new issue where we can track Part 2 of this work starting from a clean state)?

@willie-yao
Contributor

Can we close this issue (possibly moving pending tasks to separate issues if there are only a few, or to a new issue where we can track Part 2 of this work starting from a clean state)?

I think that's a good idea! It's a bit harder to track with everything on one issue and all that's left are some unit tests.

@fabriziopandini
Member

/close

@willie-yao will follow up with new issues for what is left

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

@willie-yao will follow up with new issues for what is left

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

Not entirely sure what the upside is to move half of an umbrella issue to a new issue, but the new issue looks good to me. So let's keep it as is.

@fabriziopandini
Member

Not entirely sure what the upside is to move half of an umbrella issue to a new issue

It is just to have a cleaner scope for 1.7 (and possibly an owner for it).
cc @jackfrancis as well
