
MachinePool workers support in ClusterClass #5991

Closed
19 of 39 tasks
dlipovetsky opened this issue Jan 25, 2022 · 50 comments
Assignees
Labels
area/clusterclass: Issues or PRs related to clusterclass
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/feature: Categorizes issue or PR as related to a new feature.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@dlipovetsky
Contributor

dlipovetsky commented Jan 25, 2022

User Story

As a user, I would like to create a ClusterClass-based Cluster where workers are managed using MachinePools instead of MachineDeployments. This allows me, for example, to use ClusterClass with an infrastructure provider "flavor" such as Cluster API Provider Azure's managed AKS, which supports only MachinePools.

Detailed Description

I want to be able to define a ClusterClass where workers are managed using MachinePools instead of MachineDeployments.

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  controlPlane:
...
  workers:
    machinePools:
      - class: example
        template:
          spec:
            clusterName: example
            replicas: 10
            template:
              spec:
                bootstrap:
                  ref:
                    apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
                    kind: KubeadmConfigTemplate
                    name: example-pool-0
                clusterName: example
                infrastructureRef:
                  apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
                  kind: AzureManagedMachinePool
                  name: example-pool-0
                version: v1.22.4
  infrastructure:
...

Anything else you would like to add:

This issue follows up on a Slack conversation.

Tasks

[EDIT: by @sbueringer] After further discussion, we will implement this via the following sub-tasks:

APIs #8820

  • Extend APIs (api/v1beta1)
    • Cluster
      • WorkersTopology: Add MachinePoolTopology slice
    • ClusterClass
      • WorkersClass: Add MachinePoolClass slice
      • PatchSelectorMatch: Add MachinePool
    • Note: excluding MachineHealthCheck fields

Controllers, webhooks, ... #9016

  • Reconcile MachinePools in Cluster topology (internal/controllers/topology/cluster)
    • Basic reconcile
      • blueprint.go: also get MachinePool blueprint
      • current_state.go: also get MachinePool current state
      • desired_state.go: calculate desired MachinePool like MachineDeployment
      • reconcile_state.go: reconcile MachinePool like MachineDeployment
      • cluster_controller.go: also watch MachinePools
      • Surface MachinePool state like MachineDeployment state in conditions (conditions.go)
    • Support patching MachinePools (a sketch of such a patch follows this task list):
      • Extend patch engine to also handle MachinePools
      • Extend json patch generator to handle MachinePools
      • Add MachinePool builtin variables
  • Add corresponding validation in internal/webhooks/
    • Includes additions in internal/topology/check/ & internal/topology/variables/{cluster_variable_defaulting.go, cluster_variable_validation.go}
  • Extend internal/controllers/ClusterClass
    • Add the same handling that we already have for MachineDeployments
  • Extend CAPD
    • Add DockerMachinePoolTemplate
  • Extend test/extension
    • Also patch DockerMachinePoolTemplate
  • Extend quickstart test to also cover MachinePools
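
For the patching work, here is a minimal sketch of what a MachinePool-aware inline patch could look like. It assumes the new selector and builtin variables mirror the existing MachineDeployment ones: machinePoolClass and builtin.machinePool.* are the expected analogues of machineDeploymentClass and builtin.machineDeployment.*, and the patch name and paths are purely illustrative.

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  patches:
  - name: machine-pool-commands   # illustrative name
    definitions:
    - selector:
        apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
        kind: KubeadmConfigTemplate
        matchResources:
          # Expected analogue of machineDeploymentClass: selects which
          # MachinePool classes this patch applies to.
          machinePoolClass:
            names:
            - example
      jsonPatches:
      - op: add
        path: /spec/template/spec/preKubeadmCommands
        valueFrom:
          # Expected analogue of the builtin.machineDeployment variables,
          # resolved per MachinePool topology in the Cluster.
          template: |
            - echo "machine pool {{ .builtin.machinePool.name }} with {{ .builtin.machinePool.replicas }} replicas"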

unit & integration tests

Review the existing tests and the code added for MachinePools, and add corresponding test coverage for MachinePools.

Cluster topology controller:

Inline patching:

Webhooks:

  • internal/topology/check/compatibility_test.go
  • internal/webhooks/cluster_test.go
  • internal/webhooks/clusterclass_test.go
  • internal/webhooks/patch_validation_test.go

ClusterClass controller:

  • internal/controllers/clusterclass/clusterclass_controller_test.go

External patching: test extension:

  • test/extension/handlers/topologymutation/handler_test.go

clusterctl alpha topology dry-run:

  • cmd/clusterctl/client/cluster/topology_test.go

Verification:

  • Run unit & integration tests with coverage and check if we missed something

e2e tests

  • First wave:
  • Afterwards:
    • ClusterClass changes & rollout (clusterclass_changes.go & clusterclass_rollout, flavor: "topology")
      • Ensure the assertions also verify the labels/annotations (including their in-place rollout)
    • clusterctl upgrade (clusterctl_upgrade.go, workload flavor: "topology")
    • Runtime SDK (cluster_upgrade_runtimesdk.go, flavor: "upgrades-runtimesdk")
    • Self-hosted (self_hosted.go, flavor: "topology")
    • Autoscaler (autoscaler.go, flavor: "topology-autoscaler")
    • Dualstack e2e tests
  • Audit whether any Cluster YAMLs are left in the test folder that don't have a machinePool in Cluster.spec.topology

Misc

  • TBD: internal/controllers/topology/machinepool might not be needed, as MachinePool.reconcileDelete deletes templates

Documentation

  • Update proposal
  • Update docs (probably mostly book)
    • lifecycle hooks
    • builtin variables (see the list sketched below)
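
As a rough reference for the builtin-variables docs, the MachinePool builtins are expected to mirror the existing builtin.machineDeployment variables; the exact names below are assumptions to be confirmed against the implementation:

builtin.machinePool.name
builtin.machinePool.class
builtin.machinePool.replicas
builtin.machinePool.version
builtin.machinePool.bootstrap.configRef.name
builtin.machinePool.infrastructureRef.name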

Later

  • TBD: consider de-duplicating MD/MP code
    • [Stefan]: Did some deduplication. I would personally keep everything else separate, but I'm open to discussion.
  • TBD: MachinePool upgrade should use a Node list instead of a Node get
  • MHC for MachinePools: depends on MachinePool Machines

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 25, 2022
@sbueringer
Member

sbueringer commented Jan 26, 2022

Thx for filing this issue.
As mentioned in the linked thread, +1 if we don't have concerns about embedding support for an experimental feature in another experimental feature (which then leaks into the non-experimental Cluster resource :)).

Nit regarding the YAML: I would imagine it to look roughly like this (essentially s/machineDeployments/machinePools/, plus or minus some additional fields to configure the MachinePool where they make sense):

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example
spec:
  controlPlane:
...
  workers:
    machinePools:
    - class: example
      template:
        bootstrap:
          ref:
            apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
            kind: KubeadmConfigTemplate
            name: example-pool-0
        infrastructure:
          ref:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
            kind: AzureManagedMachinePool
            name: example-pool-0
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example
spec:
  topology:
    class: example
    version: v1.22.4
    ...
    workers:
      machinePools:
      - class: "example"
        name: "mp-0"
        replicas: 10

@fabriziopandini
Member

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@pydctw

pydctw commented Apr 26, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2022
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted

@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2022
@fabriziopandini
Member

/lifecycle frozen
/help

This is something to complete before moving ClusterClass or MachinePools to the next phase of graduation.

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/lifecycle frozen
/help

This is something to complete before moving ClusterClass or MachinePools to the next phase of graduation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 2, 2022
@richardcase
Member

richardcase commented Feb 3, 2023

We'd love this feature for CAPA & CAPG, so I can work on this:

/assign

@abdurrehman107
Contributor

abdurrehman107 commented Sep 7, 2023

Hey everyone
I'd love to work on the following files for unit tests:

  • internal/controllers/topology/cluster/scope_test.go
  • internal/controllers/topology/cluster/state_test.go

I'll get started working on this and will follow up if I have any questions.

@sbueringer
Member

sbueringer commented Sep 7, 2023

I assigned you above. Thank you for helping!

@abdurrehman107
Contributor

Thank you @sbueringer

@sbueringer
Member

sbueringer commented Sep 8, 2023

@willie-yao Finally found the time to take a look at our e2e tests. I identified all the e2e tests which use ClusterClass and should therefore eventually also cover MachinePools. I suggest we first start with a PR to cover quickstart. I'm not sure if that will already make it necessary to modify other tests (as a lot of them use the same "topology" cluster template). For now I would try not to parallelize and see how the first few PRs go. WDYT?

@willie-yao
Contributor

@sbueringer That sounds perfect! Do you mean we shouldn't parallelize the E2E tests with unit tests, or are we good to start working on E2E now?

@sbueringer
Member

@willie-yao We're good to work on it now. I just wouldn't parallelize the e2e test PRs. I suspect that once we touch one of them we might have to fix a bunch more (not sure, but it could happen because we have to change the "topology" flavor, which is used in a lot of e2e tests). So parallelizing could lead to duplicate work.

@willie-yao
Contributor

willie-yao commented Sep 11, 2023

@sbueringer I'll be working on Quickstart (quick_start.go, flavor: "topology,topology-dualstack-.*") (skip additional ownerref validation)

@abdurrehman107
Contributor

Hey @sbueringer. So I was going through scope_test.go to add unit tests and I do not think there is anything to be added in it. I might be mistaken. Can you please verify?

@chrischdi
Member

chrischdi commented Sep 13, 2023

Hey @sbueringer. So I was going through scope_test.go to add unit tests and I do not think there is anything to be added in it. I might be mistaken. Can you please verify?

Agree, there's nothing left for scope_test.go with regard to adding test coverage 👍 we can mark that as resolved / skip that one. (Besides that, it already has 100% coverage :-) )

@sbueringer
Member

@willie-yao Updated the e2e test list. #9393 also already fixed the ownerref part, which is great 🎉

@killianmuldoon
Contributor

Note: It looks like a lot of flakiness has been introduced since MachinePools were added to the ClusterClass self-hosted tests. Those flakes currently look to be blocking the release.

I'm wondering if someone with more expertise in MachinePools is able to take a look: #9522.

Otherwise we might want to remove MPs from the impacted tests ahead of the 1.6 release.

@sbueringer
Member

We could consider disabling them on the release branch to have a clear signal for the release and continue debugging on main?

@fabriziopandini
Member

@sbueringer @willie-yao Can we close this issue (possibly moving pending tasks to separate issues if there are only a few, or to a new issue where we can track Part 2 of this work starting from a clean state)?

@willie-yao
Contributor

Can we close this issue (possibly moving pending tasks to separate issues if there are only a few, or to a new issue where we can track Part 2 of this work starting from a clean state)?

I think that's a good idea! It's a bit harder to track with everything on one issue and all that's left are some unit tests.

@fabriziopandini
Member

/close

@willie-yao will follow up with new issues for what is left

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close

@willie-yao will follow up with new issues for what is left

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

Not entirely sure what the upside is to move half of an umbrella issue to a new issue, but the new issue looks good to me. So let's keep it as is.

@fabriziopandini
Member

Not entirely sure what the upside is to move half of an umbrella issue to a new issue

It is just to have a cleaner scope for 1.7 (and possibly an owner for it).
cc @jackfrancis as well
