
Support for Failure Domains in CAPM3 #402

Open
Arvinderpal opened this issue Nov 24, 2021 · 28 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue is ready to be actively worked on.

Comments

@Arvinderpal
Contributor

User Story

As an operator who has placed their baremetal infrastructure across different failure domains (FDs), I would like CAPM3 to associate Nodes with BMHs from the desired failure domain.

Detailed Description

CAPI supports failure domains for both control-plane and worker nodes (see the CAPI provider contract for the Provider Machine as well as the Provider Cluster types). Here is the general flow:

  1. CAPI will look for the set of FailureDomains in the ProviderCluster.Spec
  2. The field is copied to the Cluster.Status.FailureDomains
  3. During KCP or MD scale-up events, an FD will be chosen from this set and its value placed in Machine.Spec.FailureDomain. Currently, CAPI tries to balance Machines equally across all FDs.
  4. Providers are expected to use the chosen FD in Machine.Spec when deciding where to place the provider-specific machine. In the case of Metal3, we want CAPM3 to associate the Metal3Machine with a BMH in the desired FD (see the sketch after this list).
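
To make steps 1-4 concrete, here is a minimal sketch of how the fields could line up. The Cluster.Status.FailureDomains and Machine.Spec.FailureDomain fields already exist in CAPI v1beta1; the failureDomains field on the Metal3Cluster spec and the FD names (my-fd-1, my-fd-2) are placeholders for what this issue proposes, not the current CAPM3 API:

```yaml
# Hypothetical Metal3Cluster field (proposed by this issue, not in CAPM3 today):
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Cluster
metadata:
  name: my-cluster
spec:
  failureDomains:        # placeholder name for the proposed field
    my-fd-1:
      controlPlane: true
    my-fd-2:
      controlPlane: true
---
# Step 2: CAPI copies the set into the Cluster status (existing CAPI behavior):
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
status:
  failureDomains:
    my-fd-1:
      controlPlane: true
    my-fd-2:
      controlPlane: true
---
# Step 3: on scale-up, KCP/MD picks an FD and records it on the Machine:
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-cluster-controlplane-abcde
spec:
  clusterName: my-cluster
  failureDomain: my-fd-1
```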

BMH Selection Using Labels

  1. The operator labels the BMH resource based on the physical location of the host. For example, the following standard label could be used on the BMH:
    infrastructure.cluster.x-k8s.io/failure-domain=<my-fd-1>
  2. Today, CAPM3's chooseHost() function associates a Metal3Machine with a specific BMH based on the labels specified in Metal3Machine.Spec.HostSelector.MatchLabels. We can expand this capability.
  3. The HostSelector field narrows down the set of available BMHs to those that meet the selection criteria. When FDs are being used, we can simply insert the above label into HostSelector.MatchLabels (see the sketch after this list).
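
As a rough sketch of steps 1-3, assuming the existing metal3.io/v1alpha1 BareMetalHost and the hostSelector.matchLabels field on the Metal3Machine spec: the Metal3Machine below shows the effective selector CAPM3 could compute by appending the FD label taken from Machine.Spec.FailureDomain (resource names and image URLs are placeholders):

```yaml
# Step 1: the operator labels each BMH with its physical failure domain:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: node-0
  labels:
    infrastructure.cluster.x-k8s.io/failure-domain: my-fd-1
spec:
  online: true
---
# Steps 2-3: the effective selector chooseHost() could use when the parent
# Machine has spec.failureDomain: my-fd-1 (a sketch, not current behavior):
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3Machine
metadata:
  name: my-cluster-controlplane-abcde
spec:
  hostSelector:
    matchLabels:
      infrastructure.cluster.x-k8s.io/failure-domain: my-fd-1
  image:
    url: http://example.com/node-image.qcow2              # placeholder
    checksum: http://example.com/node-image.qcow2.sha256  # placeholder
```

In this scheme, users would only set any additional hostSelector labels themselves; the FD label would be injected by CAPM3 from the chosen Machine.Spec.FailureDomain.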

Anything else you would like to add:

Related issues:
kubernetes-sigs/cluster-api#5666
kubernetes-sigs/cluster-api#5667

/kind feature

@metal3-io-bot metal3-io-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 24, 2021
@Arvinderpal
Contributor Author

/assign

@Arvinderpal
Contributor Author

@fmuyassarov @kashifest @furkatgofurov7 @maelk
Appreciate your thoughts on this. I would be happy to put together a PR for this.

@furkatgofurov7
Member

furkatgofurov7 commented Jan 21, 2022

@Arvinderpal hi! Thanks for taking this up and sorry for the late reply. The addition looks interesting, and going through the linked issues and related PRs, some work on improving FD support in CAPI seems to be ongoing. Also, just wondering, how is the situation with other providers (bare-metal ones in particular)? Do they already support this feature?

@Rozzii
Member

Rozzii commented Feb 3, 2022

/triage accepted

@metal3-io-bot metal3-io-bot added the triage/accepted Indicates an issue is ready to be actively worked on. label Feb 3, 2022
@Arvinderpal
Contributor Author

Sorry about the delay. FDs for control-plane (KCP) nodes are already supported within CAPI, and I believe all providers follow that approach.
For worker nodes, there is still some discussion to be had with the broader CAPI community. There is some initial discussion in kubernetes-sigs/cluster-api#5666 and in my PR linked within it.

I think we can start with CP nodes. Any thoughts on the approach I outlined above in the issue description?

@furkatgofurov7
Member

> @furkatgofurov7 CAPV does, via https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereDeploymentZone/v1beta1@v1.0.2 and https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereFailureDomain/v1beta1@v1.0.2
>
> The MAAS provider also supports them, at least in the spec: https://doc.crds.dev/github.com/spectrocloud/cluster-api-provider-maas

@MaxRink thanks.

> Sorry about the delay. FDs for control-plane (KCP) nodes are already supported within CAPI, and I believe all providers follow that approach.
> For worker nodes, there is still some discussion to be had with the broader CAPI community. There is some initial discussion in kubernetes-sigs/cluster-api#5666 and in my PR linked within it.

Got it, thanks for the info; I went through them some time ago.

> I think we can start with CP nodes. Any thoughts on the approach I outlined above in the issue description?

Agree, but I would suggest opening a proposal for community review and discussing the implementation details there, as we usually do for these kinds of new features.

@Arvinderpal
Contributor Author

Thanks @furkatgofurov7
I'll put a proposal together and share it.

@Arvinderpal
Contributor Author

Here is the PR with the proposal: metal3-io/metal3-docs#249
@furkatgofurov7 @MaxRink @Rozzii PTAL

I will bring it up during our next metal3 office hours as well. Thank you

@Arvinderpal
Contributor Author

@furkatgofurov7 @MaxRink @Rozzii PTAL at the proposal: metal3-io/metal3-docs#249
Would appreciate your feedback. Thanks

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 15, 2022
@furkatgofurov7
Member

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2022
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@furkatgofurov7
Member

/remove-lifecycle stale

@Arvinderpal Hi, the proposal for this feature was merged some time ago, thanks for working on it! Are there plans to implement it in CAPM3 soon?

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022
@sf1tzp

sf1tzp commented Dec 12, 2022

Hey @furkatgofurov7 @Arvinderpal, I'd like to give this one a shot if I may. I have a draft PR open at the moment, but I still need to familiarize myself with the testing and polishing requirements of this repo. I hope to get time this week to make some more progress on it.

@furkatgofurov7
Member

@f1tzpatrick hi, sure go ahead!

/unassign @Arvinderpal
/assign @f1tzpatrick

@metal3-io-bot metal3-io-bot assigned sf1tzp and unassigned Arvinderpal Dec 12, 2022
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 12, 2023
@sf1tzp

sf1tzp commented Mar 13, 2023

Hey, sorry for the delay on this one. It's still on my todo list! It's been busy for me lately, but I hope to get this tested sometime soon.

I'll keep you posted! 😃

@furkatgofurov7
Member

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2023
@furkatgofurov7
Member

/lifecycle active

@metal3-io-bot metal3-io-bot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Mar 13, 2023
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. labels Jun 11, 2023
@Rozzii
Member

Rozzii commented Jun 21, 2023

Hi @f1tzpatrick, is this topic still on your TODO?
/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2023
@sf1tzp

sf1tzp commented Jun 21, 2023

Hey @Rozzii, it is, but I'm sorry it keeps getting pushed to the back burner. I made some progress in #793 but could use a hand testing it. The metal3-dev-env is still new to me and I haven't had enough time to really sit down and go through the process.

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2023
@Rozzii
Member

Rozzii commented Sep 20, 2023

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2023
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2023
@Rozzii
Member

Rozzii commented Dec 19, 2023

/remove-lifecycle stale
/lifecycle frozen
I will move this to frozen; this seems to be a legitimate feature, but it keeps going in and out of stale.

@metal3-io-bot metal3-io-bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 19, 2023
@sf1tzp

sf1tzp commented Jan 11, 2024

@Rozzii thanks, and sorry for the inconvenience. If I get another chance to return to this in 2024 I'll let you know

Projects
Status: CAPM3 on hold / blocked
Development

No branches or pull requests

6 participants