Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InfrastructureMachineSpec.FailureDomain is required to be a String #8096

Closed
mdbooth opened this issue Feb 13, 2023 · 12 comments
Closed

InfrastructureMachineSpec.FailureDomain is required to be a String #8096

mdbooth opened this issue Feb 13, 2023 · 12 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@mdbooth
Copy link
Contributor

mdbooth commented Feb 13, 2023

While writing kubernetes-sigs/cluster-api-provider-openstack#1466 I encountered this code

var failureDomain string
err = util.UnstructuredUnmarshalField(infraConfig, &failureDomain, "spec", "failureDomain")
switch {
case err == util.ErrUnstructuredFieldNotFound: // no-op
case err != nil:
return ctrl.Result{}, errors.Wrapf(err, "failed to retrieve failure domain from infrastructure provider for Machine %q in namespace %q", m.Name, m.Namespace)
default:
m.Spec.FailureDomain = pointer.String(failureDomain)
}
which attempts to copy the contents of a field called failureDomain in the infrastructure machine spec to the machine spec.

This places an additional constraint on the infrastructure machine spec in that a field called failureDomain, if it exists, must be a string. If this limitation is intentional it is not documented with the rest of the failure domain interface in the cluster API book: https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html

The behaviour of this code is to overwrite MachineSpec.FailureDomain with the value in InfrastructureMachineSpec.FailureDomain if it is defined. There are only 2 cases where this would do anything:

  1. When the infrastructure provider may schedule a machine in a failure domain other than the one which was requested.
  2. When no failure domain was explicitly requested, but we want to communicate to CAPI which one was scheduled by the infrastructure provider anyway.

Hopefully no providers have the behaviour in (1). It would be almost impossible to provide reliable infrastructure in this case.

So my best guess is that the intention was (2)? However, in this case the CAPI controller isn't going to use the failure domain so we've placed a restriction on the infrastructure provider for aesthetic reasons?

  • Does anybody know if any provider is using this behaviour, and why?
  • If no provider is using it, could it be removed?
  • Either way, should it be documented?

Anything else you would like to add:
The specific error produced is:

E0213 10:22:43.407582      11 controller.go:329] "Reconciler error" err="failed to retrieve failure domain from infrastructure provider for Machine \"cluster-mbooth-control-plane-7k725\" in namespace \"mbooth\": failed to json-decode field \"spec.failureDomain\" value from \"infrastructure.cluster.x-k8s.io/v1alpha6, Kind=OpenStackMachine\": json: cannot unmarshal object into Go value of type string" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="mbooth/cluster-mbooth-control-plane-7k725" namespace="mbooth" name="cluster-mbooth-control-plane-7k725" reconcileID=d1a829f9-2485-4ab0-a9dc-01575684fb12

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2023
@mdbooth mdbooth changed the title 🐛InfrastructureSpec.FailureDomain is required to be a String 🐛InfrastructureMachineSpec.FailureDomain is required to be a String Feb 13, 2023
@killianmuldoon
Copy link
Contributor

If I'm reading this right the infrastructureMachine field is not designed to be used with newer versions of the API - it exists for backward compatibility.

From https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure.html:

failureDomain (string): the string identifier of the failure domain the instance is running in for the purposes of backwards compatibility and migrating to the v1alpha3 FailureDomain support (where FailureDomain is specified in Machine.Spec.FailureDomain). This field is meant to be temporary to aid in migration of data that was previously defined on the provider type and providers will be expected to remove the field in the next version that provides breaking API changes, favoring the value defined on Machine.Spec.FailureDomain instead. If supporting conversions from previous types, the provider will need to support a conversion from the provider-specific field that was previously used to the failureDomain field to support the automated migration path.

@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 13, 2023

Ah, so it was an upgrade thing at some point?

Sounds like we should remove it? It's just a trip hazard now.

@killianmuldoon
Copy link
Contributor

Sounds like we should remove it? It's just a trip hazard now.

I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.

@mdbooth
Copy link
Contributor Author

mdbooth commented Feb 13, 2023

I think I will move our 'hydrated' failure domain to the InfrastructureMachineStatus instead. This is a bit of a workaround, but it was a reasonable candidate for somewhere to stash it anyway. It's a bit of a grey area because, while the failure domain is part of the specification of the machine, it's not a field I expect a user to typically enter manually: the machine controller copies the hydrated value based on the reference in the MachineSpec before creating the machine.

The primary issue here is that we need to ensure the InfrastructureMachineSpec remains immutable for its entire lifetime. This is complex to do when you only have a reference to some of it, and the referenced object may be altered or deleted. The simplest solution is to just copy it and ignore the reference thereafter. This also leaves us maximum flexibility for wider scoped solutions in the future.

@sbueringer
Copy link
Member

I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.

Just referencing my comment from here: #8071 (comment)

My plan was to remove the v1alpha3 & v1alpha4 API types and packages in core CAPI. I wasn't planning to touch the contract stuff.

I think we should have a separate discussion & issue about leftovers of previous contracts. In my mind the contract covers how core Cluster API interacts with providers (specifically bootstrap, control plane, infra resources). As far as I can tell a specific Cluster API version only is supposed to support one contract "version" (e.g. v1beta1) at a time. If we still have code in Cluster API today that covers "previous contract versions" I think we now have to consider this as part of the v1beta1 contract. We should figure out what code that exactly is and then I think we should deprecate this behavior ("with the v1beta1 contract") and can eventually get rid of it once we bump the contract to v1beta2.

@fabriziopandini
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2023
@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 16, 2024
@fabriziopandini
Copy link
Member

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2024
@sbueringer sbueringer changed the title 🐛InfrastructureMachineSpec.FailureDomain is required to be a String InfrastructureMachineSpec.FailureDomain is required to be a String Jul 11, 2024
@sbueringer
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2024
@fabriziopandini fabriziopandini added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 31, 2024
@fabriziopandini
Copy link
Member

I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string)

We can improve, but I think this answer the question in the issue above

Orthogonal, as I said many times, there is huge room for improvements in how we manage failure domains...

/close

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini: Closing this issue.

In response to this:

I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string)

We can improve, but I think this answer the question in the issue above

Orthogonal, as I said many times, there is huge room for improvements in how we manage failure domains...

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

6 participants