InfrastructureMachineSpec.FailureDomain is required to be a String #8096

mdbooth · 2023-02-13T10:25:18Z

While writing kubernetes-sigs/cluster-api-provider-openstack#1466 I encountered this code

cluster-api/internal/controllers/machine/machine_controller_phases.go

Lines 305 to 314 in 327aae2

    
           var failureDomain string 
        
           err = util.UnstructuredUnmarshalField(infraConfig, &failureDomain, "spec", "failureDomain") 
        
           switch { 
        
           case err == util.ErrUnstructuredFieldNotFound: // no-op 
        
           case err != nil: 
        
           	return ctrl.Result{}, errors.Wrapf(err, "failed to retrieve failure domain from infrastructure provider for Machine %q in namespace %q", m.Name, m.Namespace) 
        
           default: 
        
           	m.Spec.FailureDomain = pointer.String(failureDomain) 
        
           }

which attempts to copy the contents of a field called failureDomain in the infrastructure machine spec to the machine spec.

This places an additional constraint on the infrastructure machine spec in that a field called failureDomain, if it exists, must be a string. If this limitation is intentional it is not documented with the rest of the failure domain interface in the cluster API book: https://cluster-api.sigs.k8s.io/developer/providers/cluster-infrastructure.html

The behaviour of this code is to overwrite MachineSpec.FailureDomain with the value in InfrastructureMachineSpec.FailureDomain if it is defined. There are only 2 cases where this would do anything:

When the infrastructure provider may schedule a machine in a failure domain other than the one which was requested.
When no failure domain was explicitly requested, but we want to communicate to CAPI which one was scheduled by the infrastructure provider anyway.

Hopefully no providers have the behaviour in (1). It would be almost impossible to provide reliable infrastructure in this case.

So my best guess is that the intention was (2)? However, in this case the CAPI controller isn't going to use the failure domain so we've placed a restriction on the infrastructure provider for aesthetic reasons?

Does anybody know if any provider is using this behaviour, and why?
If no provider is using it, could it be removed?
Either way, should it be documented?

Anything else you would like to add:
The specific error produced is:

E0213 10:22:43.407582      11 controller.go:329] "Reconciler error" err="failed to retrieve failure domain from infrastructure provider for Machine \"cluster-mbooth-control-plane-7k725\" in namespace \"mbooth\": failed to json-decode field \"spec.failureDomain\" value from \"infrastructure.cluster.x-k8s.io/v1alpha6, Kind=OpenStackMachine\": json: cannot unmarshal object into Go value of type string" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="mbooth/cluster-mbooth-control-plane-7k725" namespace="mbooth" name="cluster-mbooth-control-plane-7k725" reconcileID=d1a829f9-2485-4ab0-a9dc-01575684fb12

/kind bug

The text was updated successfully, but these errors were encountered:

killianmuldoon · 2023-02-13T10:35:46Z

If I'm reading this right the infrastructureMachine field is not designed to be used with newer versions of the API - it exists for backward compatibility.

From https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure.html:

failureDomain (string): the string identifier of the failure domain the instance is running in for the purposes of backwards compatibility and migrating to the v1alpha3 FailureDomain support (where FailureDomain is specified in Machine.Spec.FailureDomain). This field is meant to be temporary to aid in migration of data that was previously defined on the provider type and providers will be expected to remove the field in the next version that provides breaking API changes, favoring the value defined on Machine.Spec.FailureDomain instead. If supporting conversions from previous types, the provider will need to support a conversion from the provider-specific field that was previously used to the failureDomain field to support the automated migration path.

mdbooth · 2023-02-13T10:44:58Z

Ah, so it was an upgrade thing at some point?

Sounds like we should remove it? It's just a trip hazard now.

killianmuldoon · 2023-02-13T10:46:24Z

Sounds like we should remove it? It's just a trip hazard now.

I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.

mdbooth · 2023-02-13T11:04:07Z

I think I will move our 'hydrated' failure domain to the InfrastructureMachineStatus instead. This is a bit of a workaround, but it was a reasonable candidate for somewhere to stash it anyway. It's a bit of a grey area because, while the failure domain is part of the specification of the machine, it's not a field I expect a user to typically enter manually: the machine controller copies the hydrated value based on the reference in the MachineSpec before creating the machine.

The primary issue here is that we need to ensure the InfrastructureMachineSpec remains immutable for its entire lifetime. This is complex to do when you only have a reference to some of it, and the referenced object may be altered or deleted. The simplest solution is to just copy it and ignore the reference thereafter. This also leaves us maximum flexibility for wider scoped solutions in the future.

sbueringer · 2023-02-14T13:14:44Z

I think this will be part of the conversation around removing the v1alpha3 API types in #8071. Part of that discussion involves removing old code which only exists for migrating and dealing with issues on older API types.

Just referencing my comment from here: #8071 (comment)

My plan was to remove the v1alpha3 & v1alpha4 API types and packages in core CAPI. I wasn't planning to touch the contract stuff.

I think we should have a separate discussion & issue about leftovers of previous contracts. In my mind the contract covers how core Cluster API interacts with providers (specifically bootstrap, control plane, infra resources). As far as I can tell a specific Cluster API version only is supposed to support one contract "version" (e.g. v1beta1) at a time. If we still have code in Cluster API today that covers "previous contract versions" I think we now have to consider this as part of the v1beta1 contract. We should figure out what code that exactly is and then I think we should deprecate this behavior ("with the v1beta1 contract") and can eventually get rid of it once we bump the contract to v1beta2.

fabriziopandini · 2023-02-16T22:24:54Z

/triage accepted

k8s-triage-robot · 2024-02-16T22:49:56Z

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

fabriziopandini · 2024-04-11T19:40:14Z

/priority important-longterm

k8s-triage-robot · 2024-07-10T19:57:23Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sbueringer · 2024-07-11T06:02:36Z

/remove-lifecycle stale

fabriziopandini · 2024-07-31T14:27:23Z

I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string)

We can improve, but I think this answer the question in the issue above

Orthogonal, as I said many times, there is huge room for improvements in how we manage failure domains...

/close

k8s-ci-robot · 2024-07-31T14:27:27Z

@fabriziopandini: Closing this issue.

In response to this:

I just checked https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure and failure domain is defined as failureDomain (string)

We can improve, but I think this answer the question in the issue above

Orthogonal, as I said many times, there is huge room for improvements in how we manage failure domains...

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2023

mdbooth changed the title ~~🐛InfrastructureSpec.FailureDomain is required to be a String~~ 🐛InfrastructureMachineSpec.FailureDomain is required to be a String Feb 13, 2023

killianmuldoon mentioned this issue Feb 13, 2023

🌱 Deprecate v1alpha3 & v1alpha4 #8071

Merged

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2023

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 16, 2024

k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 10, 2024

sbueringer changed the title ~~🐛InfrastructureMachineSpec.FailureDomain is required to be a String~~ InfrastructureMachineSpec.FailureDomain is required to be a String Jul 11, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 11, 2024

fabriziopandini added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 31, 2024

k8s-ci-robot closed this as completed Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

InfrastructureMachineSpec.FailureDomain is required to be a String #8096

InfrastructureMachineSpec.FailureDomain is required to be a String #8096

mdbooth commented Feb 13, 2023

killianmuldoon commented Feb 13, 2023

mdbooth commented Feb 13, 2023

killianmuldoon commented Feb 13, 2023

mdbooth commented Feb 13, 2023

sbueringer commented Feb 14, 2023

fabriziopandini commented Feb 16, 2023

k8s-triage-robot commented Feb 16, 2024

fabriziopandini commented Apr 11, 2024

k8s-triage-robot commented Jul 10, 2024

sbueringer commented Jul 11, 2024

fabriziopandini commented Jul 31, 2024

k8s-ci-robot commented Jul 31, 2024

InfrastructureMachineSpec.FailureDomain is required to be a String #8096

InfrastructureMachineSpec.FailureDomain is required to be a String #8096

Comments

mdbooth commented Feb 13, 2023

killianmuldoon commented Feb 13, 2023

mdbooth commented Feb 13, 2023

killianmuldoon commented Feb 13, 2023

mdbooth commented Feb 13, 2023

sbueringer commented Feb 14, 2023

fabriziopandini commented Feb 16, 2023

k8s-triage-robot commented Feb 16, 2024

fabriziopandini commented Apr 11, 2024

k8s-triage-robot commented Jul 10, 2024

sbueringer commented Jul 11, 2024

fabriziopandini commented Jul 31, 2024

k8s-ci-robot commented Jul 31, 2024