
Make it possible to change infra machine naming scheme #10463

Closed
lentzi90 opened this issue Apr 18, 2024 · 40 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/backlog: Higher priority than priority/awaiting-more-evidence.

Comments

@lentzi90
Contributor

lentzi90 commented Apr 18, 2024

What would you like to be added (User Story)?

  • As an operator with existing tooling around the infra resources, I would like a way to influence the naming of the infra machines to reduce the impact of the breaking change that was made to how they are named.
  • As an operator I would like to change the naming scheme of the nodes during the lifecycle of the cluster since the end user may want a different scheme than what was used by the operators when creating the KCP or MD.
  • As an operator I would like to name the nodes in different ways than the Machines since the end user may benefit from a different naming scheme than what is used by the operators.

Detailed Description

There is a breaking change in v1.7.0 (#9833) that changes the way objects get their names. Previously infra machines would get their name from the infra machine template, and this would then usually (?) propagate to the hostname and ultimately the node.
The change is that the infra machines now get their name from the machine. This is nice for matching the different objects without looking through provider IDs and references. However, the name also propagates to the node by default, and that is not always ideal. It also means old clusters now get their node names from something that did not influence them in the past, causing confusion for users.

The name also propagates to many infra resources created by the infra provider (in many cases). This means that on upgrade to v1.7.0 we would suddenly get a completely different naming scheme for the cloud resources. We have unfortunately grown used to the old naming scheme and we even have tooling for tracking the cloud resources based on it. With the breaking change, all of this will stop working with no way to get the old behavior back.

We have also gotten used to, and currently rely on, setting the node name prefix through the infra machine template. Since the infra machine template can easily be replaced, it has allowed us to change the naming scheme of the nodes whenever we want through a rollout. The new system is unfortunately less flexible.

The control plane name can never change, so it is impossible* to change the naming scheme of control plane nodes during the lifecycle of the cluster. MachineDeployment names cannot change either, but the MachineDeployments can at least be replaced, which is still less convenient than a simple rollout.

For long-lived existing clusters, the breaking change means that the node names, hostnames, infra machine names and infrastructure resource names change permanently without a way to opt out. The KCP names did not influence the naming previously, which means they may in some cases be "improper names" that cannot be fixed at this point without a rebuild of the cluster.

I think it is quite reasonable to want a different naming scheme for the nodes than for the CAPI objects or at least some way of changing the naming scheme. The end user that interacts with the workload cluster may not always even be aware of how that cluster is managed. It is then nice to be able to set meaningful node names for this user and change the scheme during the lifecycle of the cluster.

*) It may be possible to change the configuration and in that way impact the node names. With a KubeadmControlPlane for example, it may be possible to set the spec.kubeadmConfigSpec.joinConfiguration.nodeRegistration.name. It remains unclear if it is possible to achieve something like a prefix + random suffix in this way though.
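
For illustration, a minimal sketch of what that footnote describes, assuming a KubeadmControlPlane (the prefix value is made up, `{{ local_hostname }}` is substituted by cloud-init as in the KubeadmConfigTemplate example later in this thread, and whether this gives enough control is exactly the open question above):

apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: kcp
spec:
  kubeadmConfigSpec:
    initConfiguration:
      nodeRegistration:
        # hypothetical prefix for the first control plane node
        name: 'my-prefix-{{ local_hostname }}'
    joinConfiguration:
      nodeRegistration:
        # same hypothetical prefix for control plane nodes joining later
        name: 'my-prefix-{{ local_hostname }}'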

Anything else you would like to add?

I was tempted to label it a bug first since it limits functionality that was there before, but I do understand that this is intended behavior and was correctly marked as a breaking change.

One possible way to allow changing the naming scheme, would be to add generateName so that we could set this in the template and get generated names based on it.
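
As a purely hypothetical illustration of that suggestion (generateName is standard Kubernetes metadata, but CAPI does not currently honor it from the template for machine naming; the kind and names below are made up):

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OpenStackMachineTemplate
metadata:
  name: cluster-a-control-plane
spec:
  template:
    metadata:
      # hypothetical: if this were honored, infra machines would be created
      # as cluster-a-<random suffix> instead of inheriting the Machine name
      generateName: cluster-a-
    spec: {}  # provider-specific fields omitted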

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot added the kind/feature label on Apr 18, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-triage label on Apr 18, 2024
@lentzi90
Contributor Author

Some further investigation shows that it is possible to change the node names through the kubeadm config, as mentioned above.
Here is an example:

apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: lentzi90-test
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          name: 'my-prefix-{{ local_hostname }}'

This allows changing the names for existing KCPs and MDs, but it is rather limited compared to the old behavior.
The node names need to be unique, so some metadata that is unique to the node has to be included; in many cases there will not be much available other than the hostname.
This method can at least be used to set a prefix or suffix for the nodes, but the hostname will then still be part of the name.

@fabriziopandini
Member

fabriziopandini commented Apr 18, 2024

I'm sorry the PR caused issues for you.
There are a couple of issues to be addressed here; perhaps the most important is that there is no "contract" about how node names are generated. This has never been implemented because it depends on the cloud provider, think e.g. AWS. Also, some providers decided to name infrastructure objects starting from the machine, others starting from the infrastructure machine, others by concatenating with the cluster name. We can definitely do better.

Going back to the use case, frankly speaking, having different naming conventions for MD and for the underlying nodes seems confusing to me.

However, we can think a little bit more about this and eventually implement something like https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/write-clusterclass#clusterclass-with-custom-naming-strategies and introduce a sort of custom naming strategies for machines under KCP, MD (MP?).
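
For reference, the linked ClusterClass mechanism looks roughly like this (field names and template variables per the linked docs; the class name and concrete templates are only examples, and required references are omitted for brevity); the idea here would be to offer something similar for the Machines created by KCP / MD:

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: example-clusterclass
spec:
  controlPlane:
    # controls the name of the generated control plane object (e.g. KubeadmControlPlane)
    namingStrategy:
      template: "{{ .cluster.name }}-cp-{{ .random }}"
  workers:
    machineDeployments:
    - class: default-worker
      # controls the name of the MachineDeployments generated for this class
      namingStrategy:
        template: "{{ .cluster.name }}-md-{{ .machineDeployment.topologyName }}-{{ .random }}"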

Another option is to solve this at the infrastructure provider level, by providing an option for naming VMs created on the cloud provider (I kind of prefer this option)

Another take-away is to improve communication about these kinds of changes; we currently always discuss them at office hours, and we recently started surfacing them in release notes starting from the alpha and beta releases, but this is probably not enough

/priority backlog

@k8s-ci-robot added the priority/backlog label on Apr 18, 2024
@lentzi90
Contributor Author

I have gotten more information, and it is even worse than I thought.
Downstream tooling and customers rely on the names of the underlying resources (OpenStack VMs, ports, volumes, etc.). An upgrade to v1.7.0 breaks all of it and there is no way around it. I guess we will be stuck on v1.6 for the time being. 🙁

One example of how this impacts downstream is that we have scripts in CI that ensure cleanup of all resources. These scripts look for specifically named resources and delete them when the tests are done. Because of this change they do not work anymore.

@lentzi90 changed the title from "Make it possible to change node naming scheme" to "Make it possible to change infra machine naming scheme" on Apr 22, 2024
@neolit123
Member

feature gates are supposed to be ephemeral until a feature graduates, is it expected from users and customers to eventually migrate to the new way? if no, then this has to be resolved in a different way.

@neolit123
Member

Going back to the use case, frankly speaking, having different naming conventions for MD and for the underlying nodes seems confusing to me.

agreed.

However, we can think a little bit more about this and eventually implement something like https://cluster-api.sigs.k8s.io/tasks/experimental-features/cluster-class/write-clusterclass#clusterclass-with-custom-naming-strategies and introduce a sort of custom naming strategies for machines under KCP, MD (MP?).

+1 to this approach.

Another take-away is to improve communication about these kinds of changes; we currently always discuss them at office hours, and we recently started surfacing them in release notes starting from the alpha and beta releases, but this is probably not enough

release notes should be the source of truth. people miss zoom calls and don't watch the recording.

@neolit123
Member

I have gotten more information, and it is even worse than I thought.
Downstream tooling and customers rely on the names of the underlying resources (OpenStack VMs, ports, volumes, etc.). An upgrade to v1.7.0 breaks all of it and there is no way around it. I guess we will be stuck on v1.6 for the time being. 🙁

One example of how this impacts downstream is that we have scripts in CI that ensure cleanup of all resources. These scripts look for specifically named resources and delete them when the tests are done. Because of this change they do not work anymore.

is that all the breakage or is there something more serious? this seems like something that can be worked around and said scripts can be more flexible. naming is not an API and there is no contract as Fabrizio pointed out.

@lentzi90
Contributor Author

feature gates are supposed to be ephemeral until a feature graduates, is it expected from users and customers to eventually migrate to the new way? if no, then this has to be resolved in a different way.

Yes, this is my attempt at a "nicer way to break the naming". I.e. give people some time to adapt while still having the old behavior available. We can completely remove it in the next minor release. I'm hoping we can take this in as a patch so that affected people are still able to upgrade to v1.7 without first adapting to the new naming scheme.

@lentzi90
Contributor Author

is that all the breakage or is there something more serious? this seems like something that can be worked around and said scripts can be more flexible. naming is not an API and there is no contract as Fabrizio pointed out.

I would say it is quite serious since it affects existing, long-lived clusters. Imagine something like this:

You have 2 clusters named Cluster-A and Cluster-B. Both have control-planes named KCP. They may have been around for years already and the control-plane name is obviously not possible to change. The InfraMachineTemplates are named with prefixes cluster-a- and cluster-b-, which allows for easy tracking of what cluster all InfraMachines belong to.
"Coincidentally", all the related infrastructure providers resources belonging to Cluster-A got names with prefix cluster-a- and were very easy to track.

Upgrading to v1.7.0 means control-plane InfraMachines for both clusters are suddenly named kcp-* with no way of changing it except creating a new cluster. All the related infrastructure providers resources are also now named kcp-* and very hard to track.

I understand that the naming of the nodes and these resources are not part of the API, but please understand that people have gotten used to the way it was and may need time to adapt.

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

https://www.hyrumslaw.com/

@fabriziopandini
Member

fabriziopandini commented Apr 24, 2024

I understand that this is causing stress and issues. We stated in our manifests that we would try to evolve responsibly, and I'm happy to chat about options for how to get around what happened, if possible.

However, as a project we have the right and the need to change, and we put a lot of effort into making clear what guarantees we offer, and we work as hard as possible to continuously improve them in a transparent way.

This issue is another reminder of how important it is for the long-term success of the project that users collaborate with the project to continuously improve our definitions of contracts & guarantees.

WRT this specific issue, we opened #7030 some time ago to define guidelines about naming for infrastructure objects, but unfortunately it never got traction, and so here we are.

And it is proof that assuming something is a contract just because it has been there for some time is simply wrong, fragile, and not a sustainable assumption for any project.

@lentzi90
Contributor Author

I understand and agree with you @fabriziopandini . I just wish we could for now soften the impact of the breaking change with something like #10505. That would allow us time to adapt without being stuck on v1.6. That PR is not meant to provide a permanent way to get back the old behavior. It is just to let us have it for one more minor release, hidden behind a deprecated and by default disabled feature flag. Is that too much to ask?

@sbueringer
Member

sbueringer commented Apr 25, 2024

We should probably split the discussion in two:

  1. Should we provide a temporary way to keep the 1.6 behavior, and for how long? I think only implementing it on release-1.7 would ensure it's actually temporary and we're not forced to keep it forever.
  2. The need to establish contracts for naming of servers and nodes based on the objects core CAPI creates (e.g. Machine and/or InfraMachine). I think this would require work to achieve consensus across infra providers, probably breaking changes across providers to adhere to this contract, and it might not be possible for all providers.

That being said, I'm personally fine with 1. Especially as we won't be able to achieve 2 quickly for 1.7. I would prefer a very clearly named flag over a feature flag to not send the wrong message. But ultimately that's an implementation detail for me.

(Discussed this topic with Fabrizio and Christian yesterday, don't want to put words in anyone's mouth, so let's give them some time to respond (Fabrizio is back on Monday))

@lentzi90
Contributor Author

lentzi90 commented Apr 25, 2024

Thank you @sbueringer ! I agree and would like to focus on a temporary solution first since I know the second point will take more time.
Flag or feature flag does not matter to me. I did it as a feature flag in the PR just because it seemed to provide a clear deprecation warning 🙂

Edit: Good point about implementing it only on release-1.7! I can change the PR to target that branch

@chrischdi
Member

Thanks @sbueringer , that sums it up pretty nicely.

Regarding 1.:

I'm fine with that too, adding a fallback mode to v1.7.x to enable folks running into issues to migrate over.

The next patch release for v1.7 is scheduled for "Tuesday 14th May 2024" [0].

Regarding 2.:

I think this is complex, sometimes it may be required to have hostname == nodename (e.g. maybe on baremetal?), sometimes it's ok for them to be different, maybe that also depends on the provider and/or CPI implementation used or other factors.
Definitely nothing achievable in short-term.

@guettli
Contributor

guettli commented Apr 25, 2024

I think this is complex, sometimes it may be required to have hostname == nodename (e.g. maybe on baremetal?)

Yes, this would be nice, but this would be a different PR.

@lentzi90
Contributor Author

I replaced the PR with one for release-1.7 now. That way the flag would never even be part of v1.8. It is still a feature flag but I'm very open to change that if there is consensus that it would be better to have something different.

@lentzi90
Contributor Author

Updated the description to clarify better what I'm after here. The main goal I have on the CAPI side is to get a temporary way to get the old behavior back. This is to allow more time to adapt and specifically targeted at old long lived clusters. These clusters would have to be rebuilt in order to get proper node names, hostnames, infra machine names and infrastructure resource names. They currently have identical KCP names because that never influenced anything before.

In the long term, it looks like we will need a feature in CAPO to address the naming of infrastructure resources. It is not relevant to this issue so I won't discuss it anymore here.

@mdbooth
Contributor

mdbooth commented Apr 29, 2024

We should probably split the discussion in two:

  1. Should we provide a temporary way to keep the 1.6 behavior, and for how long? I think only implementing it on release-1.7 would ensure it's actually temporary and we're not forced to keep it forever.

It sounds like we have agreement that we should do this. Ideally this would land well in advance of the next point release. #10511 looks reasonable to me. Given that it's a temporary solution, can we go forward with this? I worry that it will stall if we wait for consensus.

  2. The need to establish contracts for naming of servers and nodes based on the objects core CAPI creates (e.g. Machine and/or InfraMachine). I think this would require work to achieve consensus across infra providers, probably breaking changes across providers to adhere to this contract, and it might not be possible for all providers.

I don't think this is necessary. There was previous discussion around writing some guidelines around this in #7030 which also highlights the variety of trip hazards in cloud resource naming. I think some guidelines for implementors would be welcome, but it is unrealistic to expect that any single naming scheme that CAPI could use will work for all clouds in all circumstances.

However, this is not at the root of this issue.

The de-facto situation is that providers each have their own scheme for naming cloud resources. All the schemes I am personally aware of are based on the name of the InfraMachine. I think that is as far as a CAPI contract needs to go:

Infrastructure resources created by a provider SHOULD be named based on the corresponding CAPI infrastructure resource, modified by a predictable provider-specific scheme which ensures names are valid for the target environment.

I suspect this also captures the current state of providers. This is not the problem. The problem is that given that this is already how we name infrastructure resources, the name of the CAPI infrastructure resource changed, which resulted in a user-visible change.

I am in favour of change #9833. It 'fixes' an inconsistency of naming which is currently a trip hazard: Machine and InfraMachine (and therefore infrastructure resources) have different names. Having all of these things aligned is simpler for both operators and developers.

The problem, therefore, is that the names of infrastructure resources were, in practice, user-visible, and we changed them. @lentzi90's users have been affected by this. His users are not the only users who have organizational expectations around resource naming. Here is another example I am aware of in OpenShift: https://issues.redhat.com/browse/RFE-4901. This refers to CPMS, which is an OpenShift controller, but the problem is analogous and we will want to bring this use case into CAPI shortly in any case. FWIW this user is on Azure. Note these words from the RFE:

Due to the organization compliance needs, it’s required to give specific nomenclature to the nodes.

'Organization compliance needs' cover a wide range of crimes and are often notoriously inflexible. A great many Rube Goldberg machines have been created in order to satisfy 'organization compliance needs'. @lentzi90 is the first reporter, but this is going to come up again across multiple providers.

My immediate concern is that @lentzi90 has proposed kubernetes-sigs/cluster-api-provider-openstack#2035. If we accept this, it will again break the naming alignment which we got from #9833. It will also complicate both the CAPO API and implementation in a way that is entirely custom to OpenStack and will not help Azure users. I am reluctant to add this to our stable API.

I will assert that we want the naming of the following 3 things to be aligned:

  • Machine
  • InfrastructureMachine
  • Underlying infrastructure resources

Given that the Machine is created first, I think the only possible solution which results in these things being aligned is to control the name of Machine at the time it is created. @sbueringer already highlighted the places we need to look for this in #9833: KCP, MachineSets and MachinePools.

I don't have any strong opinion on how this should be achieved. I had an idea that MachineTemplateSpec seems to be a common structure here. Would it be worth adding something to that struct so that whatever method we choose to allow users to affect the naming of resources in their clusters is common to all things which create machines from a template?

Lennart's proposed solution to add a NamePrefix sounds reasonable to me. Could we, for example, add a NamePrefix to MachineTemplateSpec? If not, is there some better way we can allow users to control the naming of Machine objects in their clusters?
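
A purely hypothetical sketch of that idea (namePrefix does not exist in the CAPI API today; its name and placement on MachineTemplateSpec are assumptions, shown on a MachineDeployment only for illustration):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: cluster-a-md-0
spec:
  clusterName: cluster-a
  template:
    # hypothetical addition to MachineTemplateSpec: Machines, and by extension
    # InfraMachines and cloud resources, would get names like
    # cluster-a-worker-<random suffix>
    namePrefix: cluster-a-worker-
    spec:
      clusterName: cluster-a
      # bootstrap and infrastructure references omitted for brevity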

@sbueringer
Member

sbueringer commented Apr 29, 2024

All the schemes I am personally aware of are based on the name of the InfraMachine.

CAPV is using the Machine name for some reason (I don't know why, I'm also not aware how other providers are handling this).

One thing we might want to keep in mind. Most of our users are probably used to how Deployment > ReplicaSet > Pod & MachineDeployment > MachineSet > Machine names line up with each other. This makes it today very easy to figure out which Machine belongs to which MS / MD / KCP object. Once we start allowing folks to have Machine names that are entirely independent of the owning object we will break this easy correlation.

@mdbooth
Contributor

mdbooth commented Apr 29, 2024

All the schemes I am personally aware of are based on the name of the InfraMachine.

CAPV is using the Machine name for some reason (I don't know why, I'm also not aware how other providers are handling this).

That's fine, because with #9833 this is now also consistent with the Infrastructure Machine.

One thing we might want to keep in mind. Most of our users are probably used to how Deployment > ReplicaSet > Pod & MachineDeployment > MachineSet > Machine names line up with each other. This makes it today very easy to figure out which Machine belongs to which MS / MD / KCP object. Once we start allowing folks to have Machine names that are entirely independent of the owning object we will break this easy correlation.

Sure. However, as @lentzi90's issue points out the KCP in particular is immutable for the lifetime of the cluster. If there is ever any reason to change it there is currently no way to achieve that, and #9833 changed it with no way to retain the old behaviour.

Breaking that correlation is also why I'm not keen to accept kubernetes-sigs/cluster-api-provider-openstack#2035, but what other option do affected users have?

Also, a new API should not affect existing users, which is the issue with #9833 that we're trying to retrospectively address. If users are happy with the new naming scheme they can continue to use it. If they have an operational need to specify some other naming scheme they would have the option to specify one. This would be behaviour that they have opted in to, so presumably not a surprise.

I wonder... does this actually need to be generalised beyond KCP? Replacing a MD is inconvenient but unlike a KCP it can be done. Could we add some machine naming scheme parameter to KCP alone?

@mdbooth
Contributor

mdbooth commented Apr 29, 2024

Just in case you didn't have time to read my rather lengthy comment, I'll highlight https://issues.redhat.com/browse/RFE-4901 again. We're going to be asked for more flexibility on this.

@sbueringer
Member

sbueringer commented Apr 29, 2024

Breaking that correlation is also why I'm not keen to accept kubernetes-sigs/cluster-api-provider-openstack#2035, but what other option do affected users have?

Sorry I wasn't trying to say we can't do it, I just wanted to at least mention that there might be a downside.

I wonder... does this actually need to be generalised beyond KCP? Replacing a MD is inconvenient but unlike a KCP it can be done. Could we add some machine naming scheme parameter to KCP alone?

If I got it correctly, the use case @lentzi90 has is that his users can influence the OpenStackMachineTemplate name (via rotation) but not re-create MachineDeployments after they have been initially created.

@mdbooth
Contributor

mdbooth commented Apr 29, 2024

If I got it correctly, the use case @lentzi90 has is that his users can influence the OpenStackMachineTemplate name (via rotation) but not re-create MachineDeployments after they have been initially created.

It was my understanding that the KCP specifically cannot be renamed, but the InfraMachineTemplate used by the KCP can, and they were relying on this for the naming of their resources. The problem they now have is the KCP cannot be renamed without creating a new cluster, and they have many long-lived clusters.

It was my guess that this may not apply to MDs, which presumably can be renamed(replaced), although that may be operationally inconvenient (but not impossible).

@mdbooth
Contributor

mdbooth commented Apr 29, 2024

If we can agree a design for this I can make time to implement it, btw.

@lentzi90
Contributor Author

If I got it correctly, the use case @lentzi90 has is that his users can influence the OpenStackMachineTemplate name (via rotation) but not re-create MachineDeployments after they have been initially created.

It was my understanding that the KCP specifically cannot be renamed, but the InfraMachineTemplate used by the KCP can, and they were relying on this for the naming of their resources. The problem they now have is the KCP cannot be renamed without creating a new cluster, and they have many long-lived clusters.

It was my guess that this may not apply to MDs, which presumably can be renamed(replaced), although that may be operationally inconvenient (but not impossible).

Precisely on point @mdbooth! I think the confusion here comes from my initial description. The first reports we got were indeed about how users can no longer rename nodes, but the core of the issue, the reason why they need that, is exactly what @mdbooth described: they have compliance requirements on the hostnames in the network, on the node names and on the names of infrastructure resources. After the breaking change, everything changed and they were no longer able to change it back, so that is what they reported.

The main use-case is not really to be able to change these names on a whim, it is to have some kind of control over them to begin with. Unfortunately, for long lived clusters that practically means that we need a way to opt out of the breaking change or a way to override the naming of the Machines or infrastructure resources.

KCPs are definitely the biggest issue here as they can never change. MDs can be replaced, although it is inconvenient compared to a rollout.

@fabriziopandini
Member

fabriziopandini commented Apr 30, 2024

Thanks all, this is a valuable discussion.
It seems we all agree that CAPI should have by default "sane" naming conventions, and that we probably need to find ways to allow flexibility if and when required (and we also already did something similar in ClusterClasses).

However, what comes to my mind when reading this thread or when I try to focus on this problem is that we probably have different problems to untangle:

  1. How to generate the name for the infrastructure machine object for machines controlled by KCP or MD (this is what #9833, "⚠️ Objects generated by KCP, MachineSets and MachinePools will now consistently use machine name", changed)

  2. Provide guidelines to infrastructure providers about how infrastructure objects (VMs etc) should be named when provisioning a Machine with a corresponding infrastructure machine object.

  3. Which options exist for influencing how the node hosted on a machine is named, and eventually extending those options.

The reality is that WRT 2 - how infrastructure objects (VMs etc.) are named - providers are fragmented:

  • 2.a Some providers use the machine name (eventually replacing invalid characters according to the infrastructure naming rule)
  • 2.b Some providers use the infrastructure machine name (eventually replacing invalid characters); these providers were affected by the change in 1.
  • 2.c Some providers combine one of the previous two options with the cluster name - and eventually the namespace name - to avoid name clash at the infrastructure level
  • 2.d In theory infrastructure providers can use any other naming schema or even allow custom naming schema definitions on their API, but I'm not sure we should worry about this (it seems like a sort of break glass option).

Unfortunately, we also have fragmentation for 3 (naming nodes), and depending on the approach you rely on, this point might or might not be affected by the naming decisions in 1 and 2; e.g. just focusing on folks relying on kubeadm:

  • 3.a Some providers leave it to kubeadm to pick the node name, which in most providers is derived from the VM name (notable exception: AWS).
  • 3.b Some other providers try to influence the node name by using a combination of kubeadm's nodeRegistration.name, cloud-init metadata + jinja templating (cloud-init metadata varies from provider to provider).
  • 3.c In theory folks can influence kubeadm's nodeRegistration.name in other ways: cluster class patches, webhooks etc. Also in this case, this option seems like a sort of break glass option.

Trying to find a way forward: on one side, I'm tempted to break down all of those three points into distinct discussions.

On the other side, I'm tempted to consider the first one the most urgent, because it could provide "some kind of control over them to begin with", but this leaves me with a concern that we are sacrificing the consistency of the UX inside the management cluster just to avoid dealing with the fragmentation in layers 2 and 3.

Opinions?

@mdbooth
Contributor

mdbooth commented Apr 30, 2024

Thanks, I think that's a good summary.

My immediate concern is that our provider was affected by the change, and one of our users is impacted by the change in a manner which currently prevents them from upgrading to CAPI 1.7. Does it make it more palatable if we confine any solution to just KCP, as that's the only controller we're aware of with no workaround? I'll reiterate my offer to work on this if we can agree a design.

@fabriziopandini
Member

I think that we have an agreement on implementing a stop gap for 1.7; my comment is mostly about looking for a long-term solution to implement, hopefully for 1.8. No strong objection to starting from KCP, but whatever we do, we should ensure consistency down the path.

@mdbooth
Contributor

mdbooth commented Apr 30, 2024

@lentzi90 What are your thoughts?

@lentzi90
Contributor Author

lentzi90 commented May 2, 2024

Sounds good to me also! If we can get a stop gap for 1.7 I would be very happy. I'm also prepared to work on a more long term solution as soon as we can agree on what it would look like 🙂

@lentzi90
Contributor Author

lentzi90 commented May 2, 2024

Do we need anything more to move forward with #10511 ?

@mdbooth
Contributor

mdbooth commented May 2, 2024

@lentzi90 I brought #10511 up in the office hours call last night.

My worry is that the clock is now ticking on CAPI v1.8, and we don't even have an agreement in principle on how to fix this beyond 1.7 yet. I appreciate that adding the flag for 1.7 only was to limit its impact, but I fear we're just setting ourselves up for a panic later if we don't get it merged in time. My preference would be to merge that into main and cherry-pick to 1.7, with a clear agreement to remove it as soon as we have a long-term solution. If we can do that before 1.8 is released then that would be ideal, but I don't think it's a good plan to remove a workaround before we have a fix.

@sbueringer Can you help us get unstuck on this?

@sbueringer
Member

sbueringer commented May 2, 2024

My preference would be to merge that into main and cherry-pick to 1.7, with a clear agreement to remove it as soon as we have a long-term solution.

I have no doubts that we're all highly motivated to get this resolved, but nobody is able to guarantee that we'll reach consensus. If I see correctly we don't even have consensus yet on how much of the problem we have to solve as a next iteration after the flag (just how we generate the name of Machine/InfraMachine/BootstrapConfig or the whole contract thing top to bottom).

My concern is mostly that we'll add a flag and we'll never be able to get rid of it again until we reach consensus on a new solution which might never happen. So if we add a flag, how do we ensure it doesn't stick around forever?

@sbueringer
Member

sbueringer commented May 2, 2024

My problem is that I have seen so many issues / PRs stale out over the years (and for years) in CAPI. And the result is that we're accumulating more and more tech debt all over the place. Also because it's a bunch of work to deprecate and actually remove something; as a random example, we still "support" Clusters without ControlPlanes and all other sorts of strange things.

I know comparatively this is a relatively minor issue here.

One of the following things could help:

  1. Strictly limit the flag to 1.7 & 1.8. This gives us until December to find a solution; I'm pretty sure that if it doesn't happen by then, it will never happen.
  2. Have a consensus in principle on what the long-term solution would look like (so that we only depend on figuring out ~ implementation details, which we should very likely be able to do).

About 2, if there is consensus that we only introduce a field somewhere in KCP, MD & MS to control a name prefix, that's very clearly easy and straightforward to implement. If we go for naming contracts across the hierarchy, this is almost in the category of the load balancer provider proposal (for context: that has been around for years and it's probably never going to happen)

@mdbooth
Contributor

mdbooth commented May 2, 2024

I'd be less worried if we gave ourselves a two-release-cycle runway. I think we can get that done, and if we don't, I guess we'll have to relitigate it.

As for the proposed solution, IIUC the problem controller is KCP, not MD/MS. MD/MS can be replaced without replacing the cluster, but KCP cannot.

@lentzi90 Can you confirm that a solution which was confined to KCP would allow your users to upgrade?

@lentzi90
Contributor Author

lentzi90 commented May 3, 2024

Theoretically, yes KCP is enough. It is quite trivial to include MDs at least for the temporary solution though so I'm not sure I see the benefit of leaving it out. Also, having similar API fields and flags for all controllers seems quite sensible to me, long term or short term.
What would be the goal of leaving out MD/MS?

@mdbooth
Contributor

mdbooth commented May 3, 2024

Theoretically, yes KCP is enough. It is quite trivial to include MDs at least for the temporary solution though so I'm not sure I see the benefit of leaving it out. Also, having similar API fields and flags for all controllers seems quite sensible to me, long term or short term. What would be the goal of leaving out MD/MS?

Nothing specific other than reducing scope. I also agree that in practice it may well make sense to include both.

@lentzi90
Contributor Author

With #10576 merged as a stop-gap solution and #10577 opened for discussing the long-term solution, I guess we can close this, right?

@lentzi90
Contributor Author

/close

@k8s-ci-robot
Contributor

@lentzi90: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
