Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute AZs should not be used as Volume AZs #1649

Closed
stephenfin opened this issue Aug 23, 2023 · 11 comments · Fixed by #2008
Closed

Compute AZs should not be used as Volume AZs #1649

stephenfin opened this issue Aug 23, 2023 · 11 comments · Fixed by #2008
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@stephenfin
Copy link
Contributor

/kind feature

Describe the solution you'd like

In #1030, we added support for defining root volume AZs, closing #1021. When this was implemented, a decision was made to use the MachineSpec's failureDomain attribute to populate rootVolume.availabilityZone if the latter was not configured. This feels like an anti-pattern. Compute AZs are not Volume AZs, and while they may be identical in smaller, simpler deployments with a single AZ (Cinder's default AZ is nova to allow this) or hyper-converged deployments, where nova-compute and cinder-volume run side-by-side on hosts, it's unlikely to be the case elsewhere. I've seen this pop up already in different deployments.

Is there any reason why we can't default to no implicit AZ or at least validate the implicit AZ exists?

Anything else you would like to add:

I'm not sure if this change in behavior would need to be placed behind an API version or not. Would welcome input in any case.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 23, 2023
@mdbooth
Copy link
Contributor

mdbooth commented Aug 30, 2023

Before #1030 we let Nova create the root volume for us. I'm pretty sure my intention at the time was to copy the behaviour of Nova to minimise the change involved, so I certainly believed that Nova would explicitly request a cinder volume in an AZ with the same name as the Nova AZ if a Nova AZ was given. I can't remember how deeply I checked that at the time, though.

However, there's no reason we can't change it now. I believe it would be a breaking API change, so if we think this is important to change let prioritise it so we get it in before v1beta1.

We could do something like this in the current version:

  rootVolume:
    size: 50
    ignoreMachineAZ: true

In the next version we could flip the default and add:

  rootVolume:
    size: 50
    useMachineAZ: true

@mdbooth
Copy link
Contributor

mdbooth commented Sep 7, 2023

I think we could be even simpler than this. On up-conversion, if the AZ was not previously given we hydrate it to... we can't do that because the AZ is in the Machine object not the OpenStackMachine object, so isn't available during conversion :( As you were.

@stephenfin
Copy link
Contributor Author

stephenfin commented Sep 11, 2023

Before #1030 we let Nova create the root volume for us. I'm pretty sure my intention at the time was to copy the behaviour of Nova to minimise the change involved, so I certainly believed that Nova would explicitly request a cinder volume in an AZ with the same name as the Nova AZ if a Nova AZ was given. I can't remember how deeply I checked that at the time, though.

I thought the same until I went and investigated. It turns out this is the behaviour of Nova but only if [cinder] cross_az_attach is set to False. I documented this (and various other bits of AZ-relatd info) in a blog post.

However, there's no reason we can't change it now. I believe it would be a breaking API change, so if we think this is important to change let prioritise it so we get it in before v1beta1.

We could do something like this in the current version:

  rootVolume:
    size: 50
    ignoreMachineAZ: true

In the next version we could flip the default and add:

  rootVolume:
    size: 50
    useMachineAZ: true

I think we could be even simpler than this. On up-conversion, if the AZ was not previously given we hydrate it to... we can't do that because the AZ is in the Machine object not the OpenStackMachine object, so isn't available during conversion :( As you were.

Sounds like we have a strategy going forward so. Hurrah!

@EmilienM
Copy link
Contributor

/assign EmilienM

@EmilienM
Copy link
Contributor

/unassign Emilien
I have other tasks I want to tackle first in CAPO, so I'll let someone looking at it.

@EmilienM EmilienM removed their assignment Sep 29, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@mdbooth
Copy link
Contributor

mdbooth commented Jan 30, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2024
@mdbooth
Copy link
Contributor

mdbooth commented Feb 21, 2024

To be clear: the missing feature here is the ability to specify use cinder's default. It is missing because we always default root volume AZ to be the same as the machine AZ if it is not given, which is not always correct. We don't want to have to specify cinder's default, because then we wouldn't be using the default.

The question is which case do we want to 'optimise' for: the case where they are the same, or the case where they are not. i.e. Which should the default be?

Note that the following is not an option:

rootVolume:
    ...
    availabilityZone: <explicitly put machine AZ here>

the reason being that CAPI can vary the control plane machines' failure domains, so we can't hardcode it in the machine spec.

How about we added the following to RootVolume:

AvailabilityZone: string,
// +required
// default=RootVolume
AvailabilityZoneSource: AvailabilityZoneSourceType,

AvailabilityZoneSourceType can be 'RootVolume', 'Machine', or 'Cinder'. It defaults to 'RootVolume' if AvailabilityZone is set on creation, and 'Machine' if it is not, which preserves the current behaviour. Setting it explicitly to 'Cinder' allows the Cinder default AZ to be used.

I don't know if that defaulting is possible without a mutating webhook, btw. I'd have to investigate. It's probably not too complex even if the webhook is required.

With the above:

Uses Machine AZ:

rootVolume:
    ... (no AZ)

Uses explicit AZ:

rootVolume:
    ...
    availabilityZone: my_custom_az

Both of the above preserve the existing behaviour for the same input. The new behaviour is:

rootVolume:
    ...
    availabilityZoneSource: Cinder

This uses the Cinder default, which was not previously possible.

In all cases, the relevant availabilityZoneSource would always be added to the object on creation, so it would always be immediately apparent when fetching the object which AZ was in use.

@mdbooth
Copy link
Contributor

mdbooth commented Feb 21, 2024

Actually, written this way this would be a backwards compatible change. This might not be a rush for v1beta1 after all.

@EmilienM
Copy link
Contributor

Cinder default AZ to be used.

Where would it be defined, or how would we figured this one out?
In OpenStackCluster.Spec?

@mdbooth
Copy link
Contributor

mdbooth commented Feb 21, 2024

Cinder default AZ to be used.

Where would it be defined, or how would we figured this one out? In OpenStackCluster.Spec?

It would be defined in Cinder. We would not specify any value.

The API problem we have here is that 'no value' has 2 possible resolutions:

  • Explicitly use whatever was defined on the machine
  • Use nothing and let Cinder decide.

We can currently do the former, but not the latter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
Status: Done
5 participants