Support cloud-specific instance storage #1126

Closed
cgwalters opened this issue Nov 18, 2020 · 12 comments

@cgwalters (Member) commented Nov 18, 2020

This is related to an effort I was looking at around making use of instance-local storage in OpenShift 4: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg

Using Ignition to configure the instance store disks works well, e.g. to mount them at /var. But the problem comes in naming and enumerating them. Take the AWS m5d instances (docs): depending on the instance size, instance storage can be 1, 2, or 4 disks. In GCP it's supported to attach up to 9.

As far as I can tell one could generally rely on /dev/nvme0n1 being the boot drive (which we obviously don't want to format) and /dev/nvme1n1 and beyond being the instance storage. The disk IDs make it quite clear:

bash-5.0$ ls -al /dev/disk/by-id/
total 0
drwxr-xr-x. 2 root root 360 Nov 18 21:20 .
drwxr-xr-x. 8 root root 160 Nov 18 21:20 ..
lrwxrwxrwx. 1 root root  13 Nov 18 21:20 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS2DB46DE31B58F726F -> ../../nvme1n1
lrwxrwxrwx. 1 root root  13 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4 -> ../../nvme0n1
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part4 -> ../../nvme0n1p4
...
[bound] -bash-5.0$ 

But...that 2DB46DE31B58F726F value is dynamic.

Anyways, one idea is to directly support this in Ignition:

{
  "ignition": { "version": "3.0.0" },
  "storage": {
    "instance-disks": "stripe",
    "filesystems": [{
      "device": "/dev/ignition/instance-storage",
      "path": "/var",
      "format": "xfs",
      "label": "DATA"
    }]
  },
  "systemd": {
    "units": [{
      "name": "var.mount",
      "enabled": true,
      "contents": "[Mount]\nWhat=/dev/ignition/instance-storage\nWhere=/var\nType=xfs\n\n[Install]\nWantedBy=local-fs.target"
    }]
  }
}

This would automatically find all instance-local disks and use RAID0 if appropriate (or just match the single block device directly).

Now clearly the MCO and machineAPI (for example) could be set up to pass correct Ignition userdata to the instance depending on its type...but that requires exact coordination between the thing provisioning the VM and the provided Ignition, and also encoding an understanding of instance types into the thing rendering the Ignition (in the AWS case).
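
For reference, here's a rough sketch (untested, and assuming a fixed shape of exactly two instance-store disks at /dev/nvme1n1 and /dev/nvme2n1) of what that per-instance-type config looks like today with the existing spec, using the stock storage.raid support:

{
  "ignition": { "version": "3.0.0" },
  "storage": {
    "raid": [{
      "name": "instance-storage",
      "level": "raid0",
      "devices": ["/dev/nvme1n1", "/dev/nvme2n1"]
    }],
    "filesystems": [{
      "device": "/dev/md/instance-storage",
      "path": "/var",
      "format": "xfs",
      "label": "DATA"
    }]
  },
  "systemd": {
    "units": [{
      "name": "var.mount",
      "enabled": true,
      "contents": "[Mount]\nWhat=/dev/md/instance-storage\nWhere=/var\nType=xfs\n\n[Install]\nWantedBy=local-fs.target"
    }]
  }
}

The device list is baked in, which is exactly why whatever renders the config has to know the instance type up front.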

I suspect support for striping would cover 90% of cases and allow people to use a common Ignition config for multiple scenarios.

But it would add more cloud specifics into Ignition.

Another approach is basically to punt and not use Ignition partitioning: have a systemd unit that runs in the real root, is cloud-aware, and e.g. generates a mount unit for just /var/lib/containers as opposed to all of /var. But supporting /var as instance storage is so much more elegant.

@cgwalters (Member Author) commented Nov 18, 2020

This problem domain clearly generalizes to e.g. bare metal scenarios with heterogeneous server hardware, where one wants to be able to say something dynamic like "RAID 1 all drives you see matching this set of hardware vendors".

@cgwalters (Member Author)

One possibility I guess would be to go to a "two phase" approach where the instance boots in an ephemeral mode (tmpfs on /etc and /var), runs arbitrary code to inspect the system, generates an Ignition config, drops it into /boot, and then reruns Ignition. But a key use case for instance-local storage is autoscaling preemptible VMs for ephemeral workloads, and the extra reboot is kind of eww for that.

@bgilbert (Contributor)

Ignition generally does exactly what it's told to do, and doesn't automatically detect things. But Afterburn is all about querying the cloud platform for instance metadata. Spitballing a differently-hacky idea: add an Afterburn mode that runs before Ignition config fetch, generates an Ignition config fragment, and drops it in the base config directory. That mode would (currently) have to run before config fetch, so the config fragment couldn't be based on any user-provided configuration. However, it could aggregate all the instance disks into a RAID with a well-known name. The user config could then change the RAID level, if desired, and put a filesystem on top.
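
To make that concrete, the generated fragment might look something like this (purely illustrative; nothing like this exists today, and the default level here is just a guess):

{
  "ignition": { "version": "3.0.0" },
  "storage": {
    "raid": [{
      "name": "instance-storage",
      "level": "raid0",
      "devices": ["/dev/nvme1n1", "/dev/nvme2n1"]
    }]
  }
}

The user config would then reference /dev/md/instance-storage for its filesystem, overriding the RAID level if desired.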

Downside: the automatically-generated RAID would preclude the user from putting filesystems directly on individual instance disks.

@cgwalters (Member Author)

Hmm at least in AWS it doesn't seem like the instance store devices are in the metadata; they just show up as block devices to the instance.

Further, we can't hardcode a policy in afterburn; it needs to be supported for something else to use the instance storage (e.g. Ceph, a database cache, etc.). It'd be a backwards-incompatible change for us to default to consuming it.

Perhaps afterburn could try to gather a convenient list of block devices (something like symlinks in /dev/coreos/instance-storage/) and then...if we had glob support in Ignition (something like "if passed 1 block device, just pass it through, otherwise raid0") then the Ignition config could use that?

The block level aspect makes this much more Ignition than Afterburn though I think.

> Ignition generally does exactly what it's told to do, and doesn't automatically detect things.

Yeah I know. But...ugly tradeoffs abound. This would be very "cloud native" at least.

@bgilbert (Contributor)

> Hmm at least in AWS it doesn't seem like the instance store devices are in the metadata; they just show up as block devices to the instance.

It appears that non-NVMe devices should show up in instance metadata, and NVMe devices can be distinguished by device model.
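
For reference, the model is visible directly from the instance without any metadata call; e.g. on an m5d (abbreviated, illustrative output):

$ lsblk -d -o NAME,MODEL
NAME    MODEL
nvme0n1 Amazon Elastic Block Store
nvme1n1 Amazon EC2 NVMe Instance Storage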

> Perhaps afterburn could try to gather a convenient list of block devices (something like symlinks in /dev/coreos/instance-storage/) and then...if we had glob support in Ignition (something like "if passed 1 block device, just pass it through, otherwise raid0") then the Ignition config could use that?

Yeah, I'm not immediately seeing a clean solution. It seems worth more discussion, though. I think the approach most consistent with Ignition's design is to say "the Ignition config is expected to understand any hardware it wants to configure", but as you say, the rest of the stack may not be equipped to deal with that.

@cgwalters (Member Author)

Hum. I guess at least for OpenShift, the fact that we always perform an OS update+reboot means we could wedge this whole thing into the MCO or in custom Ignition to start, basically blow away + remount /var/lib/containers and /var/lib/etcd on the firstboot.

We can experiment with that and if successful try to drive it into base CoreOS.
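
As a strawman for that experiment (completely untested; assumes a single instance-store disk at /dev/nvme1n1 and only covers /var/lib/containers), the injected units could look roughly like:

# var-lib-containers-format.service (hypothetical name)
[Unit]
Description=Format instance storage for /var/lib/containers
DefaultDependencies=no
ConditionFirstBoot=yes
Requires=dev-nvme1n1.device
After=dev-nvme1n1.device
Before=var-lib-containers.mount

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/mkfs.xfs -f -L containers /dev/nvme1n1

[Install]
RequiredBy=var-lib-containers.mount

# var-lib-containers.mount
[Mount]
What=/dev/disk/by-label/containers
Where=/var/lib/containers
Type=xfs

[Install]
WantedBy=local-fs.target

Multiple instance disks would need an mdadm step before the mkfs, which is exactly the per-instance-type knowledge we'd rather not hardcode.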

> It appears that non-NVMe devices should show up in instance metadata,

Yeah, the ones I'm interested in here are NVMe.

> and NVMe devices can be distinguished by device model.

Right, but...hm, I guess maybe we could add "model matching" into Ignition? That could be generic enough to work across bare metal too.
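
Purely as a strawman (no such field exists in the spec today; the name is made up), "model matching" might be a device selector accepted wherever a device path is today:

{
  "storage": {
    "raid": [{
      "name": "instance-storage",
      "level": "raid0",
      "deviceSelector": { "model": "Amazon EC2 NVMe Instance Storage" }
    }]
  }
}

which is essentially a tiny subset of a udev match expressed in the config.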

@arithx (Contributor) commented Nov 19, 2020

> Hum. I guess at least for OpenShift, the fact that we always perform an OS update+reboot means we could wedge this whole thing into the MCO or in custom Ignition to start, basically blow away + remount /var/lib/containers and /var/lib/etcd on the firstboot.

At least for the IPI case OCP would know exactly what storage is present in the instance type / it has configured to be added and could automatically generate the relevant config snippet to do this under the current Ignition model (without the additional symlinks).

@cgwalters (Member Author)

> At least for the IPI case OCP would know exactly what storage is present in the instance type / it has configured to be added and could automatically generate the relevant config snippet to do this under the current Ignition model (without the additional symlinks).

It could know, but it doesn't today, and fixing that is nontrivial. We currently have a single pointer config applied to all instance types. The thing provisioning VMs (https://github.com/openshift/machine-api-operator) is distinct from the thing generating Ignition configs (https://github.com/openshift/machine-config-operator/), with just a few links between them. We absolutely could rearchitect this; that's openshift/machine-config-operator#1619

I thought about this more, though, and agree that taking that direction long term would be cleaner.

We might need some better mechanisms in either Ignition or CoreOS to do the "matching"; maybe a udev rule that generates e.g. /dev/aws/instancestore0 and /dev/aws/instancestore1 or so. A generalization of this would be Ignition having something like a "query language" around block devices; kind of like a tiny subset of what udev rules allow. You can see some of the queries invented in https://github.com/cgwalters/coreos-cloud-instance-store-provisioner/blob/azure/src/main.rs#L191

Anyway, closing based on the above for now.

@lucab (Contributor) commented Nov 25, 2020

> We might need some better mechanisms in either Ignition or CoreOS to do the "matching"; maybe a udev rule that generates e.g. /dev/aws/instancestore0 and /dev/aws/instancestore1 or so

For context, we already have cases where we do this kind of cloud-specific symlinking for block devices in the initramfs via udev rules.
A few of those rules rely on external bash scripts, as the conditional logic is usually quite simple. If it gets more complex, it can be stuffed into Afterburn too.
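
A minimal sketch of what such a rule could look like for the AWS case (illustrative only; the file name and symlink layout are invented here):

# /usr/lib/udev/rules.d/90-aws-instance-store.rules
KERNEL=="nvme*n*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon EC2 NVMe Instance Storage*", SYMLINK+="aws/instance-store-%k"

That yields stable names like /dev/aws/instance-store-nvme1n1 regardless of disk count; strictly sequential names (instancestore0, instancestore1, ...) would need a small helper script via PROGRAM=, which is the bash-script pattern mentioned above.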

cgwalters added commits to cgwalters/fedora-coreos-docs that referenced this issue Dec 15, 2020
jlebon pushed a commit to coreos/fedora-coreos-docs that referenced this issue Dec 15, 2020
@jlebon (Member) commented Jul 13, 2022

I think conceptually this was moved to e.g. coreos/fedora-coreos-tracker#1122, coreos/fedora-coreos-tracker#601, coreos/fedora-coreos-tracker#1165, etc...

@cgwalters (Member Author)

Yeah. Also worth noting, though: today we have a Live ISO, in which one can execute completely arbitrary code before installing to disk; specifically, one can inspect the hardware and e.g. dynamically generate Ignition that is passed to coreos-installer.

The Live ISO is not necessarily ergonomic to use in all clouds though.
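
For completeness, that flow is roughly (sketch only; the device detection and the install target are illustrative, and generated.ign stands in for whatever config gets rendered):

# in the live environment, before installing
instance_disks=$(lsblk -dno NAME,MODEL | awk '/Instance Storage/ {print "/dev/"$1}')
# ...use $instance_disks to render the storage/raid sections of generated.ign, e.g. via butane or jq...
coreos-installer install /dev/sda --ignition-file generated.ign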
