Support cloud-specific instance storage #1126

Closed
cgwalters opened this issue Nov 18, 2020 · 12 comments

@cgwalters (Member) commented Nov 18, 2020

This is related to an effort I was looking at around making use of instance-local storage in OpenShift 4: https://hackmd.io/dTUvY7BIQIu_vFK5bMzYvg

Using Ignition to configure the instance store disks works well, e.g. to mount them at /var. But the problem comes in naming and enumerating them. Take the AWS m5d instances (docs): depending on the instance size, instance storage can be 1, 2, or 4 disks. In GCP it's supported to attach up to 9.

As far as I can tell one could generally rely on /dev/nvme0n1 being the boot drive (which we obviously don't want to format) and /dev/nvme1n1 and beyond being the instance storage. The disk IDs make it quite clear:

bash-5.0$ ls -al /dev/disk/by-id/
total 0
drwxr-xr-x. 2 root root 360 Nov 18 21:20 .
drwxr-xr-x. 8 root root 160 Nov 18 21:20 ..
lrwxrwxrwx. 1 root root  13 Nov 18 21:20 nvme-Amazon_EC2_NVMe_Instance_Storage_AWS2DB46DE31B58F726F -> ../../nvme1n1
lrwxrwxrwx. 1 root root  13 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4 -> ../../nvme0n1
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part1 -> ../../nvme0n1p1
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part2 -> ../../nvme0n1p2
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part3 -> ../../nvme0n1p3
lrwxrwxrwx. 1 root root  15 Nov 18 21:20 nvme-Amazon_Elastic_Block_Store_vol09ba5de1ae91458e4-part4 -> ../../nvme0n1p4
...
[bound] -bash-5.0$ 

But...that 2DB46DE31B58F726F value is dynamic.

Anyways, one idea is to directly support this in Ignition:

{
  "ignition": { "version": "3.0.0" },
  "storage": {
    "instance-disks": "stripe",
    "filesystems": [{
      "device": "/dev/ignition/instance-storage",
      "path": "/var",
      "format": "xfs",
      "label": "DATA"
    }]
  },
  "systemd": {
    "units": [{
      "name": "var.mount",
      "enabled": true,
      "contents": "[Mount]\nWhat=/dev/ignition/instance-storage\nWhere=/var\nType=xfs\n\n[Install]\nWantedBy=local-fs.target"
    }]
  }
}

This would automatically find all instance-local disks and use RAID0 if appropriate (or just match the single block device directly).

Now clearly the MCO and machineAPI (for example) could be set up to pass correct Ignition userdata to the instance depending on its type...but that requires exact coordination between the thing provisioning the VM and the provided Ignition, and also encoding an understanding of instance types into the thing rendering the Ignition (in the AWS case).
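
For reference, here's a rough sketch (untested, and assuming a fixed shape of exactly two instance-store disks at /dev/nvme1n1 and /dev/nvme2n1) of what that per-instance-type config looks like today with the existing spec, using the stock storage.raid support:

{
  "ignition": { "version": "3.0.0" },
  "storage": {
    "raid": [{
      "name": "instance-storage",
      "level": "raid0",
      "devices": ["/dev/nvme1n1", "/dev/nvme2n1"]
    }],
    "filesystems": [{
      "device": "/dev/md/instance-storage",
      "path": "/var",
      "format": "xfs",
      "label": "DATA"
    }]
  },
  "systemd": {
    "units": [{
      "name": "var.mount",
      "enabled": true,
      "contents": "[Mount]\nWhat=/dev/md/instance-storage\nWhere=/var\nType=xfs\n\n[Install]\nWantedBy=local-fs.target"
    }]
  }
}

The device list is baked in, which is exactly why whatever renders the config has to know the instance type up front.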

I suspect support for striping would cover 90% of cases and allow people to use a common Ignition config for multiple scenarios.

But it would add more cloud specifics into Ignition.

Another approach is basically to punt and not use Ignition partitioning: have a systemd unit that runs in the real root, is cloud-aware, and e.g. generates a mount unit for just /var/lib/containers as opposed to all of /var. But supporting /var as instance storage is so much more elegant.

@cgwalters (Member Author) commented Nov 18, 2020

This problem domain clearly generalizes to e.g. bare metal scenarios with heterogeneous server hardware, where one wants to be able to say something dynamic like "RAID 1 all drives you see matching this set of hardware vendors".

@cgwalters (Member Author)

One possibility I guess would be to go to a "two phase" approach where the instance boots in an ephemeral mode (tmpfs on /etc and /var), runs arbitrary code to inspect the system, generates an Ignition config, drops it into /boot, and then reruns Ignition. But a key use case for instance-local storage is autoscaling preemptible VMs for ephemeral workloads, and the extra reboot is kind of eww for that.

@bgilbert (Contributor)

Ignition generally does exactly what it's told to do, and doesn't automatically detect things. But Afterburn is all about querying the cloud platform for instance metadata. Spitballing a differently-hacky idea: add an Afterburn mode that runs before Ignition config fetch, generates an Ignition config fragment, and drops it in the base config directory. That mode would (currently) have to run before config fetch, so the config fragment couldn't be based on any user-provided configuration. However, it could aggregate all the instance disks into a RAID with a well-known name. The user config could then change the RAID level, if desired, and put a filesystem on top.
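
To make that concrete, the generated fragment might look something like this (purely illustrative; nothing like this exists today, and the default level here is just a guess):

{
  "ignition": { "version": "3.0.0" },
  "storage": {
    "raid": [{
      "name": "instance-storage",
      "level": "raid0",
      "devices": ["/dev/nvme1n1", "/dev/nvme2n1"]
    }]
  }
}

The user config would then reference /dev/md/instance-storage for its filesystem, overriding the RAID level if desired.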

Downside: the automatically-generated RAID would preclude the user from putting filesystems directly on individual instance disks.

@cgwalters (Member Author)

Hmm at least in AWS it doesn't seem like the instance store devices are in the metadata; they just show up as block devices to the instance.

Further, we can't hardcode a policy in afterburn; it needs to be supported for something else to use the instance storage (e.g. Ceph, a database cache, etc.). It'd be a backwards-incompatible change for us to default to consuming it.

Perhaps afterburn could try to gather a convenient list of block devices (something like symlinks in /dev/coreos/instance-storage/) and then...if we had glob support in Ignition (something like "if passed 1 block device, just pass it through, otherwise raid0") then the Ignition config could use that?

The block level aspect makes this much more Ignition than Afterburn though I think.

> Ignition generally does exactly what it's told to do, and doesn't automatically detect things.

Yeah I know. But...ugly tradeoffs abound. This would be very "cloud native" at least.

@bgilbert (Contributor)

> Hmm at least in AWS it doesn't seem like the instance store devices are in the metadata; they just show up as block devices to the instance.

It appears that non-NVMe devices should show up in instance metadata, and NVMe devices can be distinguished by device model.
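
For reference, the model is visible directly from the instance without any metadata call; e.g. on an m5d (abbreviated, illustrative output):

$ lsblk -d -o NAME,MODEL
NAME    MODEL
nvme0n1 Amazon Elastic Block Store
nvme1n1 Amazon EC2 NVMe Instance Storage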

> Perhaps afterburn could try to gather a convenient list of block devices (something like symlinks in /dev/coreos/instance-storage/) and then...if we had glob support in Ignition (something like "if passed 1 block device, just pass it through, otherwise raid0") then the Ignition config could use that?

Yeah, I'm not immediately seeing a clean solution. It seems worth more discussion, though. I think the approach most consistent with Ignition's design is to say "the Ignition config is expected to understand any hardware it wants to configure", but as you say, the rest of the stack may not be equipped to deal with that.

@cgwalters (Member Author)

Hum. I guess at least for OpenShift, the fact that we always perform an OS update+reboot means we could wedge this whole thing into the MCO or in custom Ignition to start, basically blow away + remount /var/lib/containers and /var/lib/etcd on the firstboot.

We can experiment with that and if successful try to drive it into base CoreOS.
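
As a strawman for that experiment (completely untested; assumes a single instance-store disk at /dev/nvme1n1 and only covers /var/lib/containers), the injected units could look roughly like:

# var-lib-containers-format.service (hypothetical name)
[Unit]
Description=Format instance storage for /var/lib/containers
DefaultDependencies=no
ConditionFirstBoot=yes
Requires=dev-nvme1n1.device
After=dev-nvme1n1.device
Before=var-lib-containers.mount

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/mkfs.xfs -f -L containers /dev/nvme1n1

[Install]
RequiredBy=var-lib-containers.mount

# var-lib-containers.mount
[Mount]
What=/dev/disk/by-label/containers
Where=/var/lib/containers
Type=xfs

[Install]
WantedBy=local-fs.target

Multiple instance disks would need an mdadm step before the mkfs, which is exactly the per-instance-type knowledge we'd rather not hardcode.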

> It appears that non-NVMe devices should show up in instance metadata,

Yeah, the ones I'm interested in here are NVMe.

> and NVMe devices can be distinguished by device model.

Right, but...hm, I guess maybe we could add "model matching" into Ignition? That could be generic enough to work across bare metal too.
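
Purely as a strawman (no such field exists in the spec today; the name is made up), "model matching" might be a device selector accepted wherever a device path is today:

{
  "storage": {
    "raid": [{
      "name": "instance-storage",
      "level": "raid0",
      "deviceSelector": { "model": "Amazon EC2 NVMe Instance Storage" }
    }]
  }
}

which is essentially a tiny subset of a udev match expressed in the config.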

@arithx (Contributor) commented Nov 19, 2020

> Hum. I guess at least for OpenShift, the fact that we always perform an OS update+reboot means we could wedge this whole thing into the MCO or in custom Ignition to start, basically blow away + remount /var/lib/containers and /var/lib/etcd on the firstboot.

At least for the IPI case OCP would know exactly what storage is present in the instance type / it has configured to be added and could automatically generate the relevant config snippet to do this under the current Ignition model (without the additional symlinks).

@cgwalters (Member Author)

> At least for the IPI case OCP would know exactly what storage is present in the instance type / it has configured to be added and could automatically generate the relevant config snippet to do this under the current Ignition model (without the additional symlinks).

It could know, but it doesn't today, and fixing that is nontrivial. We currently have a single pointer config applied to all instance types. The thing provisioning VMs (https://github.com/openshift/machine-api-operator) is distinct from the thing generating Ignition configs (https://github.com/openshift/machine-config-operator/), with just a few links between them. We absolutely could rearchitect this; that's openshift/machine-config-operator#1619

I thought about this more, though, and agree that taking that direction long term would be cleaner.

We might need some better mechanisms in either Ignition or CoreOS to do the "matching"; maybe a udev rule that generates e.g. /dev/aws/instancestore0 and /dev/aws/instancestore1 or so. A generalization of this would be Ignition having something like a "query language" around block devices; kind of like a tiny subset of what udev rules allow. You can see some of the queries invented in https://github.com/cgwalters/coreos-cloud-instance-store-provisioner/blob/azure/src/main.rs#L191

Anyway, closing based on the above for now.

@lucab (Contributor) commented Nov 25, 2020

> We might need some better mechanisms in either Ignition or CoreOS to do the "matching"; maybe a udev rule that generates e.g. /dev/aws/instancestore0 and /dev/aws/instancestore1 or so

For context, we already have cases where we do this kind of cloud-specific symlinking for block devices in the initramfs via udev rules.
A few of those rules rely on external bash scripts, as the conditional logic is usually quite simple. If it gets more complex, it can be stuffed into Afterburn too.
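
A minimal sketch of what such a rule could look like for the AWS case (illustrative only; the file name and symlink layout are invented here):

# /usr/lib/udev/rules.d/90-aws-instance-store.rules
KERNEL=="nvme*n*", ENV{DEVTYPE}=="disk", ATTRS{model}=="Amazon EC2 NVMe Instance Storage*", SYMLINK+="aws/instance-store-%k"

That yields stable names like /dev/aws/instance-store-nvme1n1 regardless of disk count; strictly sequential names (instancestore0, instancestore1, ...) would need a small helper script via PROGRAM=, which is the bash-script pattern mentioned above.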

cgwalters added commits to cgwalters/fedora-coreos-docs that referenced this issue Dec 15, 2020
jlebon pushed a commit to coreos/fedora-coreos-docs that referenced this issue Dec 15, 2020
@jlebon (Member) commented Jul 13, 2022

I think conceptually this was moved to e.g. coreos/fedora-coreos-tracker#1122, coreos/fedora-coreos-tracker#601, coreos/fedora-coreos-tracker#1165, etc...

@cgwalters (Member Author)

Yeah. Also worth noting, though: today we have a Live ISO, in which one can execute completely arbitrary code before installing to disk; specifically, one can inspect the hardware and e.g. dynamically generate Ignition that is passed to coreos-installer.

The Live ISO is not necessarily ergonomic to use in all clouds though.
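
For completeness, that flow is roughly (sketch only; the device detection and the install target are illustrative, and generated.ign stands in for whatever config gets rendered):

# in the live environment, before installing
instance_disks=$(lsblk -dno NAME,MODEL | awk '/Instance Storage/ {print "/dev/"$1}')
# ...use $instance_disks to render the storage/raid sections of generated.ign, e.g. via butane or jq...
coreos-installer install /dev/sda --ignition-file generated.ign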
