Test iso-offline-install on multipath on ppc64le and aarch64 failing coreos-ignition-unique-boot.service check #1373

Closed
jlebon opened this issue Jan 9, 2023 · 4 comments · Fixed by coreos/fedora-coreos-config#2181

Comments

@jlebon
Member

jlebon commented Jan 9, 2023

[2023-01-07T14:56:33.035Z] kola -p qemu-unpriv --output-dir /home/jenkins/agent/workspace/build-arch/tmp/kolaTestIso-r5FDc/kola-testiso-multipath testiso -S --qemu-multipath --scenarios iso-offline-install
...
[2023-01-07T14:59:17.380Z] FAIL: iso-offline-install ( + metal + multipath) (2m41.811s)
[2023-01-07T14:59:17.380Z]     entered emergency.target in initramfs
[2023-01-07T14:59:17.380Z] Error: entered emergency.target in initramfs

coreos-installer runs successfully and reboots the machine, and then:

Error: System has 2 devices with a filesystem labeled 'boot': ["/dev/sdb3", "/dev/mapper/mpatha3"]
coreos-ignition-unique-boot.service: Main process exited, code=exited, status=1/FAILURE
coreos-ignition-unique-boot.service: Failed with result 'exit-code'.
Failed to start coreos-ignition-unique-boot.service - CoreOS Ensure Unique Boot Filesystem.

I suspect something is going wrong with rdcore verify-unique-fs-label's multipath detection.

iso-offline-install.zip

@jlebon jlebon changed the title Test iso-offline-install on multipath on ppc64le failing coreos-ignition-unique-boot.service check Test iso-offline-install on multipath on ppc64le and aarch64 failing coreos-ignition-unique-boot.service check Jan 9, 2023
@jlebon
Member Author

jlebon commented Jan 9, 2023

The ppc64le failure happened on rawhide and f37. The aarch64 failure happened on f37, with the following package diff:

15:57:00  Upgraded:
15:57:00    containers-common 4:1-73.fc37 -> 4:1-76.fc37
15:57:00    containers-common-extra 4:1-73.fc37 -> 4:1-76.fc37
15:57:00    kernel 6.0.16-300.fc37 -> 6.0.17-300.fc37
15:57:00    kernel-core 6.0.16-300.fc37 -> 6.0.17-300.fc37
15:57:00    kernel-modules 6.0.16-300.fc37 -> 6.0.17-300.fc37

It doesn't happen on every run, so the failure appears to be flaky.

@dustymabe
Member

This seems to have happened again in:

@dustymabe
Member

Just saw this on x86_64 in CI for coreos/fedora-coreos-config#2179.

ignition-virtio-dump.txt

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Jan 18, 2023
We're hitting an issue right now where
`coreos-ignition-unique-boot.service` (backed by `rdcore`) is failing
on multipath with:

```
Error: System has 2 devices with a filesystem labeled 'boot': ["/dev/sdb3", "/dev/mapper/mpatha3"]
```

The unique label detection code in `rdcore` determines whether multiple
lower-level devices actually refer to the same higher-level device (e.g.
multipath or RAID1) by looking at the filesystem UUID. It uses blkid to
query device UUIDs.
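
To make the UUID comparison concrete, here is a minimal sketch of the idea as described above. It is not `rdcore`'s actual code: the `FsEntry` record and the `conflicting_devices` helper are hypothetical names, and the device list is passed in by hand instead of being queried through blkid.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class FsEntry(NamedTuple):
    """Hypothetical record of what a blkid-style lookup reports for a device."""
    device: str
    label: Optional[str]
    uuid: Optional[str]


def conflicting_devices(entries, label):
    """Return the devices that conflict on `label`, or [] if it is unique.

    Devices sharing one filesystem UUID are treated as views of the same
    filesystem (e.g. a multipath map and its backing path).
    """
    by_uuid = defaultdict(list)
    for entry in entries:
        if entry.label != label:
            continue
        # A missing UUID cannot be matched to anything, so such a device
        # gets its own bucket keyed by its path.
        key = entry.uuid if entry.uuid is not None else f"unknown:{entry.device}"
        by_uuid[key].append(entry.device)
    if len(by_uuid) <= 1:
        return []
    return sorted(dev for devs in by_uuid.values() for dev in devs)
```

With matching UUIDs on `/dev/sdb3` and `/dev/mapper/mpatha3` the helper returns an empty list. If the entry for the backing path carries no UUID or an outdated one, both devices are reported, which mirrors the error message above.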

libblkid maintains a cache of devices to avoid reprobing all devices
all the time. This cache normally gets updated (I *think* via udev,
but I'm not sure) when changes occur. But something changed recently,
at least in the multipath case: the cache is now only updated for the
multipathed device, and not for the underlying backing paths.

This then leads `rdcore` to think that they're separate devices. We
probably should make `rdcore` smarter here in how it handles multipath
devices, but still we don't want to have this stale cache around for
the sake of other tools relying on it.

We started hitting this more frequently starting with kernel v6.0.17,
but the issue triggers just as easily on v6.0.16 when reproduced
artificially. So I think we've just been lucky so far that this hasn't
bitten us (possibly we raced with another service that helped refresh
the cache).

There's likely a bug here in the kernel, multipath, or blkid.
This is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=2162151.
Until that is fixed, nuke the blkid cache to force a reprobe on the next
call.
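
As a rough illustration of what dropping the cache could look like, the sketch below deletes the cache file so that the next lookup has to reprobe. This is not the actual change from the referenced commit: the `drop_blkid_cache` helper is hypothetical, and the default path is an assumption about where libblkid keeps its cache on the affected systems.

```python
import os

# Assumption: location of the libblkid cache file; verify on the target system.
DEFAULT_BLKID_CACHE = "/run/blkid/blkid.tab"


def drop_blkid_cache(path=DEFAULT_BLKID_CACHE):
    """Delete the cache file at `path`; return True if a file was removed."""
    try:
        os.unlink(path)
        return True
    except FileNotFoundError:
        # Nothing cached yet, so there is nothing stale to worry about.
        return False
```

Treating a missing file as success keeps the step safe to run before every check, whether or not a cache has been written yet.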

Closes: coreos/fedora-coreos-tracker#1373
@jlebon
Member Author

jlebon commented Jan 18, 2023

dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue Jan 19, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023