K8S kubevirt 'allocatable: devices.kubevirt.io/vhost-net: "0"' with Flatcar 3850.0.0+ #1336

Closed
ader1990 opened this issue Jan 30, 2024 · 13 comments · Fixed by flatcar/scripts#1654
Labels
kind/bug Something isn't working

Comments

@ader1990

Description

If K8S + kubevirt is installed on Flatcar 3850.0.0+, the allocatable vhost-net devices are 0.

Impact

This issue impacts creating k8s kubevirt vms (no vms can be created if there are no allocatable vhost-net devices).

Environment and steps to reproduce

My environment was a k8s created on baremetal ARM64 using Flatcar 3850.0.0 stock image and automation from https://github.com/cloudbase/BMK/tree/flatcar_sysext.

Expected behavior

$: kubectl get node -A -o yaml | grep -i vhost-net
allocatable: devices.kubevirt.io/vhost-net: 1k

Additional information

If the node is rebooted, the vhost-net allocatable devices are back to the expected size.
Sometimes the issue cannot be reproduced, which suggests a race condition.
The issue is not present on the current stable or beta releases.

While debugging this issue, I saw that the kubevirt implementation tries to open the /dev/vhost-net device file so that the vhost-net kernel module gets autoloaded. I have created a small golang test script and I can confirm that opening the device file does not autoload the kernel module. More debugging is needed to see whether the 6.6 Linux kernel has module autoloading disabled.

@ader1990
Author

My golang repro code:

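package main

import (
    "fmt"
    "os"
)

func main() {
    // Opening the static device node is what should make the kernel
    // request the vhost_net module on demand.
    devnode, err := os.Open("/dev/vhost-net")
    if err != nil {
        fmt.Println("/dev/vhost-net failed to open")
        // Report the underlying reason on stderr (missing node, no driver, ...).
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer devnode.Close()
    fmt.Println("/dev/vhost-net opened")
}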

When I run it on the Flatcar env without the vhost-net module preloaded, I get /dev/vhost-net failed to open.

@jepio
Member

jepio commented Jan 31, 2024

@ader1990 are you running this from a systemd unit early in boot or something?
It works fine here:

$ ssh core
Warning: Permanently added '[localhost]:2222' (ED25519) to the list of known hosts.
Last login: Wed Jan 31 08:46:28 UTC 2024 on tty1
Flatcar Container Linux by Kinvolk alpha 3850.0.0 for QEMU
core@localhost ~ $ lsmod | grep vhost
core@localhost ~ $ ls -la /dev/vhost-net
crw-rw-rw-. 1 root kvm 10, 238 Jan 31 08:46 /dev/vhost-net
core@localhost ~ $ ./main
/dev/vhost-net opened
core@localhost ~ $ lsmod | grep vhost
vhost_net              36864  0
tun                    69632  1 vhost_net
vhost                  65536  1 vhost_net
vhost_iotlb            16384  1 vhost
tap                    28672  1 vhost_net
core@localhost ~ $

@ader1990
Author

Hello @jepio,

I am running it manually after the normal boot process, on a baremetal ARM64 server. I have also tried on an x64 Hyper-V VM, and I get the same issue. When the VM runs on QEMU-KVM, I think the module gets loaded automatically because of the underlying virtualization. I have used the https://alpha.release.flatcar-linux.net/arm64-usr/current/flatcar_production_image.bin.bz2 image.

From my testing, only a modprobe vhost_net reliably creates /dev/vhost-net.

Before opening an issue in the kubevirt repo, I will try the upstream master of kubevirt, just to make sure the issue reproduces reliably. The kubevirt implementation relies on an open/close of the device file to trigger a module load, which does not seem to work: https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-handler/device-manager/generic_device.go#L117

Thank you,
Adrian

@ader1990
Author

Replying to @jepio's question above ("are you running this from a systemd unit early in boot or something? It works fine here", followed by the QEMU session output):

What I also observed is that /dev/vhost-net can be present even when the vhost_net module is not loaded, because of the QEMU implementation. Can you confirm this scenario in your environment as well?
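
Checking both conditions side by side (node present, module loaded) can be done without lsmod. This is a small sketch under the assumption that /proc/modules lists one module per line with the module name as the first field; the helper name moduleLoaded is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// moduleLoaded reports whether name appears as the first field of any line
// in /proc/modules-style text. Matching only the first field avoids false
// positives from the "used by" column.
func moduleLoaded(procModules, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(procModules))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 && fields[0] == name {
			return true
		}
	}
	return false
}

func main() {
	_, statErr := os.Stat("/dev/vhost-net")
	// A read error is treated as "no modules listed" in this sketch.
	data, _ := os.ReadFile("/proc/modules")
	fmt.Println("node present: ", statErr == nil)
	fmt.Println("module loaded:", moduleLoaded(string(data), "vhost_net"))
}
```

Running it before and after the open/close test shows whether the node exists independently of the module being loaded.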

@jepio
Member

jepio commented Jan 31, 2024

The way module autoloading works is:

  • the module declares an alias on a device node with a specific major:minor
  • userspace precreates a device node with that major:minor
  • when other userspace opens the device node the kernel requests the module
  • udev loads the module

The files involved are:

$ systemctl cat kmod-static-nodes
# /usr/lib/systemd/system/kmod-static-nodes.service
#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Create List of Static Device Nodes
DefaultDependencies=no
Before=sysinit.target systemd-tmpfiles-setup-dev.service
ConditionCapability=CAP_SYS_MODULE
ConditionFileNotEmpty=/lib/modules/%v/modules.devname

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/static-nodes.conf
$ cat  /lib/modules/$(uname -r)/modules.devname
# Device nodes to trigger on-demand module loading.
fuse fuse c10:229
cuse cuse c10:203
btrfs btrfs-control c10:234
nvram nvram c10:144
loop loop-control c10:237
tun net/tun c10:200
ppp_generic ppp c108:0
dm_mod mapper/control c10:236
vfio vfio/vfio c10:196
vhost_net vhost-net c10:238
vhost_vsock vhost-vsock c10:241
$ cat /run/tmpfiles.d/static-nodes.conf
c! /dev/fuse 0600 - - - 10:229
c! /dev/cuse 0600 - - - 10:203
c! /dev/btrfs-control 0600 - - - 10:234
c! /dev/nvram 0600 - - - 10:144
c! /dev/loop-control 0600 - - - 10:237
d /dev/net 0755 - - -
c! /dev/net/tun 0600 - - - 10:200
c! /dev/ppp 0600 - - - 108:0
d /dev/mapper 0755 - - -
c! /dev/mapper/control 0600 - - - 10:236
d /dev/vfio 0755 - - -
c! /dev/vfio/vfio 0600 - - - 10:196
c! /dev/vhost-net 0600 - - - 10:238
c! /dev/vhost-vsock 0600 - - - 10:241
$ systemctl cat systemd-tmpfiles-setup-dev
# /usr/lib/systemd/system/systemd-tmpfiles-setup-dev.service
#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Create Static Device Nodes in /dev
Documentation=man:tmpfiles.d(5) man:systemd-tmpfiles(8)

DefaultDependencies=no
After=systemd-sysusers.service
Before=sysinit.target local-fs-pre.target systemd-udevd.service
Conflicts=shutdown.target initrd-switch-root.target
Before=shutdown.target initrd-switch-root.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=systemd-tmpfiles --prefix=/dev --create --boot
SuccessExitStatus=DATAERR CANTCREAT
LoadCredential=tmpfiles.extra

I'm not sure why this is failing; perhaps modules.devname is missing from the initrd? We might need a Flatcar dev build that prints more info before those services run, to see what the file contents are at that time. Both services are supposed to run from the initrd.

@jepio
Member

jepio commented Jan 31, 2024

Actually, I'm seeing that on QEMU the services run a second time after switch root. Could you figure out why this doesn't happen on Azure/VMware?

@ader1990
Author

On the baremetal ARM64 machine:

cat  /lib/modules/$(uname -r)/modules.devname
# Device nodes to trigger on-demand module loading.
fuse fuse c10:229
cuse cuse c10:203
btrfs btrfs-control c10:234
loop loop-control c10:237
tun net/tun c10:200
ppp_generic ppp c108:0
dm_mod mapper/control c10:236
vhost_net vhost-net c10:238

Seems that the mapping is correct.

@jepio
Member

jepio commented Feb 1, 2024

The file itself is correct, but something must be going wrong with the systemd units that create the dev nodes based on that file. I'll leave it to you to investigate.

@ader1990
Author

ader1990 commented Feb 1, 2024

Hello,

After various retries, I think I found the culprit: the systemd service systemd-tmpfiles-setup-dev should be creating these device nodes.

If I restart the service, the links are correctly created and udev shows the correct events (module loaded) when trying to access the /dev/vhost-net with open/close (via the golang program).

The problem is that systemd-tmpfiles-setup-dev sometimes runs before the /run/tmpfiles.d/static-nodes.conf file is created.
This happens randomly (tried 10 or so reboots).

See below: the file was created at 08:42:15, while systemd-tmpfiles-setup-dev had already finished at 08:42:12.

localhost ~ # ls -liath /run/tmpfiles.d/static-nodes.conf --full-time
73 -rw-r--r--. 1 root root 457 2024-02-01 08:42:15.676000000 +0000 /run/tmpfiles.d/static-nodes.conf
localhost ~ # systemctl status systemd-tmpfiles-setup-dev
● systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev
     Loaded: loaded (/usr/lib/systemd/system/systemd-tmpfiles-setup-dev.service; static)
     Active: active (exited) since Thu 2024-02-01 08:42:12 UTC; 3min 39s ago
       Docs: man:tmpfiles.d(5)
             man:systemd-tmpfiles(8)
   Main PID: 175 (code=exited, status=0/SUCCESS)

Feb 01 08:42:12 localhost systemd[1]: Finished systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.

Maybe add ConditionFileNotEmpty=/run/tmpfiles.d/static-nodes.conf to systemd-tmpfiles-setup-dev.service?
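
The ordering check done by eye above can be sketched in code. This is only an illustration using the two timestamps from this comment: the function name ranBeforeConf and the layout constants are hypothetical, and the file's mtime is assumed to be a good proxy for when it was generated.

```go
package main

import (
	"fmt"
	"time"
)

// Layouts matching the two timestamp formats printed in the comment above.
const (
	lsLayout      = "2006-01-02 15:04:05.000000000 -0700" // ls --full-time
	systemdLayout = "Mon 2006-01-02 15:04:05 MST"          // systemctl status "since"
)

// ranBeforeConf reports whether the service became active before the
// static-nodes.conf file was last written.
func ranBeforeConf(confMtime, serviceActiveSince string) (bool, error) {
	m, err := time.Parse(lsLayout, confMtime)
	if err != nil {
		return false, err
	}
	s, err := time.Parse(systemdLayout, serviceActiveSince)
	if err != nil {
		return false, err
	}
	return s.Before(m), nil
}

func main() {
	raced, err := ranBeforeConf(
		"2024-02-01 08:42:15.676000000 +0000",
		"Thu 2024-02-01 08:42:12 UTC",
	)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("tmpfiles ran before static-nodes.conf was written:", raced)
}
```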

@jepio
Member

jepio commented Feb 1, 2024

That wouldn't be correct since systemd-tmpfiles-setup-dev processes various files, not just that one.

@pothos
Member

pothos commented Feb 8, 2024

Adrian found out that this has to do with the way we pull in systemd-tmpfiles-setup-dev.service early¹, because kmod-static-nodes.service wouldn't have run by then. Upstream kmod-static-nodes.service has the Before=, but systemd-tmpfiles-setup-dev.service is missing a Wants= to actually pull the unit in. It only gets pulled in in the final system², not in the initrd, by sysinit.target.wants. I would suggest adding an explicit Wants= as a drop-in and maybe getting this upstream. But we diverge a bit due to how we pull in systemd-tmpfiles-setup-dev.service. Maybe there are alternatives, and we could make some direct calls with a similar effect to create the /dev/ nodes we want instead of starting this service early.

¹ Edit: Note that even before my change we pulled it in for the PXE path to loop mount /usr.squashfs provided by the cpio.
² The sysinit.target is not used in the initrd: https://www.freedesktop.org/software/systemd/man/latest/bootup.html#Bootup%20in%20the%20initrd

@pothos
Member

pothos commented Feb 8, 2024

Wait, maybe we should be using systemd-tmpfiles-setup-dev-early.service in bootengine, I hope it does the same for us when it skips things "not safe on boot".

Edit: That still leaves the question of whether we need to start the kmod service in the initrd, but if we don't start it, at least systemd-tmpfiles-setup-dev.service will run in the final system to process its generated files.

@pothos
Member

pothos commented Feb 8, 2024

In Fedora with a newer systemd version this is what I see as definition:

# /usr/lib/systemd/system/kmod-static-nodes.service
#  SPDX-License-Identifier: LGPL-2.1-or-later
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Create List of Static Device Nodes
DefaultDependencies=no
Before=sysinit.target systemd-tmpfiles-setup-dev-early.service
ConditionCapability=CAP_SYS_MODULE
ConditionFileNotEmpty=/lib/modules/%v/modules.devname

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/kmod static-nodes --format=tmpfiles --output=/run/tmpfiles.d/static-nodes.conf

If systemd-tmpfiles-setup-dev-early.service is not available in systemd 252, we need to update first.
Edit: Looks like systemd-tmpfiles-setup-dev-early doesn't exist in 252: https://alpha.release.flatcar-linux.net/amd64-usr/current/flatcar_production_image_initrd_contents.txt
How about defining our own version until we update?

ader1990 added a commit to ader1990/scripts that referenced this issue Feb 9, 2024
Update the bootengine commit to use the fix from:
flatcar/bootengine#85

Fixes: flatcar/Flatcar#1336

Signed-off-by: Adrian Vladu <avladu@cloudbasesolutions.com>
ader1990 added a commit to flatcar/scripts that referenced this issue Feb 13, 2024
Update the bootengine commit id to use the fix from:
flatcar/bootengine#85

Fixes: flatcar/Flatcar#1336

Signed-off-by: Adrian Vladu <avladu@cloudbasesolutions.com>
ader1990 added a commit to flatcar/scripts that referenced this issue Feb 14, 2024
Update the bootengine commit id to use the fix from:
flatcar/bootengine#85

Fixes kubevirt vm creation by ensuring that /dev/vhost-net static node gets created
Fixes: flatcar/Flatcar#1336

Signed-off-by: Adrian Vladu <avladu@cloudbasesolutions.com>
@github-project-automation github-project-automation bot moved this from 📝 Needs Triage to Implemented in Flatcar tactical, release planning, and roadmap Feb 14, 2024
pothos pushed a commit to flatcar/scripts that referenced this issue Mar 5, 2024
Update the bootengine commit id to use the fix from:
flatcar/bootengine#85

Fixes kubevirt vm creation by ensuring that /dev/vhost-net static node gets created
Fixes: flatcar/Flatcar#1336

Signed-off-by: Adrian Vladu <avladu@cloudbasesolutions.com>