Can't detect/add Mellanox ConnectX-6 VFs via the plugin on my Openshift (on Openstack) installation #572

Open
nmcconom opened this issue Jul 10, 2024 · 9 comments

Comments

@nmcconom

What happened?

I have configured the plugin to look for the Mellanox ConnectX-6 VFs on my nodes. The VFs are present and appear to be detected on the node, but they are never added to the resource pools.

What did you expect to happen?

The VFs are added to their respective resource pools so they can be used in my pods.

What are the minimal steps needed to reproduce the bug?

Make Mellanox ConnectX-6 VFs available on one or more Openshift nodes and configure the plugin to select them.

Anything else we need to know?

lspci output from node
05:00.0 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
Subsystem: Mellanox Technologies Device [15b3:0012]
Physical Slot: 0-4
Flags: bus master, fast devsel, latency 0
Memory at fba00000 (64-bit, prefetchable) [size=1M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [9c] MSI-X: Enable+ Count=12 Masked-
Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

06:00.0 Ethernet controller [0200]: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function [15b3:101e]
Subsystem: Mellanox Technologies Device [15b3:0012]
Physical Slot: 0-5
Flags: bus master, fast devsel, latency 0
Memory at fb800000 (64-bit, prefetchable) [size=1M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [9c] MSI-X: Enable+ Count=12 Masked-
Capabilities: [100] Vendor Specific Information: ID=0000 Rev=0 Len=00c <?>
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

Component Versions

Please fill in the below table with the version numbers of components used.

| Component | Version |
| --- | --- |
| SR-IOV Network Device Plugin | 3.7.0 |
| SR-IOV CNI Plugin | Openshift 4.12.42 |
| Multus | Openshift 4.12.42 |
| Kubernetes | 1.25 |
| OS | Openshift 4.12/RHCOS 8.6 |

Config Files

ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "sriov_client_side",
          "resourcePrefix": "mellanox",
          "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:00:05.0"]
          }
        },
        {
          "resourceName": "sriov_server_side",
          "resourcePrefix": "mellanox",
          "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:00:06.0"]
          }
        }
      ]
    }

Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')

{"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5}

CNI config (Try '/etc/cni/net.d/')

{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ {"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5} ] }

Kubernetes deployment type ( Bare Metal, Kubeadm etc.)

Openshift 4.12.42

Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)

I0710 11:28:06.727499 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0710 11:28:06.727846 1 main.go:46] resource manager reading configs
I0710 11:28:06.727909 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"drivers": ["netdevice"],
"pciAddresses": ["0000:00:05.0"]
}
},
{
"resourceName": "sriov_server_side",
"resourcePrefix": "mellanox",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"drivers": ["netdevice"],
"pciAddresses": ["0000:00:06.0"]
}
}
]
}
I0710 11:28:06.728042 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc00042a900]
I0710 11:28:06.728085 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_server_side is [0xc00042ac60]
I0710 11:28:06.728092 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000400eb8 AdditionalInfo:map[] SelectorObjs:[0xc00042a900]} {ResourcePrefix:mellanox ResourceName:sriov_server_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000400ed0 AdditionalInfo:map[] SelectorObjs:[0xc00042ac60]}]
I0710 11:28:06.728152 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0710 11:28:06.728203 1 manager.go:217] validating resource name "mellanox/sriov_server_side"
I0710 11:28:06.728210 1 main.go:62] Discovering host devices
I0710 11:28:06.845790 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0710 11:28:06.845883 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.845894 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.845901 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0710 11:28:06.845942 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.846313 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0710 11:28:06.846510 1 main.go:68] Initializing resource servers
I0710 11:28:06.846526 1 manager.go:117] number of config: 2
I0710 11:28:06.846544 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0710 11:28:06.846548 1 manager.go:122] DeviceType: netDevice
I0710 11:28:06.847037 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0710 11:28:06.847055 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_client_side
I0710 11:28:06.847061 1 manager.go:121] Creating new ResourcePool: sriov_server_side
I0710 11:28:06.847066 1 manager.go:122] DeviceType: netDevice
I0710 11:28:06.847495 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0710 11:28:06.847512 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_server_side
I0710 11:28:06.847518 1 main.go:74] Starting all servers...
I0710 11:28:06.847523 1 main.go:79] All servers started.
I0710 11:28:06.847529 1 main.go:80] Listening for term signals

Multus logs (If enabled. Try '/var/log/multus.log' )

2024-07-10T11:04:25+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_f3bb1262-de44-46c1-8d11-2b04b60ac649 to /host/opt/cni/bin/
2024-07-10T11:04:25+00:00 WARN: {unknown parameter "-"}
2024-07-10T11:04:25+00:00 Entrypoint skipped copying Multus binary.
2024-07-10T11:04:25+00:00 Generating Multus configuration file using files in /host/var/run/multus/cni/net.d...
2024-07-10T11:04:25+00:00 Attempting to find master plugin configuration, attempt 0
2024-07-10T11:04:29+00:00 Using MASTER_PLUGIN: 10-ovn-kubernetes.conf
2024-07-10T11:04:29+00:00 Nested capabilities string:
2024-07-10T11:04:29+00:00 Using /host/var/run/multus/cni/net.d/10-ovn-kubernetes.conf as a source to generate the Multus configuration
2024-07-10T11:04:29+00:00 Config file created @ /host/etc/cni/net.d/00-multus.conf
{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "binDir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ {"cniVersion":"0.4.0","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{},"logFile":"/var/log/ovn-kubernetes/ovn-k8s-cni-overlay.log","logLevel":"4","logfile-maxsize":100,"logfile-maxbackups":5,"logfile-maxage":5} ] }
2024-07-10T11:04:29+00:00 Entering watch loop...

Kubelet logs (journalctl -u kubelet)
@SchSeba
Collaborator

SchSeba commented Jul 14, 2024

The PCI address in the config is not right: the bus and device fields are swapped.

Your config: "pciAddresses": ["0000:00:06.0"]
Device discovered by the device plugin: 0000:06:00.0
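
For reference, a PCI address has the form domain:bus:device.function, and lspci prints it without the domain by default (05:00.0 in the output above). The following is a minimal Python sketch of that mapping, useful for checking a `pciAddresses` entry against lspci output. The helper names `split_pci_address` and `normalize_pci_address` are hypothetical and are not part of the device plugin.

```python
import re

# domain:bus:device.function, all hexadecimal; the domain is optional
# because lspci omits it by default.
_PCI_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{4}):)?"
    r"(?P<bus>[0-9a-fA-F]{2}):"
    r"(?P<device>[0-9a-fA-F]{2})\."
    r"(?P<function>[0-7])$"
)


def split_pci_address(address: str) -> dict:
    """Split a PCI address into its fields; a missing domain becomes 0000."""
    match = _PCI_RE.match(address.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {address!r}")
    fields = match.groupdict()
    fields["domain"] = fields["domain"] or "0000"
    return {name: value.lower() for name, value in fields.items()}


def normalize_pci_address(address: str) -> str:
    """Return the full domain:bus:device.function form."""
    f = split_pci_address(address)
    return f"{f['domain']}:{f['bus']}:{f['device']}.{f['function']}"


if __name__ == "__main__":
    # Short addresses as printed by lspci on the node.
    for short in ("05:00.0", "06:00.0"):
        print(short, "->", normalize_pci_address(short))
    # The address from the original ConfigMap: bus 00, device 06.
    print(split_pci_address("0000:00:06.0"))
```

Splitting the configured value makes the mismatch visible: the ConfigMap entry points at bus 00, device 06, while the discovered VF sits at bus 06, device 00.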

@nmcconom
Author

Hi, I corrected that error in the ConfigMap, but the end result was the same: 0 devices were added.

See the updated log output below.

I0715 12:48:59.624977       1 manager.go:57] Using Kubelet Plugin Registry Mode
I0715 12:48:59.626222       1 main.go:46] resource manager reading configs
I0715 12:48:59.626341       1 manager.go:86] raw ResourceList: {
"resourceList": [
    {
        "resourceName": "sriov_client_side",
        "resourcePrefix": "mellanox",
        "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:05:00.0"]
        }
    },
    {
        "resourceName": "sriov_server_side",
        "resourcePrefix": "mellanox",
        "selectors": {
            "vendors": ["15b3"],
            "devices": ["101e"],
            "drivers": ["netdevice"],
            "pciAddresses": ["0000:06:00.0"]
        }
    }
  ]
}
I0715 12:48:59.626637       1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc00017c240]
I0715 12:48:59.626668       1 factory.go:211] *types.NetDeviceSelectors for resource sriov_server_side is [0xc00017c5a0]
I0715 12:48:59.626675       1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00012c330 AdditionalInfo:map[] SelectorObjs:[0xc00017c240]} {ResourcePrefix:mellanox ResourceName:sriov_server_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00012c348 AdditionalInfo:map[] SelectorObjs:[0xc00017c5a0]}]
I0715 12:48:59.626862       1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0715 12:48:59.626893       1 manager.go:217] validating resource name "mellanox/sriov_server_side"
I0715 12:48:59.627022       1 main.go:62] Discovering host devices
I0715 12:48:59.726479       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0	02          	Red Hat, Inc.       	Virtio 1.0 network device               
I0715 12:48:59.726578       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727010       1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727205       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0	02          	Red Hat, Inc.       	Virtio 1.0 network device               
I0715 12:48:59.727231       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727237       1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0	02          	Mellanox Technolo...	ConnectX Family mlx5Gen Virtual Function
I0715 12:48:59.727250       1 main.go:68] Initializing resource servers
I0715 12:48:59.727256       1 manager.go:117] number of config: 2
I0715 12:48:59.727267       1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0715 12:48:59.727273       1 manager.go:122] DeviceType: netDevice
I0715 12:48:59.727797       1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0715 12:48:59.727813       1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_client_side
I0715 12:48:59.727819       1 manager.go:121] Creating new ResourcePool: sriov_server_side
I0715 12:48:59.727824       1 manager.go:122] DeviceType: netDevice
I0715 12:48:59.756721       1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0715 12:48:59.756744       1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_server_side
I0715 12:48:59.756750       1 main.go:74] Starting all servers...
I0715 12:48:59.756757       1 main.go:79] All servers started.
I0715 12:48:59.756762       1 main.go:80] Listening for term signals
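
The "will register N devices" lines are the quickest signal in these logs of whether a selector matched anything. As a rough aid for comparing runs, the sketch below pulls the per-pool counts out of a saved log. It assumes only the two message formats visible in the logs in this thread; the function name `devices_per_pool` is hypothetical.

```python
import re

# Message formats taken from the plugin logs quoted in this issue.
_POOL_RE = re.compile(r"Creating new ResourcePool: (\S+)")
_COUNT_RE = re.compile(r"will register (\d+) devices")


def devices_per_pool(log_text: str) -> dict:
    """Map each resource pool name to the number of devices it registered."""
    counts = {}
    current_pool = None
    for line in log_text.splitlines():
        pool = _POOL_RE.search(line)
        if pool:
            current_pool = pool.group(1)
            counts.setdefault(current_pool, 0)
            continue
        count = _COUNT_RE.search(line)
        if count and current_pool is not None:
            # A pool may have several selector entries; add them up.
            counts[current_pool] += int(count.group(1))
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "I0715 12:48:59.727267 1 manager.go:121] Creating new ResourcePool: sriov_client_side",
        "I0715 12:48:59.727797 1 manager.go:138] initServers(): selector index 0 will register 0 devices",
        "I0715 12:48:59.727819 1 manager.go:121] Creating new ResourcePool: sriov_server_side",
        "I0715 12:48:59.756721 1 manager.go:138] initServers(): selector index 0 will register 0 devices",
    ])
    print(devices_per_pool(sample))
```

A pool that reports 0 here never gets a resource server, which matches the "skipping creating resource server" lines in the log.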

@SchSeba
Collaborator

SchSeba commented Jul 17, 2024

One more step for a virtual environment: can you remove

"vendors": ["15b3"],
"devices": ["101e"],
"drivers": ["netdevice"],

from the ConfigMap? Please leave only the pciAddresses selector.

@nmcconom
Author

nmcconom commented Jul 17, 2024

I tried that, but unfortunately the end result was the same.

Logs for that attempt are below.

I0717 13:12:36.794382 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0717 13:12:36.794710 1 main.go:46] resource manager reading configs
I0717 13:12:36.794782 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:05:00.0"]
}
},
{
"resourceName": "sriov_server_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:06:00.0"]
}
}
]
}
I0717 13:12:36.794930 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc0004070e0]
I0717 13:12:36.794955 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_server_side is [0xc000407440]
I0717 13:12:36.794962 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc0003f2ed0 AdditionalInfo:map[] SelectorObjs:[0xc0004070e0]} {ResourcePrefix:mellanox ResourceName:sriov_server_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc0003f2ee8 AdditionalInfo:map[] SelectorObjs:[0xc000407440]}]
I0717 13:12:36.795051 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0717 13:12:36.795075 1 manager.go:217] validating resource name "mellanox/sriov_server_side"
I0717 13:12:36.795081 1 main.go:62] Discovering host devices
I0717 13:12:36.876613 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 13:12:36.876675 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.876683 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.876690 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 13:12:36.876747 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.877186 1 utils.go:494] excluding interface enp5s0: default route found: {Ifindex: 3 Dst: <nil> Src: 172.26.13.75 Gw: 172.26.13.1 Flags: [] Table: 254 Realm: 0}
I0717 13:12:36.877254 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 13:12:36.877429 1 utils.go:494] excluding interface enp6s0: default route found: {Ifindex: 4 Dst: <nil> Src: 172.26.14.175 Gw: 172.26.14.1 Flags: [] Table: 254 Realm: 0}
I0717 13:12:36.877455 1 main.go:68] Initializing resource servers
I0717 13:12:36.877463 1 manager.go:117] number of config: 2
I0717 13:12:36.877469 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0717 13:12:36.877487 1 manager.go:122] DeviceType: netDevice
I0717 13:12:36.877633 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0717 13:12:36.877649 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_client_side
I0717 13:12:36.877655 1 manager.go:121] Creating new ResourcePool: sriov_server_side
I0717 13:12:36.877659 1 manager.go:122] DeviceType: netDevice
I0717 13:12:36.877749 1 manager.go:138] initServers(): selector index 0 will register 0 devices
I0717 13:12:36.877762 1 manager.go:142] no devices in device pool, skipping creating resource server for sriov_server_side
I0717 13:12:36.877766 1 main.go:74] Starting all servers...
I0717 13:12:36.877772 1 main.go:79] All servers started.
I0717 13:12:36.877777 1 main.go:80] Listening for term signals

@nmcconom
Author

I noticed the line below, so I brought the interface down before restarting the device plugin pod:

I0717 13:12:36.877186 1 utils.go:494] excluding interface enp5s0: default route found: {Ifindex: 3 Dst: <nil> Src: 172.26.13.75 Gw: 172.26.13.1 Flags: [] Table: 254 Realm: 0}

That seems to allow the plugin to add the devices to the pools; each pool now registers 1 device.

I0717 14:14:23.049116 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0717 14:14:23.050546 1 main.go:46] resource manager reading configs
I0717 14:14:23.050650 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:05:00.0"]
}
},
{
"resourceName": "sriov_internet_side",
"resourcePrefix": "mellanox",
"selectors": {
"pciAddresses": ["0000:06:00.0"]
}
}
]
}
I0717 14:14:23.051034 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc0001d8240]
I0717 14:14:23.051117 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_internet_side is [0xc0000df440]
I0717 14:14:23.051970 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00019a330 AdditionalInfo:map[] SelectorObjs:[0xc0001d8240]} {ResourcePrefix:mellanox ResourceName:sriov_internet_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc00019a348 AdditionalInfo:map[] SelectorObjs:[0xc0000df440]}]
I0717 14:14:23.052090 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0717 14:14:23.052128 1 manager.go:217] validating resource name "mellanox/sriov_internet_side"
I0717 14:14:23.052139 1 main.go:62] Discovering host devices
I0717 14:14:23.136162 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 14:14:23.136287 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.136791 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.137021 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 14:14:23.137054 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.137062 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 14:14:23.137081 1 main.go:68] Initializing resource servers
I0717 14:14:23.137088 1 manager.go:117] number of config: 2
I0717 14:14:23.137101 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0717 14:14:23.137106 1 manager.go:122] DeviceType: netDevice
I0717 14:14:23.137648 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 14:14:23.137683 1 factory.go:124] device added: [identifier: 0000:05:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 14:14:23.137722 1 manager.go:156] New resource server is created for sriov_client_side ResourcePool
I0717 14:14:23.137731 1 manager.go:121] Creating new ResourcePool: sriov_internet_side
I0717 14:14:23.137736 1 manager.go:122] DeviceType: netDevice
I0717 14:14:23.138214 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 14:14:23.138237 1 factory.go:124] device added: [identifier: 0000:06:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 14:14:23.138253 1 manager.go:156] New resource server is created for sriov_internet_side ResourcePool
I0717 14:14:23.138260 1 main.go:74] Starting all servers...
I0717 14:14:23.138492 1 server.go:255] starting sriov_client_side device plugin endpoint at: mellanox_sriov_client_side.sock
I0717 14:14:23.139287 1 server.go:297] sriov_client_side device plugin endpoint started serving
I0717 14:14:23.139413 1 server.go:255] starting sriov_internet_side device plugin endpoint at: mellanox_sriov_internet_side.sock
I0717 14:14:23.139732 1 server.go:297] sriov_internet_side device plugin endpoint started serving
I0717 14:14:23.139752 1 main.go:79] All servers started.
I0717 14:14:23.139759 1 main.go:80] Listening for term signals
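
The "excluding interface ... default route found" lines quoted above suggest checking which interfaces carry a default route before starting the plugin. Below is a minimal Python sketch that reads the kernel's IPv4 routing table from /proc/net/route. It is an independent approximation, not the plugin's own check (which, per the log, lives in utils.go and may differ), and the function name `default_route_interfaces` is hypothetical.

```python
from pathlib import Path


def default_route_interfaces(route_table: str) -> set:
    """Return the interfaces that carry an IPv4 default route.

    `route_table` is the text of /proc/net/route: one header line, then
    whitespace-separated rows with the columns Iface, Destination, Gateway,
    Flags, RefCnt, Use, Metric, Mask, ... A default route has both
    Destination and Mask equal to 00000000.
    """
    interfaces = set()
    for line in route_table.splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 8:
            continue  # blank or truncated row
        iface, destination, mask = columns[0], columns[1], columns[7]
        if destination == "00000000" and mask == "00000000":
            interfaces.add(iface)
    return interfaces


if __name__ == "__main__":
    route_file = Path("/proc/net/route")
    if route_file.exists():
        print(sorted(default_route_interfaces(route_file.read_text())))
    else:
        print("/proc/net/route not found (not a Linux host?)")
```

On the node in this thread, both enp5s0 and enp6s0 would be expected to show up in the result while they hold the default routes reported in the log.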

@nmcconom
Author

Any idea why it didn't like the more specific filters? We were able to use these with our Intel-based cards.

@nmcconom
Author

I added the vendors and devices selectors back in and that also worked, so it seems the plugin did not like the netdevice value in the drivers selector.

For background on why we had used it: we use vfio-pci for our Intel cards, and the Openshift documentation had pointed us at setting netdevice for Mellanox cards.

I0717 16:42:30.242700 1 manager.go:86] raw ResourceList: {
"resourceList": [
{
"resourceName": "sriov_client_side",
"resourcePrefix": "mellanox",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"pciAddresses": ["0000:05:00.0"]
}
},
{
"resourceName": "sriov_internet_side",
"resourcePrefix": "mellanox",
"selectors": {
"vendors": ["15b3"],
"devices": ["101e"],
"pciAddresses": ["0000:06:00.0"]
}
}
]
}
I0717 16:42:30.242817 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_client_side is [0xc00052a900]
I0717 16:42:30.242846 1 factory.go:211] *types.NetDeviceSelectors for resource sriov_internet_side is [0xc00052ac60]
I0717 16:42:30.242852 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox ResourceName:sriov_client_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000500e88 AdditionalInfo:map[] SelectorObjs:[0xc00052a900]} {ResourcePrefix:mellanox ResourceName:sriov_internet_side DeviceType:netDevice ExcludeTopology:false Selectors:0xc000500ea0 AdditionalInfo:map[] SelectorObjs:[0xc00052ac60]}]
I0717 16:42:30.242942 1 manager.go:217] validating resource name "mellanox/sriov_client_side"
I0717 16:42:30.242967 1 manager.go:217] validating resource name "mellanox/sriov_internet_side"
I0717 16:42:30.242973 1 main.go:62] Discovering host devices
I0717 16:42:30.320299 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 16:42:30.320385 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.320394 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.320403 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:03:00.0 02 Red Hat, Inc. Virtio 1.0 network device
I0717 16:42:30.320463 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.320866 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:06:00.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0717 16:42:30.321290 1 main.go:68] Initializing resource servers
I0717 16:42:30.321316 1 manager.go:117] number of config: 2
I0717 16:42:30.321338 1 manager.go:121] Creating new ResourcePool: sriov_client_side
I0717 16:42:30.321347 1 manager.go:122] DeviceType: netDevice
I0717 16:42:30.322346 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 16:42:30.322390 1 factory.go:124] device added: [identifier: 0000:05:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 16:42:30.322444 1 manager.go:156] New resource server is created for sriov_client_side ResourcePool
I0717 16:42:30.322460 1 manager.go:121] Creating new ResourcePool: sriov_internet_side
I0717 16:42:30.322464 1 manager.go:122] DeviceType: netDevice
I0717 16:42:30.322978 1 manager.go:138] initServers(): selector index 0 will register 1 devices
I0717 16:42:30.323000 1 factory.go:124] device added: [identifier: 0000:06:00.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0717 16:42:30.323027 1 manager.go:156] New resource server is created for sriov_internet_side ResourcePool
I0717 16:42:30.323035 1 main.go:74] Starting all servers...
I0717 16:42:30.323324 1 server.go:255] starting sriov_client_side device plugin endpoint at: mellanox_sriov_client_side.sock
I0717 16:42:30.324284 1 server.go:297] sriov_client_side device plugin endpoint started serving
I0717 16:42:30.324699 1 server.go:255] starting sriov_internet_side device plugin endpoint at: mellanox_sriov_internet_side.sock
I0717 16:42:30.325092 1 server.go:297] sriov_internet_side device plugin endpoint started serving
I0717 16:42:30.325115 1 main.go:79] All servers started.
I0717 16:42:30.325123 1 main.go:80] Listening for term signals
I0717 16:42:30.780189 1 server.go:117] Plugin: mellanox_sriov_client_side.sock gets registered successfully at Kubelet
I0717 16:42:30.780439 1 server.go:117] Plugin: mellanox_sriov_internet_side.sock gets registered successfully at Kubelet
I0717 16:42:30.780571 1 server.go:158] ListAndWatch(sriov_client_side) invoked
I0717 16:42:30.780621 1 server.go:171] ListAndWatch(sriov_client_side): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:05:00.0,Health:Healthy,Topology:nil,},},}
I0717 16:42:30.780561 1 server.go:158] ListAndWatch(sriov_internet_side) invoked
I0717 16:42:30.780719 1 server.go:171] ListAndWatch(sriov_internet_side): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:06:00.0,Health:Healthy,Topology:nil,},},}

@SchSeba
Collaborator

SchSeba commented Aug 19, 2024

That is because, in this case, the device plugin runs on a VM where only the VFs exist (and not the whole PF), so it's not a netdevice.

Please check the shiftonstack documentation. The Openshift documentation is for bare metal, where the VFs for Mellanox devices should be netdevice.
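
One way to read the results in this thread is that every selector present in a resource entry has to match for a device to be added. The Python sketch below models that reading. It is a simplified illustration, not the plugin's implementation: the names `Device` and `matches` are hypothetical, and it assumes the drivers selector is compared against the kernel driver name that the plugin logged for these VFs (driver: mlx5_core).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    pci_address: str
    vendor: str
    device: str
    driver: str


def matches(dev: Device, selectors: dict) -> bool:
    """Return True only if the device passes every selector that is set.

    Assumption: selectors are ANDed together, and an absent or empty
    selector places no restriction on the device.
    """
    attributes = {
        "vendors": dev.vendor,
        "devices": dev.device,
        "drivers": dev.driver,
        "pciAddresses": dev.pci_address,
    }
    return all(
        value in selectors[name]
        for name, value in attributes.items()
        if selectors.get(name)
    )


if __name__ == "__main__":
    # Values from the "device added" log line in this thread.
    vf = Device("0000:05:00.0", "15b3", "101e", "mlx5_core")

    with_netdevice = {
        "vendors": ["15b3"],
        "devices": ["101e"],
        "drivers": ["netdevice"],
        "pciAddresses": ["0000:05:00.0"],
    }
    without_drivers = {
        "vendors": ["15b3"],
        "devices": ["101e"],
        "pciAddresses": ["0000:05:00.0"],
    }
    print("drivers=netdevice:", matches(vf, with_netdevice))
    print("no drivers selector:", matches(vf, without_drivers))
```

Under that assumption, a drivers value of mlx5_core would be the one that matches these VFs, although nobody in this thread tested that configuration.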

@SchSeba
Collaborator

SchSeba commented Aug 19, 2024

let me know if I can close this issue :)
