# VirtIO VF PCI devices for vhost acceleration

Virtio VF PCIe devices can be attached to a guest VM using the vhost acceleration software stack, which enables live migration of guest VMs. This document describes how to achieve this.

# Prerequisites

1. Minimum hypervisor kernel version: Linux kernel 5.7 (for VFIO SR-IOV support); a quick check is shown below.
2. To use high availability (the additional vfe-vhostd-ha service, which keeps the datapath alive if vfe-vhostd crashes), this Linux [kernel patch](https://github.com/torvalds/linux/commit/ffed0518d871482e26c5826c0875bea6775446da) must be applied on the hypervisor.
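
To check the running hypervisor kernel version:

[host]# uname -r     # must report 5.7 or newer for VFIO SR-IOV support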

# Install vHost Acceleration Software Stack

The vhost acceleration software stack is built on the open-source, BSD-licensed DPDK.

## Install vhost acceleration software
The pre-built release package is installed to `/opt/mellanox/dpdk-vhost-vfe`. To use `vfe-vhost-cli`, make sure `/opt/mellanox/dpdk-vhost-vfe/bin` is in the `PATH` environment variable.
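
For example, to add it for the current shell session:

[host]# export PATH=/opt/mellanox/dpdk-vhost-vfe/bin:$PATH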

1. Clone the software source code, or install one of the previous [releases](https://github.com/Mellanox/dpdk-vhost-vfe/releases):

[host]# git clone https://github.com/Mellanox/dpdk-vhost-vfe

2. Build software:

[host]# apt-get install libev-dev                     # Debian/Ubuntu
[host]# yum install -y numactl-devel libev-devel      # RHEL/CentOS
[host]# meson build --debug -Denable_drivers=vdpa/virtio,common/virtio,common/virtio_mi,common/virtio_ha
[host]# ninja -C build install

3. Start the vfe-vhostd service and, optionally, the vfe-vhostd-ha service:

[host]# systemctl start vfe-vhostd
[host]# systemctl start vfe-vhostd-ha

## Install QEMU

Upstream QEMU 8.1 or later can be used, or the following [NVIDIA QEMU][3].

1. Clone NVIDIA QEMU sources:

[host]# git clone https://github.com/Mellanox/qemu -b stable-8.1-presetup

2. Build NVIDIA QEMU:

[host]# mkdir bin
[host]# cd bin
[host]# ../configure --target-list=x86_64-softmmu --enable-kvm
[host]# make -j24


# Steps for Using a Virtio-net or Virtio-blk PCIe VF in a Virtual Machine

1. Set the DPU nvconfig to enable a static virtio PF with SR-IOV VFs. Make sure the virtio PF exists on the host and SR-IOV is enabled; a hypothetical nvconfig sketch follows this step. For details, refer to [NVIDIA BlueField Virtio-net][1] and [NVIDIA BlueField-3 SNAP for NVMe and Virtio-blk][2].

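For reference, a hypothetical nvconfig sketch on the DPU is shown below. The mlxconfig parameter names and device path are assumptions, not taken from this document; consult the referenced documents for the authoritative settings (nvconfig changes require a reboot/power cycle to take effect):

[dpu]# mlxconfig -d /dev/mst/mt41692_pciconf0 set \
           VIRTIO_NET_EMULATION_ENABLE=1 VIRTIO_NET_EMULATION_NUM_PF=1 VIRTIO_NET_EMULATION_NUM_VF=16 \
           VIRTIO_BLK_EMULATION_ENABLE=1 VIRTIO_BLK_EMULATION_NUM_PF=1 VIRTIO_BLK_EMULATION_NUM_VF=16 \
           SRIOV_EN=1
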
2. Configure huge pages:

[host]# mkdir /dev/hugepages1G
[host]# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G
[host]# echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
[host]# echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
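
The 1G page allocation can be verified per NUMA node:

[host]# cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages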

3. Configure libvirt VM XML.

1. Open the VM’s configuration XML for editing:

virsh edit <domain name>

2. Change the top line to:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

3. Assign a memory amount and use 1GB page size for huge pages:

<memory unit='GiB'>4</memory>
<currentMemory unit='GiB'>4</currentMemory>
<memoryBacking>
<hugepages>
<page size='1' unit='GiB'/>
</hugepages>
</memoryBacking>

4. Set the memory access for the CPUs to be shared:

<cpu mode='custom' match='exact' check='partial'>
<model fallback='allow'>Skylake-Server-IBRS</model>
<numa>
<cell id='0' cpus='0-1' memory='4' unit='GiB' memAccess='shared'/>
</numa>
</cpu>

5. Add a virtio-net interface in VM XML:

<qemu:commandline>
<qemu:arg value='-chardev'/>
<qemu:arg value='socket,id=char0,path=/tmp/vhost-net0,server=on'/>
<qemu:arg value='-netdev'/>
<qemu:arg value='type=vhost-user,id=vhost1,chardev=char0,queues=4'/>
<qemu:arg value='-device'/>
<qemu:arg value='virtio-net-pci,netdev=vhost1,mac=00:00:00:00:33:00,vectors=10,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024,mq=on,disable-legacy=on,disable-modern=off'/>
</qemu:commandline>

6. Add a virtio-blk interface in the VM XML:

<qemu:commandline>
<qemu:arg value='-chardev'/>
<qemu:arg value='socket,id=char1,path=/tmp/vhost-blk0,server=on'/>
<qemu:arg value='-device'/>
<qemu:arg value='vhost-user-blk-pci,chardev=char1,page-per-vq=on,num-queues=4,disable-legacy=on,disable-modern=off'/>
</qemu:commandline>

4. Bind the virtio PF devices to the vfio-pci driver, enable SR-IOV, and create the VFs.

* Bind the virtio-net PF devices to vfio-pci driver and create 1 virtio-net VF:

[host]# lspci -s 0000:af:00.2
af:00.2 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01)

[host]# modprobe vfio vfio_pci
[host]# echo 1 > /sys/module/vfio_pci/parameters/enable_sriov
[host]# echo 0x1af4 0x1041 > /sys/bus/pci/drivers/vfio-pci/new_id
[host]# echo 0000:af:00.2 > /sys/bus/pci/drivers/vfio-pci/bind

[host]# lspci -vvv -s 0000:af:00.2 | grep "Kernel driver"
Kernel driver in use: vfio-pci
Enable SR-IOV and create a virtio-net VF(s):

[host]# echo 1 > /sys/bus/pci/devices/0000:af:00.2/sriov_numvfs

[host]# lspci | grep Virtio
af:00.2 Ethernet controller: Red Hat, Inc. Virtio network device
af:04.5 Ethernet controller: Red Hat, Inc. Virtio network device

* Bind the virtio-blk PF devices to vfio-pci driver and create 1 virtio-blk VF:

[host]# lspci -s 0000:af:00.3
af:00.3 Non-Volatile memory controller: Red Hat, Inc. Virtio block device (rev 01)

[host]# modprobe vfio vfio_pci
[host]# echo 1 > /sys/module/vfio_pci/parameters/enable_sriov
[host]# echo 0x1af4 0x1042 > /sys/bus/pci/drivers/vfio-pci/new_id
[host]# echo 0000:af:00.3 > /sys/bus/pci/drivers/vfio-pci/bind

[host]# lspci -vvv -s 0000:af:00.3 | grep "Kernel driver"
Kernel driver in use: vfio-pci
Enable SR-IOV and create the virtio-blk VF(s):

[host]# echo 1 > /sys/bus/pci/devices/0000:af:00.3/sriov_numvfs

[host]# lspci | grep Virtio
af:00.3 Non-Volatile memory controller: Red Hat, Inc. Virtio block device
af:05.1 Non-Volatile memory controller: Red Hat, Inc. Virtio block device


5. Provision the VF devices and add them to the vhostd service:

* For virtio-net VF:

1. Add a VF representor to the OVS bridge on the DPU:

[dpu]# virtnet query -p 0 -v 0 | grep sf_rep_net_device
"sf_rep_net_device": "en3f0pf0sf3000",
[dpu]# ovs-vsctl add-port ovsbr1 en3f0pf0sf3000

2. Add the PF device and wait for the virtio-net-controller to finish handling the PF FLR:

[host]# vfe-vhost-cli mgmtpf -a 0000:af:00.2

3. Provision the VF on the DPU (optional):

[dpu]# virtnet modify -p 0 -v 0 device -m 00:00:00:00:33:00

4. Add VF to vhostd service:

[host]# vfe-vhost-cli vf -a 0000:af:04.5 -v /tmp/vhost-net0

* For virtio-blk VF:

1. Create block device on the DPU:

On BlueField-3 SNAP:
[dpu]# spdk_rpc.py bdev_null_create Null0 1024 512
[dpu]# snap_rpc.py virtio_blk_controller_create --pf_id 0 --bdev Null0 --num_queues 1 --admin_q
For BlueField-3, a shared-memory-based recovery feature (environment variable VBLK_RECOVERY_SHM) is provided to substitute for --force_in_order.

On BlueField-2 SNAP:

[dpu]# spdk_rpc.py bdev_null_create Null0 1024 512
[dpu]# snap_rpc.py controller_virtio_blk_create --pf_id 0 --bdev_type spdk mlx5_0 --bdev Null0 --num_queues 1 --admin_q --force_in_order

2. Add virtio-blk PF to vhost acceleration service:

[host]# vfe-vhost-cli mgmtpf -a 0000:af:00.3
# Wait for the SNAP controller to finish handling the PF FLR

# On the DPU, a VF device controller must exist before adding the VF device to the vhostd service,
# and must be recreated after a PF or VF device is removed from the vhostd service or after the
# vhostd service restarts:
#   For BlueField-3, the VF controller is recreated automatically.
#   For BlueField-2, the VF controller must be recreated manually.
# Use "snap_rpc.py controller_list" to check for controller existence and create the controller if it is missing.
[dpu]# snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --vf_id 0 --bdev_type spdk --bdev Null0 --force_in_order

3. Add virtio-blk VF to vhostd service:

[host]# vfe-vhost-cli vf -a 0000:af:05.1 -v /tmp/vhost-blk0

6. Start the VM:

virsh start <domain-name>

7. Remove the devices.

When finished using the virtio devices, use the following commands to remove them from the vhostd service:

[host]# vfe-vhost-cli vf -r 0000:af:04.5
[host]# vfe-vhost-cli mgmtpf -r 0000:af:00.2

[host]# vfe-vhost-cli vf -r 0000:af:05.1
[host]# vfe-vhost-cli mgmtpf -r 0000:af:00.3

# Live Migration of a Virtual Machine

Prepare two identical hosts and perform the steps for adding the virtio devices to the vhostd service on both servers. Boot the virtual machine on one server and live-migrate it to the other with a command like:

[host]# virsh migrate --verbose --live --persistent gen-l-vrt-440-162-CentOS-7.4 qemu+ssh://gen-l-vrt-439/system --unsafe

# Vhost Acceleration Service

## Vfe-vhostd Service

This service communicates with the QEMU vhost-user front-end and programs the virtio VF through VFIO and the virtio driver.

* Start vfe-vhostd service:

[host]# systemctl start vfe-vhostd

* Stop vfe-vhostd service:

[host]# systemctl stop vfe-vhostd

* Check vfe-vhostd service log:

[host]# journalctl -u vfe-vhostd

## Vfe-vhostd-ha Service

Running the vfe-vhostd-ha service allows the datapath to persist if vfe-vhostd crashes. The vhostd and vhostd-ha services connect to each other through a Unix domain socket, so the vhostd-ha service can obtain state from the vhostd service and hand it back to the vhostd service for recovery.

* Start, stop, and inspect the log of the vfe-vhostd-ha service:

[host]# systemctl start vfe-vhostd-ha
[host]# systemctl stop vfe-vhostd-ha
[host]# journalctl -u vfe-vhostd-ha

## Hot upgrade

After installing the new software package, restart both services:

[host]# systemctl restart vfe-vhostd-ha
[host]# systemctl restart vfe-vhostd

## Vfe-vhost-cli RPC Commands

### Show vfe-vhostd version

[host]# vfe-vhost-cli version
{
"vfe-vhostd version": "DPDK 22.07.0-rc0.mlx_vfe_vdpa-1.2",
"vfe-vhostd build time": "May 20 2024 18:08:50",
"vfe-vhostd-ha version": "DPDK 22.07.0-rc0.mlx_vfe_vdpa-1.2",
"vfe-vhostd-ha build time": "May 20 2024 18:08:50",
"errno": 0,
"errstr": "Success"
}

### Add/Remove/List PF device

Before adding any VF device, its PF device must be added first. Before removing a PF device, all of its VF devices must be removed first.

[host]# vfe-vhost-cli mgmtpf -h
usage: vfe-vhost-cli mgmtpf [-h] [-a | -r | -l] [DEVICE]

positional arguments:
DEVICE Device specified as PCI "domain:bus:slot.func" syntax or "bus:slot.func" syntax. For device
add/remove to drivers, they may be referred to by interface name.

optional arguments:
-h, --help show this help message and exit
-a, --add add a pci device
-r, --remove remove a pci device
-l, --list list all PF devices

### Add/Remove/List VF device

[host]# vfe-vhost-cli vf -h
usage: vfe-vhost-cli vf [-h] [-a | -r | -l | -i | -d] [-o test_operation] [-b test_size_mode] [-v vhost_socket]
[-u vm_uuid]
[DEVICE]

positional arguments:
DEVICE Device specified as PCI "domain:bus:slot.func" syntax or "bus:slot.func" syntax. For device
add/remove to drivers, they may be referred to by interface name.

optional arguments:
-h, --help show this help message and exit
-a, --add add a pci device
-r, --remove remove a pci device
-l, --list list all VF devices of PF device
-i, --info show specified VF device information
-d, --debug test VF device debug
-v vhost_socket Vhost socket file name
-u vm_uuid Virtual machine UUID
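
For example, assuming the list and info options take the PF or VF PCI address as the DEVICE argument (an assumption based on the help text above), the devices added earlier could be inspected with:

[host]# vfe-vhost-cli vf -l 0000:af:00.2     # list VF devices of the virtio-net PF
[host]# vfe-vhost-cli vf -i 0000:af:04.5     # show information for one VF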


# QEMU

## Parameters for QEMU emulated virtio device

### Direct doorbell mapping
QEMU's direct doorbell mapping feature reduces VM exits when the guest driver rings a VQ doorbell. To use this feature, the virtio device must work in modern mode and page-per-vq must be set.
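
These requirements correspond to device properties already used in the VM XML above, for example:

disable-legacy=on,disable-modern=off,page-per-vq=on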

### Virtio-net and Virtio-blk properties
The offload properties (host_tso4, csum, ...) and VQ properties (tx_queue_size, rx_queue_size, ...) should be set appropriately in QEMU. The full property list can be obtained with the following QEMU commands:

[host]# qemu-system-x86_64 -device virtio-net,help
[host]# qemu-system-x86_64 -device vhost-user-blk-pci,help

### Vectors and MSIX
For virtio-net, the guest driver assigns each TX queue and RX queue a dedicated MSI-X interrupt **ONLY IF** the device has enough MSI-X resources. Make sure the QEMU virtio-net property **vectors = 2*N_qpair+2** to give the QEMU-emulated VF device enough MSI-X resources.
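
For example, the VM XML above uses 4 queue pairs (queues=4 on the netdev, mq=on on the device), so:

vectors = 2*4 + 2 = 10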

### Jumbo MTU
Configure the virtio-net device MTU on the DPU and set the QEMU virtio-net device property `host_mtu` to the same value.
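
A minimal sketch, assuming an MTU of 9000 has been configured for the device on the DPU (9000 is an illustrative value, not taken from this document): append host_mtu to the existing virtio-net-pci device arguments in the VM XML:

<qemu:arg value='virtio-net-pci,netdev=vhost1,mac=00:00:00:00:33:00,vectors=10,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024,mq=on,disable-legacy=on,disable-modern=off,host_mtu=9000'/>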

## Use presetup to reduce live migration downtime

QEMU can send the virtio device state in an early live migration stage, so the vhostd backend can apply the device state before the VM is stopped. This significantly reduces live migration downtime. Only [NVIDIA QEMU][3] (branch stable-8.1-presetup) has this feature.

Add `x-early-migration=on` in VM XML:

<qemu:arg value='-device'/>
<qemu:arg value='virtio-net-pci,netdev=vhost1,mac=00:00:00:00:33:00,vectors=10,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024,mq=on,disable-legacy=on,disable-modern=off,x-early-migration=on'/>

# Isolate DPA cores for both virtio-net controller and virtio-blk controller

## Limit the Arm cores SNAP can use

Set the APP_ARGS environment variable in doca_snap.yaml:

env:
- name: APP_ARGS
value: "-m 0xfff"


## Separate DPU resources between virtio-net and virtio-blk

Example: virtio-blk (3 cores) and virtio-net (8 cores).

In the virtio-blk controller SNAP container YAML file /etc/kubelet.d/doca_snap.yaml:

env:
- name: dpa_virtq_split_core_mask
value: "0xffffffffffff"

In the virtio-net controller configuration file /opt/mellanox/mlnx_virtnet/virtnet.conf:

{
"dpa_core_start": 3,
"dpa_core_end": 10
}

# Troubleshooting tips

* Error on host: `Can't bind virtio device to VFIO`.

Solution: Add `intel_iommu=on iommu=pt` to the kernel boot parameters (visible in /proc/cmdline); a sketch for applying kernel parameters follows these tips.

* Error on host: `vfio-pci 0001:86:00.3: can't enable 127 VFs (bus 87 out of range of [bus 86])`

Solution: Add `pci=realloc,assign-busses` to the kernel boot parameters (visible in /proc/cmdline).
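
A minimal sketch of applying such kernel parameters (the exact procedure is distro dependent; the paths and commands below are assumptions for a GRUB-based system):

[host]# vi /etc/default/grub                       # append the parameters to GRUB_CMDLINE_LINUX
[host]# grub2-mkconfig -o /boot/grub2/grub.cfg     # or "update-grub" on Debian/Ubuntu
[host]# reboot
[host]# cat /proc/cmdline                          # verify the parameters are active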

# Reference
For NVIDIA BlueField virtio-net PCIe device and NVIDIA BlueField-3 virtio-blk configuration, please refer to the latest documents in [BlueField DPUs / SuperNICs & DOCA](https://docs.nvidia.com/networking/dpu-doca/index.html#doca):

* [NVIDIA BlueField Virtio-net][1]
* [NVIDIA BlueField-3 SNAP for NVMe and Virtio-blk][2]

[1]: https://docs.nvidia.com/networking/display/bluefieldvirtionetv190
[2]: https://docs.nvidia.com/networking/display/bluefield3snap440
[3]: https://github.com/Mellanox/qemu
