vDPA High Availability #58

Merged
merged 26 commits into from
Mar 7, 2024

Conversation

Collaborator

@Ch3n60x Ch3n60x commented Feb 1, 2024

This PR introduces vDPA high availability.

@Ch3n60x Ch3n60x force-pushed the vdpa_ha_por branch 3 times, most recently from 6cb5c9d to 3075977 Compare February 7, 2024 02:26
@Ch3n60x Ch3n60x force-pushed the vdpa_ha_por branch 4 times, most recently from dc6364d to 4dc3ad0 Compare February 21, 2024 02:23
struct virtio_ha_pf_dev_list *list = &hs.pf_list;
struct virtio_ha_pf_dev *dev;

if (msg->nr_fds != 2)
Collaborator

should return error in this situation

Collaborator Author

done

}
}

if (!found)
Collaborator

should return err here also

Collaborator Author

done

const struct virtio_pf_ctx *pf_ctx = (const struct virtio_pf_ctx *)ctx;

memcpy(&cached_ctx.pf_name, pf, sizeof(struct virtio_dev_name));
cached_ctx.vfio_group_fd = pf_ctx->vfio_group_fd;
Collaborator

How about using devargs to pass the parameter?
For example: rte_eal_hotplug_add("pci", pf_name, "vdpa=2");

Collaborator Author

yes it's an option and was discussed before. Parav does not want to add more devargs. I am fine with either option.

ret = virtio_ha_vf_ctx_set(&vf_list[j].vf_name, vf_ctx);
if (ret < 0) {
RTE_LOG(ERR, VDPA, "Failed to set vf ctx in vf driver\n");
return -1;
Collaborator

should free vf_ctx and vf_list if failed

Collaborator Author

done
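
One common way to satisfy the cleanup request above is a single exit label that every path goes through. This is a minimal sketch of that pattern only, not the code from the PR: `demo_restore_vfs`, `demo_ctx_set`, and the integer buffers are hypothetical stand-ins for the real function, `virtio_ha_vf_ctx_set()`, and the real `vf_list`/`vf_ctx` types.

```c
#include <stdlib.h>

/* Hypothetical stand-in for virtio_ha_vf_ctx_set(); fails when fail != 0. */
static int demo_ctx_set(int fail)
{
	return fail ? -1 : 0;
}

/* Allocates two buffers, runs a step that may fail, and releases both
 * buffers on every path so the error return does not leak memory. */
static int demo_restore_vfs(int fail)
{
	int ret = -1;
	int *vf_list = NULL;
	int *vf_ctx = NULL;

	vf_list = calloc(4, sizeof(*vf_list));
	if (vf_list == NULL)
		goto out;

	vf_ctx = calloc(4, sizeof(*vf_ctx));
	if (vf_ctx == NULL)
		goto out;

	if (demo_ctx_set(fail) < 0)
		goto out; /* the error path shares the cleanup below */

	ret = 0;
out:
	free(vf_ctx);  /* free(NULL) is a no-op, so partial setup is fine */
	free(vf_list);
	return ret;
}
```
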

host_phys_addrs[i] = rte_mem_virt2phy((void *)(uintptr_t)vhost_reg->host_user_addr);
if (host_phys_addrs[i] == RTE_BAD_IOVA) {
DRV_LOG(ERR, "virt2phy translate failed");
return -1;
Collaborator

free(cur_mem);

Collaborator Author

done

This commit adds a new driver common library for virtio named virtio_ha.
The library mainly serves high availability (HA) in the vDPA use case.
It can be used by the vDPA application, the virtio PF/VF drivers, and
the HA application that will be introduced later.

Basic APIs for the vDPA application and the PF/VF drivers are introduced.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds the message definitions for the IPC client and server.
Related APIs to alloc/free/reset/send/recv IPC messages are also
defined.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds the IPC client init API, along with disconnection
detection and a reconnect mechanism.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
@Ch3n60x Ch3n60x force-pushed the vdpa_ha_por branch 2 times, most recently from 7e1bbd7 to 6fedcf7 Compare February 27, 2024 03:52
return 0;
}

priv->ctx_stored = true;
Collaborator

priv->ctx_stored should be false after dev_close

Collaborator Author

done

This commit adds a new high availability application for vDPA
application. This HA application is mainly for storing device context
from vDPA application. It has an IPC server and in-memory database.
The HA application uses the HA library for the IPC server and in-memory
database.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds definition of IPC client APIs. These APIs are mainly
for query/store/remove context from/to HA service.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds API definition of set/unset context in PF/VF driver and
new register API for drivers to register set/unset callbacks to HA
library.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds the HA restart module to the vDPA application. The
restart flow works with or without the HA service. When the HA service
is alive, the application queries information from it and tries to
restore devices. When the HA service is not alive, the restore is
skipped when calling virtio_ha_pf_list_query().

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds several APIs in EAL library and PCI bus to restore
VFIO container/group/device FD. VFIO layer reset is also avoided in
new restore API.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit adds a new parameter, the VFIO device FD, to
virtio_pci_dev_alloc(). When the VFIO device FD is set to -1, device
alloc behaves the same as before. Otherwise, the device FD is restored
to EAL and the PCI bus.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
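
The `-1` convention described in the commit above can be pictured as follows. This is a simplified sketch under assumptions: `demo_dev` and `demo_dev_alloc` are hypothetical names, and the real `virtio_pci_dev_alloc()` signature and device structure are not shown in this conversation.

```c
#include <stdbool.h>

/* Hypothetical device record; not the real virtio PCI device layout. */
struct demo_dev {
	int vfio_dev_fd;
	bool restored; /* true when an existing FD was handed in */
};

/* vfio_dev_fd == -1 selects the normal allocation path; any other value
 * is treated as a saved FD that should be reused instead of reopened. */
static void demo_dev_alloc(struct demo_dev *dev, int vfio_dev_fd)
{
	if (vfio_dev_fd == -1) {
		dev->vfio_dev_fd = -1; /* a real driver would open the device here */
		dev->restored = false;
	} else {
		dev->vfio_dev_fd = vfio_dev_fd;
		dev->restored = true;
	}
}
```
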
This commit adds support for storing and removing device context in
the virtio PF driver. When device context is set by the application,
the driver restores from that context and does not store context.
Otherwise, the driver initializes everything and stores the context.
Store and restore happen in driver probe; removal happens in driver
remove.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit redefines the DMA memory region in the virtio vdpa driver.
Initially, the driver used vhost memory regions. However, the vhost
memory region definition has parts the driver does not need, and the
host physical address is needed to check whether the old and new
regions are the same when QEMU restarts. For that purpose, a new DMA
memory region structure is defined in this commit.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
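
To illustrate why the host physical address matters for the old/new comparison described above, here is a minimal sketch. The structure and its field names are assumptions made for illustration; the actual DMA memory region definition in the driver is not shown in this conversation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region layout; field names are illustrative only. */
struct demo_dma_region {
	uint64_t guest_phys_addr;
	uint64_t host_phys_addr;
	uint64_t size;
};

/* Two region tables are treated as the same only if every entry matches,
 * including the host physical address. */
static bool demo_regions_same(const struct demo_dma_region *a, size_t na,
			      const struct demo_dma_region *b, size_t nb)
{
	if (na != nb)
		return false;

	for (size_t i = 0; i < na; i++) {
		if (a[i].guest_phys_addr != b[i].guest_phys_addr ||
		    a[i].host_phys_addr != b[i].host_phys_addr ||
		    a[i].size != b[i].size)
			return false;
	}
	return true;
}
```

If the tables match after a QEMU restart, an existing DMA mapping could in principle be kept; if they differ, the mapping has to be rebuilt.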
This commit supports device context store/remove/restore in the
VF driver to achieve HA.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
When the HA service is alive, it holds the DMA mapping of the hugepages.
There is one corner case in which QEMU could fail to restart:

1. vDPA application crashes
2. QEMU crashes
3. vDPA application restarts and restores all device context
4. QEMU tries to restart but fails because there are not enough
hugepages (the HA service is still holding them).

To resolve this potential issue, this commit adds a connection timeout
mechanism to the vhost library and the vdpa driver. It cleans up the DMA
mapping of the hugepages when the vDPA application cannot connect to
QEMU for 3 seconds.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
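
The timeout rule in the commit above can be sketched as a small polled timer. This is an illustration under assumptions, not the vhost or driver code: `demo_conn_timer`, `demo_timer_poll`, and the millisecond clock passed in by the caller are all hypothetical; only the 3-second threshold comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CONN_TIMEOUT_MS 3000 /* 3 seconds, as stated in the commit */

/* Hypothetical per-device timer state. */
struct demo_conn_timer {
	uint64_t start_ms;  /* when the wait for QEMU began */
	bool connected;     /* set once QEMU connects */
	bool mapping_held;  /* DMA mapping of the hugepages still in place */
};

/* Called periodically with a monotonic timestamp. Returns true when this
 * call released the mapping because the connection timed out. */
static bool demo_timer_poll(struct demo_conn_timer *t, uint64_t now_ms)
{
	if (t->connected || !t->mapping_held)
		return false;
	if (now_ms - t->start_ms < DEMO_CONN_TIMEOUT_MS)
		return false;

	t->mapping_held = false; /* real code would unmap the DMA region here */
	return true;
}
```
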
Ch3n60x and others added 12 commits March 1, 2024 06:08
When the vDPA application stores or removes context in the HA service,
the HA service may not be alive. When the HA service becomes alive
after the store or remove finishes, a mechanism is needed to sync the
context to the HA service. This commit introduces a cache layer on the
IPC client side that saves all context information in the HA library,
plus a sync mechanism that syncs the context when the HA library
detects that the HA service is alive again.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
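
The cache-then-sync idea described above might look roughly like the sketch below. It is a simplified model under assumptions: every `demo_` name is hypothetical, device context is reduced to an integer ID, and `msgs_sent` stands in for real IPC sends to the HA service.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SLOTS 8

enum demo_op { DEMO_NONE, DEMO_STORE, DEMO_REMOVE };

struct demo_slot {
	bool used;
	int dev_id;
	enum demo_op pending; /* operation not yet seen by the HA service */
};

struct demo_client {
	struct demo_slot slots[DEMO_SLOTS];
	bool service_alive;
	int msgs_sent; /* counts messages a real client would send over IPC */
};

/* Finds the slot for dev_id, optionally claiming a free one. */
static struct demo_slot *demo_find(struct demo_client *c, int dev_id, bool alloc)
{
	struct demo_slot *free_slot = NULL;

	for (size_t i = 0; i < DEMO_SLOTS; i++) {
		if (c->slots[i].used && c->slots[i].dev_id == dev_id)
			return &c->slots[i];
		if (!c->slots[i].used && free_slot == NULL)
			free_slot = &c->slots[i];
	}
	if (!alloc || free_slot == NULL)
		return NULL;
	free_slot->used = true;
	free_slot->dev_id = dev_id;
	free_slot->pending = DEMO_NONE;
	return free_slot;
}

/* Sends the pending operation if the service is reachable. */
static void demo_flush(struct demo_client *c, struct demo_slot *s)
{
	if (!c->service_alive || s->pending == DEMO_NONE)
		return;
	c->msgs_sent++;
	if (s->pending == DEMO_REMOVE)
		s->used = false; /* entry leaves the cache once removal is synced */
	s->pending = DEMO_NONE;
}

static int demo_store(struct demo_client *c, int dev_id)
{
	struct demo_slot *s = demo_find(c, dev_id, true);

	if (s == NULL)
		return -1;
	s->pending = DEMO_STORE;
	demo_flush(c, s);
	return 0;
}

static int demo_remove(struct demo_client *c, int dev_id)
{
	struct demo_slot *s = demo_find(c, dev_id, false);

	if (s == NULL)
		return -1;
	s->pending = DEMO_REMOVE;
	demo_flush(c, s);
	return 0;
}

/* Called when the client detects that the service is alive again. */
static void demo_on_reconnect(struct demo_client *c)
{
	c->service_alive = true;
	for (size_t i = 0; i < DEMO_SLOTS; i++)
		if (c->slots[i].used)
			demo_flush(c, &c->slots[i]);
}
```

In this model a store followed by a remove while the service is down results in a single remove message after reconnect, which the service would have to tolerate for an entry it never saw.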
Currently, application quit and RPC device removal both call driver
cleanup and remove, so the device context is removed in both cases.
This is not the expected behavior: when the application quits, device
context should not be removed. Therefore, this commit adds two
different behaviors for application quit and RPC device removal.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
Currently, application quit and RPC device removal both call driver
remove, so the device context is removed in both cases. This is not
the expected behavior: when the application quits, device context
should not be removed. Therefore, this commit adds two different
behaviors for application quit and RPC device removal.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
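
The distinction drawn in the two commits above can be sketched as a removal routine that takes the reason as a parameter. This is an assumption-laden illustration: `demo_remove_reason`, `demo_dev_state`, and `demo_dev_remove` are hypothetical names, and the PR may separate the two paths differently.

```c
#include <stdbool.h>

/* Hypothetical reasons for tearing a device down. */
enum demo_remove_reason {
	DEMO_APP_QUIT,   /* application exits; HA context must survive */
	DEMO_RPC_REMOVE  /* user removed the device via RPC; drop context */
};

struct demo_dev_state {
	bool driver_bound;
	bool ctx_in_ha; /* context currently stored in the HA service */
};

/* Driver resources are released in both cases, but the stored context
 * is only dropped for an explicit RPC removal. */
static void demo_dev_remove(struct demo_dev_state *dev,
			    enum demo_remove_reason reason)
{
	dev->driver_bound = false;
	if (reason == DEMO_RPC_REMOVE)
		dev->ctx_in_ha = false;
}
```
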
Saving only the VFIO group and device fd of the PF causes a restore
problem: after the container fd is released, the group status is still
"in container set", so DPDK cannot rebind the group to another
container. Therefore, saving the global container fd and the
corresponding DMA map is needed for PF restore.

This commit adds the store/remove/restore behavior for the PF. With
some new APIs introduced in the HA lib, the EAL layer can store the
global container fd and the DMA map for the PF. The container fd is
kept in the HA service as long as the service is alive. The DMA map is
stored by the EAL layer and released upon HA socket disconnection,
which means the vhost-user service crashed or gracefully restarted.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
The IOMMU domain should be freed only when no VF is in the container.
This commit fixes the incorrect free of the IOMMU domain.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
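
The rule stated in the commit above is essentially reference counting on the container. A minimal sketch, assuming hypothetical names (`demo_container`, `demo_vf_attach`, `demo_vf_detach`) and a boolean in place of the real IOMMU domain handle:

```c
#include <stdbool.h>

/* Hypothetical container state shared by all VFs attached to it. */
struct demo_container {
	int nr_vf;             /* VFs currently in the container */
	bool domain_allocated; /* stands in for the real IOMMU domain */
};

static void demo_vf_attach(struct demo_container *c)
{
	if (c->nr_vf == 0)
		c->domain_allocated = true; /* first VF creates the domain */
	c->nr_vf++;
}

static void demo_vf_detach(struct demo_container *c)
{
	if (c->nr_vf == 0)
		return; /* nothing attached; ignore a stray detach */
	c->nr_vf--;
	if (c->nr_vf == 0)
		c->domain_allocated = false; /* last VF frees the domain */
}
```
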
The vhost library has a mechanism that reallocates vq resources when
the NUMA node of the VM memory differs from that of the current vq
memory. The reallocation can take a long time (more than 200 ms as
measured), which makes the vDPA application restart very slowly. This
commit therefore deletes it.

The impact should be minimal, as it does not affect data path
performance when using vDPA. It could increase the dirty page logging
time, because the DMA for dirty page logging could be cross-NUMA. So it
could increase the migration time, but not the migration downtime,
which is more important.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
This commit supports systemd service for virtio-ha application. Related
service.in file is added and the service will be installed to systemd
folder (e.g., /usr/lib/systemd/system/) as service vfe-vhostd-ha.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
Install:
/usr/lib/systemd/system/vfe-vhostd-ha.service

Meson build:
-Denable_drivers += ,common/virtio_ha

Signed-off-by: Yajun Wu <yajunw@nvidia.com>
On a graceful restart, traffic should keep flowing when SIGTERM is
received. Therefore, the device cannot be removed.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
When restoring from HA, it was noticed that sending traffic to the VM
during a virtio reset corrupts the VM driver or kernel. So skip the
reset when restoring.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
DPDK already sends logs to syslog, so there is no need to let systemd
forward them. The HA feature needs a fixed vf-token.

Signed-off-by: Yajun Wu <yajunw@nvidia.com>
When multiple VFs need to be restored, the per-VF restore time is
longer because QEMU clean-up and device config take more time per VF.
Therefore, add a sleep interval of 4 seconds per VF so that the
downtime of each VF stays stable.

Signed-off-by: Chenbo Xia <chenbox@nvidia.com>
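
The staggered restore described above might be structured like this sketch. It rests on assumptions: the function names are hypothetical, the sleep is injected as a callback so the example does not actually block, and sleeping between VFs rather than after every VF is a guess, since the commit message does not say where the interval is applied.

```c
#define DEMO_VF_RESTORE_INTERVAL_SEC 4 /* interval stated in the commit */

/* Test doubles: record what a real restore and a real sleep would do. */
static unsigned int demo_slept_sec;
static int demo_restored;

static void demo_fake_sleep(unsigned int sec)
{
	demo_slept_sec += sec;
}

static int demo_fake_restore(int vf_idx)
{
	(void)vf_idx;
	demo_restored++;
	return 0;
}

/* Restores VFs one at a time, pausing between consecutive VFs so each
 * VF's restore does not compete with the previous one. */
static int demo_restore_all(int nr_vf, int (*restore_one)(int),
			    void (*do_sleep)(unsigned int))
{
	for (int i = 0; i < nr_vf; i++) {
		if (restore_one(i) < 0)
			return -1;
		if (i + 1 < nr_vf)
			do_sleep(DEMO_VF_RESTORE_INTERVAL_SEC);
	}
	return 0;
}
```
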
It was noticed that when using iperf to test multiple queues, a restart
with HA may cause some queues to hang. This issue can be fixed by
sending a per-vq interrupt to the guest in dev_config.

Signed-off-by: Kailiang Zhou <kailiangz@nvidia.com>
@kailiangz1 kailiangz1 merged commit 44edd6f into Mellanox:main Mar 7, 2024