
Implement nvme driver #1284

Closed

Conversation

JanBraunwarth

I implemented an NVMe driver for OSv as part of my bachelor thesis and would like to share the results here.
I don't think it's ready for merging, since the performance is far from the theoretically possible IOPS compared to an Ubuntu VM, but I hope it can still serve as a base for future implementations.

The patch also fixes a simple bug that prevented addressing data at an offset greater than 2GB through the block cache. The cause: the block offset was stored in an int and later used to reconstruct the byte offset. The simple fix in buf.h should be integrated even if the rest of the patch isn't.
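
A minimal sketch of the failure mode (illustrative only, not the actual buf.h
code): the byte offset is reconstructed with int arithmetic, which overflows
once it exceeds 2 GB.

#include <cstdio>

int main() {
    int blkno = 5 * 1024 * 1024;                // block 5242880 => byte offset 2.5 GiB with 512B blocks
    long long broken = blkno * 512;             // int * int overflows past 2^31 (undefined behavior)
    long long fixed  = (long long)blkno * 512;  // widen before multiplying => correct 2684354560
    printf("broken=%lld fixed=%lld\n", broken, fixed);
    return 0;
}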

I tested with fio and the io-test included in the patch, both on a QEMU-emulated NVMe and via PCI passthrough. On real hardware, interrupts lead to noticeable overhead, resulting in better IOPS on the QEMU-emulated NVMe. Interrupt coalescing reduces the number of interrupts, which results in similar IOPS, but this can only be seen as a workaround. I'm not sure what goes wrong here.

Another factor in the disappointing performance is the block cache. Splitting every read/write into 512-byte requests is not advisable: most modern NVMe drives internally use 4k blocks, and splitting into 512-byte blocks results in 8+ times worse IOPS, especially since all the resulting requests are executed one after another. I included a read/write path that doesn't use the block cache and doesn't copy the data; it therefore does not work on stack addresses (virt_to_phys fails on these).
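
To put a number on the amplification: a single 4 KiB read becomes 8 sequential
512 B sub-requests, so even if each sub-request completed as quickly as one
full-sized request, the achievable 4 KiB IOPS drop to roughly 1/8, and every
sub-request additionally pays its own submission and completion (interrupt)
cost.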

The implemented driver allows OSv to boot from an emulated NVMe device and
to handle additional NVMe devices, whether emulated or passed through from
the host. Booting directly from an NVMe via PCI passthrough still needs to
be tested.

The NVMe driver creates nvme_queue_pairs to interact with the device
controller. An nvme_queue_pair manages a submission queue and the
corresponding completion queue.
The driver registers every namespace as a device and forwards requests
to the queues. Namespace 1 on the first NVMe drive is named "vblk0".
Further devices are named nvmeXnY, where X is the driver instance id
starting at 0 and Y is the namespace id starting at 1 (for example,
namespace 2 on the first controller shows up as nvme0n2).

Read/write requests on the device file go through the block cache layer.
This can reduce performance quite a bit, since the block cache splits
every request into 512B-sized sequential requests. Setting
NVME_DIRECT_RW_ENABLED in drivers/nvme.hh disables the block cache.
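
A hedged sketch of what the cache-bypassing path requires from callers
(memory::alloc_phys_contiguous_aligned is the OSv allocator mentioned later
in this review; the exact signatures are assumptions to verify):

// Illustrative only: the data buffer must be resolvable by virt_to_phys, so it
// must not live on the stack; a physically contiguous heap buffer works.
void* buf = memory::alloc_phys_contiguous_aligned(4096, 4096);  // assumed (size, alignment) order
// ... fill buf and submit it through the direct (NVME_DIRECT_RW_ENABLED) path ...
memory::free_phys_contiguous_aligned(buf);                      // assumed matching free helper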

All queues use MSI-X; one interrupt vector is registered for every queue.
There is very noticeable overhead when using PCI passthrough. It is reduced
by using interrupt coalescing, but this needs to be investigated further.

Added options to ./scripts/run.py (example invocations below):
--nvme to start OSv on an NVMe emulated by QEMU
--second-nvme-image to attach an additional image as an NVMe device
--pass-pci to pass through a PCI device from the host;
    the device needs to be bound to vfio-pci
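
For illustration, invocations might look like the following; the exact argument
forms for the two new options (image path, PCI address) are assumptions to be
checked against run.py:

./scripts/run.py --nvme
./scripts/run.py --second-nvme-image ./extra-disk.img
./scripts/run.py --pass-pci 0000:01:00.0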

drivers/blk_ioctl.hh implements the BLKGETSIZE64 and BLKFLSBUF
ioctls, which are used by fio.
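
For context, this is how fio-style code typically uses these ioctls on Linux
(a self-contained sketch; /dev/vblk0 is the device name from this patch's
naming scheme):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   // BLKGETSIZE64, BLKFLSBUF
#include <cstdint>
#include <cstdio>

int main() {
    int fd = open("/dev/vblk0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    uint64_t size_bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &size_bytes) == 0) {   // total device size in bytes
        printf("device size: %llu bytes\n", (unsigned long long)size_bytes);
    }
    ioctl(fd, BLKFLSBUF, 0);                           // flush buffered data for the device
    close(fd);
    return 0;
}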

drivers/io-test.cc is a simple IOPS test that can be activated by
building with conf_drivers_io_test=1; it runs during initialization
of the NVMe device.
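
Assuming OSv's usual way of passing conf_* flags to scripts/build, enabling the
test might look like this (unverified):

./scripts/build conf_drivers_io_test=1 image=native-example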

Signed-off-by: Jan Braunwarth <jan.braunwarth@gmail.com>

wkozaczuk commented Dec 11, 2023

Hi Jan,

I am very excited to see your contribution and very eager to eventually make it part of OSv!

Let me make some general comments, ask some questions here, and then start a high-level review.

Could you provide any performance test results and ways to replicate them? I see you mention fio and your io-test, but it would be interesting to see concrete numbers and/or output from these and a description of the test setup.

In general, as long as the driver is correct, gets reviewed, and any potential issues are addressed, I think we should merge it regardless of performance, unless it is unusable or some deeper design issues are found in your implementation.

Honestly, I am not surprised by OSv's poor disk I/O performance (see #763, #450, b5eadc3 that was rolled back, and this https://groups.google.com/g/osv-dev/c/0qLTPtHbNYg/m/-kbc3xWiBgAJ), especially given the block cache issues you came across, which I am familiar with. I wonder if you have had a chance to test your NVMe driver with ROFS (fs=rofs when building an image), which implements its own read-around cache (https://github.com/cloudius-systems/osv/blob/master/fs/rofs/rofs_cache.cc#L18) that bypasses the kernel block cache, interacts with the driver directly via strategy (bio->bio_dev->driver->devops->strategy(bio)), and reads 32K at once.
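
For reference, a read through strategy() roughly looks like this (a sketch from
memory of OSv's bio helpers; the exact field and helper names should be checked
against the rofs_cache.cc code linked above):

struct bio* bio = alloc_bio();
bio->bio_cmd    = BIO_READ;
bio->bio_dev    = dev;       // the block device, e.g. the one registered as vblk0
bio->bio_data   = buffer;    // physically resolvable buffer (not a stack address)
bio->bio_offset = offset;    // byte offset on the device
bio->bio_bcount = bytes;     // transfer size, e.g. 32K for the ROFS read-around cache
bio->bio_dev->driver->devops->strategy(bio);
int error = bio_wait(bio);   // wait for the driver to complete the bio
destroy_bio(bio);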

Regarding the tests, there are a number of disk I/O tests under tests/misc*c that may be of use (misc-zfs-, misc-bdev*, misc-concurrent-io.cc, misc-fs-stress.cc, and possibly others).

I suggest you create a separate PR to address the buf.h bug. Can you describe how one can replicate it? What would the crash look like?

Finally, have you had a chance to test this driver on a Nitro EC2 instance? For that, you would probably need to apply this ENA driver PR (see #1283).

@wkozaczuk (Collaborator) left a comment:

I have left some comments and questions. Nothing major.

wkozaczuk (Collaborator):

Could you please add a copyright statement to all your files? See this one for example:

/*
 * Copyright (C) 2017 Waldemar Kozaczuk
 * Inspired by original MFS implementation by James Root from 2015
 *
 * This work is open source software, licensed under the terms of the
 * BSD license as described in the LICENSE file in the top-level directory.
 */

wkozaczuk (Collaborator):

Could you please describe in comments what this test does and how one can run it? How different is it from various disk I/O-related tests located under tests/misc-*cc?

I also suggest we move this to the tests/ folder.

max_ios = 1 << 30;

printf("Start IO test dev : %s, IO size : %d\n",dev->name,io_size);
sched::thread *t;
wkozaczuk (Collaborator):

Instead of the OSv internal API for creating threads, I would suggest using the standard std::thread API so that the test can be built and run on Linux and we can easily compare the results. The same applies to other OSv-specific APIs (memory::alloc_phys_contiguous_aligned), unless there is a very specific reason to use them because the test is OSv-specific or exercises an OSv internal API.
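
For example, the worker threads could be spawned portably like this (a minimal
sketch; run_io_worker and num_threads stand in for whatever the test actually
does per thread):

#include <thread>
#include <vector>

void run_io_worker(int id);   // hypothetical per-thread I/O loop of the test

void run_all_workers(int num_threads) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; i++) {
        workers.emplace_back(run_io_worker, i);  // std::thread instead of sched::thread
    }
    for (auto& t : workers) {
        t.join();                                // wait for all workers to finish
    }
}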

sched::thread::wait_until([this] {
bool have_elements = this->completion_queue_not_empty();
if (!have_elements) {
this->enable_interrupts();
wkozaczuk (Collaborator):

Why do we need to enable interrupts here and then only disable them if have_elements is true?

@JanBraunwarth (Author) replied Dec 13, 2023:

The condition gets run after we have finished handling completion queue entries (CQEs) at the beginning of the loop in req_done (see wait_for_completion_queue_entries()). Interrupts should be disabled while we handle CQEs, so after we have finished handling them we need to enable interrupts again. If have_elements is true, we will continue in req_done, so we need to disable interrupts again.

At least for the disk I tested, the second check was unnecessary: if new CQEs arrive, the NVMe controller will send an interrupt as soon as the interrupt vector gets unmasked. I left it in because I wasn't totally sure that this is standard behavior.

The interrupt handling is analogous to the one in virtio: https://github.com/cloudius-systems/osv/blob/f11c0210d50e2534932beaae03360bd5b4810e81/drivers/virtio.cc#L156C1-L176C2
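
In other words, the pattern being described is roughly the following (a sketch
reusing the names from the hunk above, not a verbatim copy of the patch):

sched::thread::wait_until([this] {
    bool have_elements = this->completion_queue_not_empty();
    if (!have_elements) {
        this->enable_interrupts();
        // Re-check after unmasking: a completion that arrived while the vector
        // was masked might not trigger an interrupt on every controller, so
        // handle it now and mask again before going back to req_done.
        have_elements = this->completion_queue_not_empty();
        if (have_elements) {
            this->disable_interrupts();
        }
    }
    return have_elements;
});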

});
}

int nvme_queue_pair::map_prps(u16 cid, void* data, u64 datasize, u64* prp1, u64* prp2)
wkozaczuk (Collaborator):

What does this method do? prps is somewhat cryptic. Maybe a different name or some comment would help.
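
For context: PRP stands for Physical Region Page. An NVMe command carries two
PRP entries; prp1 points at the first page of the data buffer and prp2 is either
the second page or, for transfers spanning more than two pages, a pointer to a
PRP list of further page addresses. The method presumably builds these entries
and lists for a request's buffer, so a comment along those lines (or a more
descriptive name, e.g. something like build_prp_entries, purely as an
illustration) would help.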

{
nvme_controller_config_t cc;
cc.val = mmio_getl(&_control_reg->cc.val);
cc.iocqes = 4; // completion queue entry size 16B
wkozaczuk (Collaborator):

Do these comments match the values? Can we have consts here?
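
For reference: in the NVMe spec, CC.IOCQES and CC.IOSQES hold the entry sizes as
powers of two, so iocqes = 4 means 2^4 = 16-byte completion entries (and
iosqes = 6 would mean 2^6 = 64-byte submission entries). The comments therefore
do match the values; named constants would make that self-documenting.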

}

void nvme::create_io_queues_foreach_cpu()
{
wkozaczuk (Collaborator):

I see here a potential for a lot of code re-use between create_io_queue() and create_io_queues_foreach_cpu(). Can we have some helper method to implement common logic?


_admin_queue->submit_and_return_on_completion(std::move(cmd), (void*) mmu::virt_to_phys(active_namespaces), 4096);
int err;
for(int i=0; i < 1024; i++) {
wkozaczuk (Collaborator):

Why 1024? Can we have a const for this?
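
Presumably because an Identify Active Namespace ID list is returned in a
4096-byte buffer of 32-bit namespace IDs, i.e. at most 4096 / 4 = 1024 entries,
which would match the 4096 passed to submit_and_return_on_completion above. If
so, a named constant derived from that relationship would document it.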

if(sched::current_cpu->id >= _io_queues.size())
return _io_queues[0]->make_request(bio, nsid);

return _io_queues[sched::current_cpu->id]->make_request(bio, nsid);
wkozaczuk (Collaborator):

In this per-CPU queue setup, do we have to lock in make_request()? Do we handle scenarios where we have fewer or more queues than CPUs? Should we do something like modulo sched::cpus.size()?
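
One possible shape of that suggestion, purely as an illustration and assuming
at least one I/O queue has been created:

auto idx = sched::current_cpu->id % _io_queues.size();  // wrap around when there are fewer queues than CPUs
return _io_queues[idx]->make_request(bio, nsid);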

printf("admin interrupt registration failed\n");
}

bool nvme::msix_register(unsigned iv,
wkozaczuk (Collaborator):

This code seems to have a lot of overlap with arch/x64/msi.cc. Any reason not to re-use it?

@wkozaczuk wkozaczuk mentioned this pull request Jan 8, 2024
wkozaczuk added a commit to wkozaczuk/osv that referenced this pull request Jun 18, 2024
This patch implements the NVMe block device driver. It is largely based
on the pull request submitted by Jan Braunwarth (see
cloudius-systems#1284), so most credit goes to Jan.

As his PR explains, OSv can be started with an emulated NVMe disk on QEMU
like so:

./scripts/run.py --nvme

Compared to Jan's PR, this patch differs in the following ways:
- removes all non-NVMe changes (various bug fixes or ioctl enhancements
  are part of separate PRs)
- replaces most of the heap allocations with stack allocations, which
  should reduce some contention
- tweaks the PRP-handling code to use a lock-less ring buffer, which
  should further reduce contention when allocating memory
- fixes a bug in the I/O queue CQ handling to correctly determine whether
  the SQ is not full
- assumes a single namespace, 1 (most of the logic to deal with more has
  been preserved)
- reduces the I/O queue size to 64 instead of 256
- makes the code a little more DRY

Please note that, as Jan points out, the block cache logic of splitting
reads and writes into 512-byte requests causes very poor performance when
stress testing at the devfs level. However, this behavior is not
NVMe-specific and does not affect most applications, which go through the
VFS and a filesystem driver (ZFS, EXT, ROFS) that uses the strategy()
method and bypasses the block cache.

Based on my tests, the NVMe read performance (IOPS and bytes/s) is 60-70%
of virtio-blk on QEMU. I am not sure how much of that is due to this
implementation of the NVMe driver and how much is because virtio-blk is
by design much faster than anything emulated, including NVMe.

Closes cloudius-systems#1203

Signed-off-by: Jan Braunwarth <jan.braunwarth@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
@wkozaczuk wkozaczuk mentioned this pull request Jun 18, 2024
wkozaczuk added a commit that referenced this pull request Jun 22, 2024
@wkozaczuk

I am closing this PR as another one based on it - #1317 - has been merged into the master branch.

@wkozaczuk wkozaczuk closed this Aug 20, 2024