Implement nvme driver #1284
Conversation
The implemented driver allows OSv to boot from an emulated NVMe device and to handle additional NVMe devices, either emulated or passed through from the host. Booting directly from an NVMe device via PCI passthrough still needs to be tested.

The nvme driver creates nvme_queue_pairs to interact with the device controller. A nvme_queue_pair manages a submission queue and the corresponding completion queue. The driver registers every namespace as a device and forwards requests to the queues. Namespace 1 on the first NVMe drive is named "vblk0". Further devices are named nvmeXnY, where X is the driver instance id starting at 0 and Y is the namespace id starting at 1.

Read/write requests on the device file go through the block cache layer. This can reduce performance quite a bit, since the block cache splits every request into 512B-sized sequential requests. Setting NVME_DIRECT_RW_ENABLED in drivers/nvme.hh disables the block cache.

All queues use MSI-X; one interrupt vector is registered for every queue. There is very noticeable overhead when using PCI passthrough. It is reduced by interrupt coalescing but needs to be investigated further.

New options for ./scripts/run.py (see the usage example below):
- --nvme to start OSv on an NVMe device emulated by QEMU
- --second-nvme-image to attach an additional image as an NVMe device
- --pass-pci to pass through a PCI device from the host; the device needs to be bound to vfio-pci

drivers/blk_ioctl.hh implements the BLKGETSIZE64 and BLKFLSBUF ioctls, which are used by fio. drivers/io-test.cc is a simple IOPS test that can be activated by building with conf_drivers_io_test=1 and runs during initialization of the NVMe device.

Signed-off-by: Jan Braunwarth <jan.braunwarth@gmail.com>
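A quick usage illustration of these options (the image path and PCI address below are placeholders, not values from the PR):

# boot OSv from an NVMe disk emulated by QEMU
./scripts/run.py --nvme

# attach an additional image as a second NVMe device (image path is an example)
./scripts/run.py --nvme --second-nvme-image ./disk2.img

# pass through a host PCI device previously bound to vfio-pci (address is an example)
./scripts/run.py --pass-pci 0000:01:00.0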
Hi Jan, I am very excited to see your contribution and very eager to eventually make it part of OSv! Let me make some general comments, ask some questions here, and then start a high-level review.

Could you provide any performance test results and ways to replicate them? I see you mention

In general, as long as the driver is correct and gets reviewed, and any potential issues are addressed, I think we should merge it regardless of performance, unless it is unusable or some deeper design issues are found in your implementation. Honestly, I am not surprised by OSv's poor disk I/O performance (see #763, #450, b5eadc3 that was rolled back, and https://groups.google.com/g/osv-dev/c/0qLTPtHbNYg/m/-kbc3xWiBgAJ), especially given the block cache issues you came across, which I am familiar with.

I wonder if you have had a chance to test your NVMe driver with ROFS (fs=rofs when building an image), which implements its own read-around cache (https://github.com/cloudius-systems/osv/blob/master/fs/rofs/rofs_cache.cc#L18) that bypasses the kernel cache and interacts with the driver directly (see line 74 in f11c021).
Regarding the tests, there are a number of disk I/O tests under I suggest you create a separate PR to address the Finally, have you had a chance to test this driver on a Nitro EC2 instance? For that, you would probably need to apply this ENA driver PR (see #1283).
I have left some comments and questions. Nothing major.
Could you please add a copyright statement to all your files? See this one for example (lines 1 to 7 in f11c021):

/*
 * Copyright (C) 2017 Waldemar Kozaczuk
 * Inspired by original MFS implementation by James Root from 2015
 *
 * This work is open source software, licensed under the terms of the
 * BSD license as described in the LICENSE file in the top-level directory.
 */
Could you please describe in comments what this test does and how one can run it? How different is it from the various disk I/O-related tests located under tests/misc-*cc? I also suggest we move this to the tests/ folder.
max_ios = 1 << 30;

printf("Start IO test dev : %s, IO size : %d\n",dev->name,io_size);
sched::thread *t;
Instead of the OSv-internal API to create threads, I would suggest using the standard std::thread API so that the test can be built and run on Linux and we can easily compare the results. The same applies to other OSv-specific APIs (memory::alloc_phys_contiguous_aligned), unless there is a very specific reason to use those because the test is OSv-specific or tests OSv-internal API.
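For illustration, a minimal sketch of this suggestion: driving the test workers with std::thread so the same code also builds and runs on Linux. The worker body and names below are hypothetical, not taken from io-test.cc:

#include <thread>
#include <vector>
#include <atomic>
#include <cstdio>

// Hypothetical stand-in for the per-thread IO loop of the test.
static void io_worker(std::atomic<long>& total_ios)
{
    for (int i = 0; i < 1000; i++) {
        // issue one IO here (omitted in this sketch); count it
        total_ios.fetch_add(1, std::memory_order_relaxed);
    }
}

int main()
{
    std::atomic<long> total_ios{0};
    std::vector<std::thread> workers;
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) {
        n = 1;
    }
    for (unsigned i = 0; i < n; i++) {
        workers.emplace_back(io_worker, std::ref(total_ios));
    }
    for (auto& t : workers) {
        t.join();   // portable replacement for the OSv-specific thread join
    }
    printf("completed %ld IOs across %u threads\n", total_ios.load(), n);
    return 0;
}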
sched::thread::wait_until([this] {
    bool have_elements = this->completion_queue_not_empty();
    if (!have_elements) {
        this->enable_interrupts();
Why do we need to enable interrupts here and then only disable them if have_elements is true?
The condition gets run after we finish handling completion queue entries (cqes) at the beginning of the loop in req_done (line 291 in 2b84047: wait_for_completion_queue_entries()).

At least for the disk I tested, the second check was unnecessary: if we got new cqes, the NVMe controller will send an interrupt as soon as the interrupt vector gets unmasked. I left it in because I wasn't totally sure that this is standard behavior. The interrupt handling is analogous to the one in virtio: https://github.com/cloudius-systems/osv/blob/f11c0210d50e2534932beaae03360bd5b4810e81/drivers/virtio.cc#L156C1-L176C2
});
}
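As an aside, the "check, enable interrupts, re-check" pattern discussed above can be modelled in isolation like this. This is a toy model with dummy enable/disable calls, not driver code; it only shows why the second check closes the race window between the first check and unmasking the vector:

#include <atomic>
#include <cstdio>

// Toy model: completion_pending stands in for "the completion queue has new
// entries"; in the real driver it changes concurrently when the device posts one.
static std::atomic<bool> completion_pending{false};
static void enable_interrupts()  { printf("interrupt vector unmasked\n"); }
static void disable_interrupts() { printf("interrupt vector masked\n"); }

// Returns true when the caller may proceed to drain the completion queue.
static bool wait_condition()
{
    bool have_elements = completion_pending.load();
    if (!have_elements) {
        // Nothing queued: unmask the vector so the device can wake us up,
        // then check once more in case an entry arrived in between.
        enable_interrupts();
        have_elements = completion_pending.load();
        if (have_elements) {
            disable_interrupts();
        }
    }
    return have_elements;
}

int main()
{
    printf("proceed: %d\n", wait_condition());  // empty queue: keep waiting
    completion_pending = true;                  // an entry "arrives"
    printf("proceed: %d\n", wait_condition());  // non-empty: proceed immediately
    return 0;
}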
int nvme_queue_pair::map_prps(u16 cid, void* data, u64 datasize, u64* prp1, u64* prp2) |
What does this method do? prps is somewhat cryptic. Maybe a different name or some comment would help.
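For context, PRP stands for Physical Region Page in the NVMe specification: PRP1 carries the physical address of the first data page (possibly with an offset), and PRP2 carries either the second page address or, for transfers spanning more than two pages, a pointer to a PRP list. A toy calculation of how many pages a transfer touches (illustrative only, not the PR's map_prps code):

#include <cstdint>
#include <cstdio>

constexpr uint64_t PAGE_SIZE = 4096;

// Number of 4K pages touched by a transfer of `datasize` bytes at `addr`.
// One page: PRP2 unused; two pages: PRP2 = second page; more: PRP2 = PRP list.
static unsigned prp_pages_needed(uint64_t addr, uint64_t datasize)
{
    uint64_t first = addr & ~(PAGE_SIZE - 1);
    uint64_t last  = (addr + datasize - 1) & ~(PAGE_SIZE - 1);
    return (last - first) / PAGE_SIZE + 1;
}

int main()
{
    // 8 KiB starting 512 bytes into a page touches 3 pages,
    // so PRP2 has to point at a PRP list rather than a single page.
    printf("pages: %u\n", prp_pages_needed(0x1000 + 512, 8192));
    return 0;
}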
{
    nvme_controller_config_t cc;
    cc.val = mmio_getl(&_control_reg->cc.val);
    cc.iocqes = 4; // completion queue entry size 16B
Do these comments match the values? Can we have consts here?
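For reference, CC.IOCQES and CC.IOSQES are encoded as log2 of the entry size in the NVMe specification, so iocqes = 4 does mean a 16-byte completion entry and the comment matches the value. Named constants, as suggested, could look roughly like this (the names are made up for illustration):

#include <cstdio>

// NVMe CC.IOCQES / CC.IOSQES hold log2 of the queue entry sizes.
constexpr unsigned NVME_CQ_ENTRY_SIZE_LOG2 = 4;   // 2^4 = 16 B completion entry
constexpr unsigned NVME_SQ_ENTRY_SIZE_LOG2 = 6;   // 2^6 = 64 B submission entry (the usual value)

int main()
{
    printf("CQ entry: %u bytes, SQ entry: %u bytes\n",
           1u << NVME_CQ_ENTRY_SIZE_LOG2, 1u << NVME_SQ_ENTRY_SIZE_LOG2);
    return 0;
}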
}
void nvme::create_io_queues_foreach_cpu()
{
I see here a potential for a lot of code re-use between create_io_queue() and create_io_queues_foreach_cpu(). Can we have some helper method to implement common logic?
_admin_queue->submit_and_return_on_completion(std::move(cmd), (void*) mmu::virt_to_phys(active_namespaces), 4096);
int err;
for(int i=0; i < 1024; i++) {
Why 1024? Can we have a const for this?
if(sched::current_cpu->id >= _io_queues.size())
    return _io_queues[0]->make_request(bio, nsid);

return _io_queues[sched::current_cpu->id]->make_request(bio, nsid);
In this per-cpu queue setup, do we have to lock in make_request()? Do we handle scenarios when we have fewer or more queues than CPUs? Should we do something like modulo sched::cpus.size()?
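A toy model of the modulo mapping hinted at above, so that CPUs beyond the number of queues spread across all queues instead of all falling back to queue 0 (names are illustrative; the PR indexes _io_queues by sched::current_cpu->id):

#include <cstddef>
#include <cstdio>

// Pick an I/O queue for a CPU when there may be fewer queues than CPUs.
static size_t pick_queue(size_t cpu_id, size_t num_queues)
{
    return cpu_id % num_queues;   // num_queues must be > 0
}

int main()
{
    const size_t num_queues = 4;
    for (size_t cpu = 0; cpu < 6; cpu++) {
        printf("cpu %zu -> queue %zu\n", cpu, pick_queue(cpu, num_queues));
    }
    return 0;
}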
printf("admin interrupt registration failed\n"); | ||
} | ||
|
||
bool nvme::msix_register(unsigned iv, |
This code seems to have a lot of overlap with arch/x64/msi.cc. Any reason not to re-use it?
This patch implements the NVMe block device driver. It is greatly based on the pull request submitted by Jan Braunwarth (see #1284), so most credit goes to Jan. As his PR explains, OSv can be started with an emulated NVMe disk on QEMU like so:

./scripts/run.py --nvme

Compared to Jan's PR, this patch differs in the following ways:
- removes all non-NVMe changes (various bug fixes and ioctl enhancements are part of separate PRs)
- replaces most of the heap allocations with the stack, which should reduce some contention
- tweaks the PRP-handling code to use a lock-less ring buffer, which should further reduce contention when allocating memory
- fixes a bug in I/O queue CQ handling to correctly determine whether the SQ is not full
- assumes a single namespace - 1 (most of the logic to deal with more has been preserved)
- reduces the I/O queue size to 64 instead of 256
- makes the code a little more DRY

Please note that, as Jan points out, the block cache logic of splitting reads and writes into 512-byte requests causes very poor performance when stress testing at the devfs level. However, this behavior is not NVMe specific and does not affect most applications, which go through a VFS and filesystem driver (ZFS, EXT, ROFS) that uses the strategy() method and does not use the block cache.

Based on my tests, NVMe read performance (IOPS and bytes/s) is 60-70% of virtio-blk on QEMU. I am not sure how much of that is because of this implementation of the NVMe driver and how much because virtio-blk is by design much faster than anything emulated, including NVMe.

Closes #1203

Signed-off-by: Jan Braunwarth <jan.braunwarth@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
I am closing this PR as another one based on it - #1317 - has been merged into the master branch.
I implemented an NVMe driver for OSv as part of my bachelor thesis and would like to share the results here.

I don't think it's ready for merging, since the performance is far from the theoretically possible IOPS compared to an Ubuntu VM, but I hope it can still serve as a base for future implementations.

The patch also fixes a simple bug that prevented addressing data with an offset greater than 2GB through the block cache. Cause: the block offset was stored in an int and later used to reconstruct the byte offset. The simple fix in buf.h should be integrated even if the rest of the patch isn't.

I tested with fio and the io-test included in the patch, both on a QEMU-emulated NVMe device and via PCI passthrough. On real hardware, interrupts lead to noticeable overhead, resulting in better IOPS on the QEMU-emulated NVMe. Interrupt coalescing reduces the number of interrupts, which results in similar IOPS, but this can only be seen as a workaround. I'm not sure what goes wrong here.

Another factor in the disappointing performance is the block cache. Splitting every read/write into 512-byte requests is not advisable: most modern NVMe drives internally use 4K blocks, and splitting into 512-byte blocks will result in 8+ times worse IOPS, especially since all the resulting requests are done one after another. I included a read/write path that doesn't use the block cache and doesn't copy the data; it therefore does not work on stack addresses (virt_to_phys fails on these).
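To illustrate the last point: with the direct, zero-copy path the buffer handed to the driver must be translatable by virt_to_phys, so it cannot live on the stack. A minimal sketch of a caller, assuming only that a heap allocation satisfies this (the read call itself is a placeholder, not the PR's API):

#include <cstdlib>
#include <cstring>
#include <cstdio>

int main()
{
    // char stack_buf[4096];              // NOT usable: virt_to_phys fails on stack addresses

    // Heap memory aligned to the 4K block size can be translated to a
    // physical address and handed to the device without copying.
    void* buf = std::aligned_alloc(4096, 4096);
    if (!buf) {
        return 1;
    }
    // pread(fd, buf, 4096, 0);           // placeholder for the actual device read
    memset(buf, 0, 4096);
    printf("buffer at %p ready for zero-copy IO\n", buf);
    std::free(buf);
    return 0;
}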