
Implement nvme driver #1284

Closed

Conversation

JanBraunwarth

I implemented an NVMe driver for OSv as part of my bachelor thesis and would like to share the results here.
I don't think it's ready for merging, since the performance is far from the theoretically possible IOPS compared to an Ubuntu VM, but I hope it can still serve as a base for future implementations.

The patch also fixes a simple bug that prevented addressing data at an offset greater than 2GB through the block cache. The cause: the block offset was stored in an int and later used to reconstruct the byte offset. The simple fix in buf.h should be integrated even if the rest of the patch isn't.
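
A minimal sketch of the failure mode (illustrative only, not the actual buf.h
code): the byte offset is reconstructed with int arithmetic, which overflows
once it exceeds 2 GB.

#include <cstdio>

int main() {
    int blkno = 5 * 1024 * 1024;                // block 5242880 => byte offset 2.5 GiB with 512B blocks
    long long broken = blkno * 512;             // int * int overflows past 2^31 (undefined behavior)
    long long fixed  = (long long)blkno * 512;  // widen before multiplying => correct 2684354560
    printf("broken=%lld fixed=%lld\n", broken, fixed);
    return 0;
}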

I tested with fio and the io-test included in the patch, both on a QEMU-emulated NVMe and via PCI passthrough. On real hardware, interrupts lead to noticeable overhead, resulting in better IOPS on the QEMU-emulated NVMe. Interrupt coalescing reduces the number of interrupts, which results in similar IOPS, but this can only be seen as a workaround. I'm not sure what goes wrong here.

Another factor in the disappointing performance is the block cache. Splitting every read/write into 512-byte requests is not advisable: most modern NVMe drives internally use 4k blocks, and splitting into 512-byte blocks results in 8+ times worse IOPS, especially since all the resulting requests are executed one after another. I included a read/write path that doesn't use the block cache and doesn't copy the data; it therefore does not work on stack addresses (virt_to_phys fails on these).
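
To put a number on the amplification: a single 4 KiB read becomes 8 sequential
512 B sub-requests, so even if each sub-request completed as quickly as one
full-sized request, the achievable 4 KiB IOPS drop to roughly 1/8, and every
sub-request additionally pays its own submission and completion (interrupt)
cost.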

The implemented driver allows OSv to boot from an emulated NVMe device and
to handle additional NVMe devices, whether emulated or passed through from
the host. Booting directly from an NVMe via PCI passthrough still needs to
be tested.

The NVMe driver creates nvme_queue_pairs to interact with the device
controller. An nvme_queue_pair manages a submission queue and the
corresponding completion queue.
The driver registers every namespace as a device and forwards requests
to the queues. Namespace 1 on the first NVMe drive is named "vblk0".
Further devices are named nvmeXnY, where X is the driver instance id
starting at 0 and Y is the namespace id starting at 1 (for example,
namespace 2 on the first controller shows up as nvme0n2).

Read/write requests on the device file go through the block cache layer.
This can reduce performance quite a bit, since the block cache splits
every request into 512B-sized sequential requests. Setting
NVME_DIRECT_RW_ENABLED in drivers/nvme.hh disables the block cache.
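
A hedged sketch of what the cache-bypassing path requires from callers
(memory::alloc_phys_contiguous_aligned is the OSv allocator mentioned later
in this review; the exact signatures are assumptions to verify):

// Illustrative only: the data buffer must be resolvable by virt_to_phys, so it
// must not live on the stack; a physically contiguous heap buffer works.
void* buf = memory::alloc_phys_contiguous_aligned(4096, 4096);  // assumed (size, alignment) order
// ... fill buf and submit it through the direct (NVME_DIRECT_RW_ENABLED) path ...
memory::free_phys_contiguous_aligned(buf);                      // assumed matching free helper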

All queues use MSI-X; one interrupt vector is registered for every queue.
There is very noticeable overhead when using PCI passthrough. It is reduced
by using interrupt coalescing, but this needs to be investigated further.

Added options to ./scripts/run.py (example invocations below):
--nvme to start OSv on an NVMe emulated by QEMU
--second-nvme-image to attach an additional image as an NVMe device
--pass-pci to pass through a PCI device from the host;
    the device needs to be bound to vfio-pci
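
For illustration, invocations might look like the following; the exact argument
forms for the two new options (image path, PCI address) are assumptions to be
checked against run.py:

./scripts/run.py --nvme
./scripts/run.py --second-nvme-image ./extra-disk.img
./scripts/run.py --pass-pci 0000:01:00.0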

drivers/blk_ioctl.hh implements the BLKGETSIZE64 and BLKFLSBUF
ioctls, which are used by fio.
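
For context, this is how fio-style code typically uses these ioctls on Linux
(a self-contained sketch; /dev/vblk0 is the device name from this patch's
naming scheme):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   // BLKGETSIZE64, BLKFLSBUF
#include <cstdint>
#include <cstdio>

int main() {
    int fd = open("/dev/vblk0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    uint64_t size_bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &size_bytes) == 0) {   // total device size in bytes
        printf("device size: %llu bytes\n", (unsigned long long)size_bytes);
    }
    ioctl(fd, BLKFLSBUF, 0);                           // flush buffered data for the device
    close(fd);
    return 0;
}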

drivers/io-test.cc is a simple IOPS test that can be activated by
building with conf_drivers_io_test=1; it runs during initialization
of the NVMe device.
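
Assuming OSv's usual way of passing conf_* flags to scripts/build, enabling the
test might look like this (unverified):

./scripts/build conf_drivers_io_test=1 image=native-example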

Signed-off-by: Jan Braunwarth <jan.braunwarth@gmail.com>

wkozaczuk commented Dec 11, 2023

Hi Jan,

I am very excited to see your contribution and very eager to eventually make it part of OSv!

Let me make some general comments, ask some questions here, and then start a high-level review.

Could you provide any performance test results and ways to replicate them? I see you mention fio and your io-test, but it would be interesting to see concrete numbers and/or output from these and a description of the test setup.

In general, as long as the driver is correct, gets reviewed, and any potential issues are addressed, I think we should merge it regardless of performance, unless it is unusable or some deeper design issues are found in your implementation.

Honestly, I am not surprised by OSv's poor disk I/O performance (see #763, #450, b5eadc3 that was rolled back, and this https://groups.google.com/g/osv-dev/c/0qLTPtHbNYg/m/-kbc3xWiBgAJ), especially given the block cache issues you came across, which I am familiar with. I wonder if you have had a chance to test your NVMe driver with ROFS (fs=rofs when building an image), which implements its own read-around cache (https://github.com/cloudius-systems/osv/blob/master/fs/rofs/rofs_cache.cc#L18) that bypasses the kernel block cache, interacts with the driver directly via strategy (bio->bio_dev->driver->devops->strategy(bio)), and reads 32K at once.
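
For reference, a read through strategy() roughly looks like this (a sketch from
memory of OSv's bio helpers; the exact field and helper names should be checked
against the rofs_cache.cc code linked above):

struct bio* bio = alloc_bio();
bio->bio_cmd    = BIO_READ;
bio->bio_dev    = dev;       // the block device, e.g. the one registered as vblk0
bio->bio_data   = buffer;    // physically resolvable buffer (not a stack address)
bio->bio_offset = offset;    // byte offset on the device
bio->bio_bcount = bytes;     // transfer size, e.g. 32K for the ROFS read-around cache
bio->bio_dev->driver->devops->strategy(bio);
int error = bio_wait(bio);   // wait for the driver to complete the bio
destroy_bio(bio);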

Regarding the tests, there are a number of disk I/O tests under tests/misc*c that may be of use (misc-zfs-, misc-bdev*, misc-concurrent-io.cc, misc-fs-stress.cc, and possibly others).

I suggest you create a separate PR to address the buf.h bug. Can you describe how one can replicate it? What would the crash look like?

Finally, have you had a chance to test this driver on a Nitro EC2 instance? For that, you would probably need to apply this ENA driver PR (see #1283).

@wkozaczuk (Collaborator) left a comment:

I have left some comments and questions. Nothing major.

wkozaczuk (Collaborator):

Could you please add a copyright statement to all your files? See this one for example:

/*
 * Copyright (C) 2017 Waldemar Kozaczuk
 * Inspired by original MFS implementation by James Root from 2015
 *
 * This work is open source software, licensed under the terms of the
 * BSD license as described in the LICENSE file in the top-level directory.
 */

wkozaczuk (Collaborator):

Could you please describe in comments what this test does and how one can run it? How different is it from various disk I/O-related tests located under tests/misc-*cc?

I also suggest we move this to the tests/ folder.

max_ios = 1 << 30;

printf("Start IO test dev : %s, IO size : %d\n",dev->name,io_size);
sched::thread *t;
wkozaczuk (Collaborator):

Instead of the OSv internal API for creating threads, I would suggest using the standard std::thread API so that the test can be built and run on Linux and we can easily compare the results. The same applies to other OSv-specific APIs (memory::alloc_phys_contiguous_aligned), unless there is a very specific reason to use them because the test is OSv-specific or exercises an OSv internal API.
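
For example, the worker threads could be spawned portably like this (a minimal
sketch; run_io_worker and num_threads stand in for whatever the test actually
does per thread):

#include <thread>
#include <vector>

void run_io_worker(int id);   // hypothetical per-thread I/O loop of the test

void run_all_workers(int num_threads) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; i++) {
        workers.emplace_back(run_io_worker, i);  // std::thread instead of sched::thread
    }
    for (auto& t : workers) {
        t.join();                                // wait for all workers to finish
    }
}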

sched::thread::wait_until([this] {
bool have_elements = this->completion_queue_not_empty();
if (!have_elements) {
this->enable_interrupts();
wkozaczuk (Collaborator):

Why do we need to enable interrupts here and then only disable them if have_elements is true?

@JanBraunwarth (Author) replied Dec 13, 2023:

The condition gets run after we have finished handling completion queue entries (CQEs) at the beginning of the loop in req_done (see wait_for_completion_queue_entries()). Interrupts should be disabled while we handle CQEs, so after we have finished handling them we need to enable interrupts again. If have_elements is true, we will continue in req_done, so we need to disable interrupts again.

At least for the disk I tested, the second check was unnecessary: if new CQEs arrive, the NVMe controller will send an interrupt as soon as the interrupt vector gets unmasked. I left it in because I wasn't totally sure that this is standard behavior.

The interrupt handling is analogous to the one in virtio: https://github.com/cloudius-systems/osv/blob/f11c0210d50e2534932beaae03360bd5b4810e81/drivers/virtio.cc#L156C1-L176C2
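
In other words, the pattern being described is roughly the following (a sketch
reusing the names from the hunk above, not a verbatim copy of the patch):

sched::thread::wait_until([this] {
    bool have_elements = this->completion_queue_not_empty();
    if (!have_elements) {
        this->enable_interrupts();
        // Re-check after unmasking: a completion that arrived while the vector
        // was masked might not trigger an interrupt on every controller, so
        // handle it now and mask again before going back to req_done.
        have_elements = this->completion_queue_not_empty();
        if (have_elements) {
            this->disable_interrupts();
        }
    }
    return have_elements;
});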

});
}

int nvme_queue_pair::map_prps(u16 cid, void* data, u64 datasize, u64* prp1, u64* prp2)
wkozaczuk (Collaborator):

What does this method do? prps is somewhat cryptic. Maybe a different name or some comment would help.
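
For context: PRP stands for Physical Region Page. An NVMe command carries two
PRP entries; prp1 points at the first page of the data buffer and prp2 is either
the second page or, for transfers spanning more than two pages, a pointer to a
PRP list of further page addresses. The method presumably builds these entries
and lists for a request's buffer, so a comment along those lines (or a more
descriptive name, e.g. something like build_prp_entries, purely as an
illustration) would help.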

{
nvme_controller_config_t cc;
cc.val = mmio_getl(&_control_reg->cc.val);
cc.iocqes = 4; // completion queue entry size 16B
wkozaczuk (Collaborator):

Do these comments match the values? Can we have consts here?
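
For reference: in the NVMe spec, CC.IOCQES and CC.IOSQES hold the entry sizes as
powers of two, so iocqes = 4 means 2^4 = 16-byte completion entries (and
iosqes = 6 would mean 2^6 = 64-byte submission entries). The comments therefore
do match the values; named constants would make that self-documenting.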

}

void nvme::create_io_queues_foreach_cpu()
{
wkozaczuk (Collaborator):

I see here a potential for a lot of code re-use between create_io_queue() and create_io_queues_foreach_cpu(). Can we have some helper method to implement common logic?


_admin_queue->submit_and_return_on_completion(std::move(cmd), (void*) mmu::virt_to_phys(active_namespaces), 4096);
int err;
for(int i=0; i < 1024; i++) {
wkozaczuk (Collaborator):

Why 1024? Can we have a const for this?
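
Presumably because an Identify Active Namespace ID list is returned in a
4096-byte buffer of 32-bit namespace IDs, i.e. at most 4096 / 4 = 1024 entries,
which would match the 4096 passed to submit_and_return_on_completion above. If
so, a named constant derived from that relationship would document it.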

if(sched::current_cpu->id >= _io_queues.size())
return _io_queues[0]->make_request(bio, nsid);

return _io_queues[sched::current_cpu->id]->make_request(bio, nsid);
wkozaczuk (Collaborator):

In this per-CPU queue setup, do we have to lock in make_request()? Do we handle scenarios where we have fewer or more queues than CPUs? Should we do something like modulo sched::cpus.size()?
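
One possible shape of that suggestion, purely as an illustration and assuming
at least one I/O queue has been created:

auto idx = sched::current_cpu->id % _io_queues.size();  // wrap around when there are fewer queues than CPUs
return _io_queues[idx]->make_request(bio, nsid);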

printf("admin interrupt registration failed\n");
}

bool nvme::msix_register(unsigned iv,
wkozaczuk (Collaborator):

This code seems to have a lot of overlap with arch/x64/msi.cc. Any reason not to re-use it?

@wkozaczuk wkozaczuk mentioned this pull request Jan 8, 2024
wkozaczuk added a commit to wkozaczuk/osv that referenced this pull request Jun 18, 2024
This patch implements the NVMe block device driver. It is largely based
on the pull request submitted by Jan Braunwarth (see
cloudius-systems#1284), so most credit goes to Jan.

As his PR explains, OSv can be started with an emulated NVMe disk on QEMU
like so:

./scripts/run.py --nvme

Compared to Jan's PR, this patch differs in the following ways:
- removes all non-NVMe changes (various bug fixes or ioctl enhancements
  are part of separate PRs)
- replaces most of the heap allocations with stack allocations, which
  should reduce some contention
- tweaks the PRP-handling code to use a lock-less ring buffer, which
  should further reduce contention when allocating memory
- fixes a bug in the I/O queue CQ handling to correctly determine whether
  the SQ is not full
- assumes a single namespace, 1 (most of the logic to deal with more has
  been preserved)
- reduces the I/O queue size to 64 instead of 256
- makes the code a little more DRY

Please note that, as Jan points out, the block cache logic of splitting
reads and writes into 512-byte requests causes very poor performance when
stress testing at the devfs level. However, this behavior is not
NVMe-specific and does not affect most applications, which go through the
VFS and a filesystem driver (ZFS, EXT, ROFS) that uses the strategy()
method and bypasses the block cache.

Based on my tests, the NVMe read performance (IOPS and bytes/s) is 60-70%
of virtio-blk on QEMU. I am not sure how much of that is due to this
implementation of the NVMe driver and how much is because virtio-blk is
by design much faster than anything emulated, including NVMe.

Closes cloudius-systems#1203

Signed-off-by: Jan Braunwarth <jan.braunwarth@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
@wkozaczuk wkozaczuk mentioned this pull request Jun 18, 2024
wkozaczuk added a commit that referenced this pull request Jun 22, 2024
@wkozaczuk

I am closing this PR as another one based on it - #1317 - has been merged into the master branch.

@wkozaczuk wkozaczuk closed this Aug 20, 2024