
NVMf offload #4

Closed
wants to merge 52 commits into from

Conversation

EugeneKochetov

The first version of NVMf offload implementation in SPDK. Not buildable yet and breaks some abstractions.

@@ -363,6 +392,9 @@ struct spdk_nvmf_rdma_poll_group {
/* Assuming rdma_cm uses just one protection domain per ibv_context. */
struct spdk_nvmf_rdma_device {
struct ibv_device_attr attr;
#ifdef SPDK_CONFIG_NVMF_OFFLOAD


I think ibv_device_attr_ex shouldn't depend on SPDK_CONFIG_NVMF_OFFLOAD. We can reuse it for other features.
It's better to add it under something like SPDK_EXTEND_OFED.

@sashakot sashakot added the WIP Work in progress label Oct 10, 2018
@sashakot

Hardware NVMe-oF target offload

Introduction

NVMe-over-Fabrics target offload allows an HCA to offload the complete NVMe-oF
protocol datapath at the target (storage server) side, when the backend storage
devices are locally attached NVMe PCI devices.

After correctly setting up the offload and connections to clients, every
read/write/flush operation may be completely processed in the HCA at the target
side. No software runs on the CPU to process those offloaded IO operations.
The HCA uses the PCIe peer-to-peer capability to "talk" directly to the NVMe
drives over PCI, so the system architecture must allow such peer-to-peer
communication.

The software at the server side is in charge of configuring the feature and
managing the NVMe-oF control communication with the clients (via the NVMe-oF admin
queue). In response to these communications, connected QPs are created with each
client (by means of RDMA-CM, as defined in the NVMe-oF standard), which
represent NVMe-oF SQ and CQ pairs. Once a connection is created, the QP is
handed to the device to start offloading all IO commands.

Software is also required to handle any error cases, and IO commands that were
not configured to be offloaded.

NVMe-oF target offload datapath

Once properly configured and connections are established, the HCA will:

  • Parse the RECVed NVMe-oF command capsule and determine whether it is a READ
    / WRITE / FLUSH operation that should be offloaded.
  • If this is a WRITE, the HCA will RDMA_READ the data from the client to local
    memory (unless it was inline with the command).
  • The HCA will strip the NVMe command from the capsule, place it in an NVMe
    submit queue, and write to the submit queue doorbell.
  • The HCA will poll the NVMe completion queue, and write to the completion queue
    doorbell.
  • If this is a READ, the HCA will RDMA_WRITE the data from the local memory to
    the client.
  • The HCA will SEND the NVMe completion in a response capsule back to the
    client.

NVMe-oF target offload configuration

Setting up NVMe-oF target offload requires a few steps:

  1. Identify NVMe-oF offload capabilities of the device
  2. Creating a SRQ with NVMe-oF offload attributes, to represent a single NVMe-oF
    subsystem
  3. Creating NVMe backend device objects to represent locally attached NVMe
    subsystems
  4. Setting up mappings between front-end facing namespace ids to a specific
    backend NVMe objects and namespace ids
  5. Creating QPs connected with clients (using RDMA-CM, not in the scope of this
    document), bound to an SRQ with NVMe-oF offload
  6. Modifying QP to enable NVMe-oF offload

Identify NVMe-oF offload capabilities

Software should call ibv_query_device_ex() and test the returned
ibv_device_attr_ex.comp_mask for the availability of NVMe-oF offload. If
available, the ibv_device_attr_ex.nvmf_caps struct holds the exact offload
capabilities and parameters of the device. These should be considered later
during the configuration.
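
As a rough illustration, a capability check could look like the sketch below. ibv_query_device_ex() and ibv_device_attr_ex are standard verbs; the IBV_DEVICE_ATTR_NVMF comp_mask bit and the nvmf_caps member are assumed names for the proposed extension, not taken from a released header.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch of the capability check for the proposed NVMe-oF offload extension.
 * IBV_DEVICE_ATTR_NVMF and attr_ex.nvmf_caps are placeholders. */
static int nvmf_offload_supported(struct ibv_context *ctx)
{
    struct ibv_device_attr_ex attr_ex = {};

    if (ibv_query_device_ex(ctx, NULL, &attr_ex)) {
        perror("ibv_query_device_ex");
        return 0;
    }

    if (!(attr_ex.comp_mask & IBV_DEVICE_ATTR_NVMF)) {  /* assumed bit name */
        fprintf(stderr, "NVMe-oF target offload is not available\n");
        return 0;
    }

    /* attr_ex.nvmf_caps (assumed member) now holds the offload limits that
     * must be respected in the later configuration steps. */
    return 1;
}
```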

Creating a SRQ with NVMe-oF offload attributes

An SRQ with NVMe-oF target offload represents a single NVMe-oF subsystem (a
storage target) to the fabric. Software should call ibv_create_srq_ex() with
ibv_srq_init_attr_ex.srq_type set to nvmf_target_offload and
ibv_srq_init_attr_ex.nvmf_attr set to the specific offload parameters
requested for this SRQ. Parameters should be within the boundaries of the
respective capabilities. Along with the parameters, a staging buffer is provided
for the device to use during the offload. This is a piece of memory allocated,
registered and provided via {mr, addr, len}. Software should not modify this
memory after creating the SRQ: the device manages it by itself and uses it to
store data that is in transit between the network and the NVMe device.

Note that this SRQ still has a receive queue. The HCA will deliver to software
received commands that are not offloaded, as well as commands received on QPs
attached to the SRQ that do not have NVMF_OFFLOAD enabled.
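
A minimal sketch of the SRQ creation follows, assuming an IBV_SRQT_NVMF srq_type value and illustrative nvmf_attr field names; the document only specifies that the staging buffer is passed as {mr, addr, len}, so the exact names are assumptions.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: creating the NVMe-oF offload SRQ. IBV_SRQT_NVMF and the nvmf_attr
 * field names are assumptions; ibv_reg_mr(), ibv_create_srq_ex() and the
 * IBV_SRQ_INIT_ATTR_* comp_mask bits are standard verbs. */
static struct ibv_srq *create_nvmf_srq(struct ibv_context *ctx,
                                       struct ibv_pd *pd,
                                       void *staging, size_t staging_len)
{
    struct ibv_mr *staging_mr;
    struct ibv_srq_init_attr_ex attr = {};

    /* The staging buffer must be registered before the SRQ is created and
     * must not be touched by software afterwards. */
    staging_mr = ibv_reg_mr(pd, staging, staging_len, IBV_ACCESS_LOCAL_WRITE);
    if (!staging_mr) {
        return NULL;
    }

    attr.attr.max_wr = 1024;
    attr.attr.max_sge = 1;
    attr.comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD;
    attr.srq_type = IBV_SRQT_NVMF;                   /* assumed enum value */
    attr.pd = pd;
    /* A dedicated comp_mask bit for nvmf_attr would presumably be needed
     * as well; the field names below are illustrative only. */
    attr.nvmf_attr.staging_buf_mr = staging_mr;
    attr.nvmf_attr.staging_buf_addr = (uintptr_t)staging;
    attr.nvmf_attr.staging_buf_len = staging_len;

    return ibv_create_srq_ex(ctx, &attr);
}
```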

Creating NVMe backend device objects

For the SRQ with NVMe-oF target offload feature to be able to submit work to
attached NVMe devices, software must provide the details of where to find the
NVMe submit queue, completion queue and their respective doorbells. How these
NVMe SQ, CQ and DBs are created is out of scope for this document. Normally
there should be an NVMe driver that owns the NVMe Admin Queue. By submitting
commands to this Admin Queue, SQs, CQs and DBs are generated. Software should
call ibv_srq_create_nvme_ctrl() with a set of NVMe {SQ, CQ, SQDB, CQDB} to
create an ibv_nvme_ctrl instance representing a specific NVMe backend
controller. These {SQ, CQ, SQDB, CQDB} should have been created exclusively for
this NVMe backend controller object, and this NVMe backend controller can be
used exclusively with the SRQ it was created for. SQ, CQ, SQDB and CQDB are all
provided by means of an MR, an address and possibly a length (doorbells don't need
a length as they have a fixed 32-bit size). This means that those structures need to
be registered using ibv_reg_mr() before the ibv_nvme_ctrl can be created.

Additionally, SQDB and CQDB initial values are provided.

Having NVMe objects created on the SRQ does not yet allow servicing NVMe-oF IOs to
clients. Namespace mappings that use these NVMe objects must be added first.

Setting up namespace mappings

When a client connects to an NVMe-oF subsystem, it will ask for the list of
namespaces on that subsystem. Each namespace is identified by a namespace id
(nsid), which is then part of every IO request. The SRQ with the NVMe-oF target
offload feature enabled will look at this nsid and map it to a specific nsid in
one of the NVMe backend objects created with it. Software should call
ibv_map_nvmf_nsid() to add such mappings to an SRQ. Each mapping consists of a
fabric-facing nsid and a set of {nvme_ctrl, nvme_nsid}, so IO operations
arriving from the network for that nsid will be submitted to nvme_ctrl, possibly
with a different nvme_nsid. Software may create as many front-facing namespaces
as needed, and map them to different namespaces within the same nvme_ctrl or to
namespaces in different nvme_ctrls. However, as noted before, an nvme_ctrl may
only be used in mappings for the same SRQ it was created for.

After adding at least one namespace mapping, the SRQ, acting as an NVMe-oF target
subsystem, is ready to service IOs.
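
A sketch of the mapping step; the prototype of ibv_map_nvmf_nsid() is an assumption, since the document only states that each mapping ties a fabric-facing nsid to {nvme_ctrl, nvme_nsid}.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: expose backend namespace 3 of an ibv_nvme_ctrl as fabric nsid 1.
 * The argument order of ibv_map_nvmf_nsid() is assumed. */
static int map_namespace(struct ibv_srq *srq, struct ibv_nvme_ctrl *nvme_ctrl)
{
    uint32_t fabric_nsid = 1;   /* nsid the NVMe-oF client will see */
    uint32_t backend_nsid = 3;  /* nsid inside the local NVMe controller */

    /* After at least one successful mapping, the subsystem represented by
     * this SRQ can service offloaded IOs for that namespace. */
    return ibv_map_nvmf_nsid(srq, fabric_nsid, nvme_ctrl, backend_nsid);
}
```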

Creating QPs

This stage is no different from any other QP creation and association with an
SRQ. The NVMe-oF standard requires that the first command on a QP (which
represents an NVMe SQ) be the CONNECT command capsule, and that any other
command be responded to with an error. To meet the standard, software should not
enable the QP's NVMe-oF offload (see the next section) until after seeing the
CONNECT command. If a command other than CONNECT is received, software should
respond with an error.

Modifying QP to enable NVMe-oF offload

Once a CONNECT command has been received, software can modify the QP to enable
its NVMe-oF offload using ibv_modify_qp_nvmf(). From this point on, the HCA
takes ownership of the QP and inspects each command capsule received by the
SRQ; if the command should be offloaded, the flow described above is followed.

Note that enabling NVMe-oF offload on the QP at creation time exposes the
solution to a possible standard violation: if an IO command capsule arrives
before the CONNECT request, the device will service it.
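
A hedged sketch of that sequencing; ibv_modify_qp_nvmf() is the proposed call, and whether it takes a simple enable flag or an attribute structure is an assumption.

```c
#include <infiniband/verbs.h>

/* Sketch: enable the offload only after the CONNECT capsule has been handled
 * in software, to stay within the NVMe-oF standard. */
static int enable_qp_offload(struct ibv_qp *qp)
{
    /* From this point on the HCA owns the QP and parses every command capsule
     * that lands in the NVMe-oF offload SRQ the QP is bound to. */
    return ibv_modify_qp_nvmf(qp, 1 /* enable; assumed second argument */);
}
```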

Errors and exceptions

Software should properly handle the following errors and exceptions:

  1. Handle a non-offloaded IO request
  2. Handle async events with QP type
  3. Handle async events with NVME_CTRL type

Handle a non-offloaded IO request

This should be considered a normal exception when the SRQ was configured to
offload only part of the IO requests. In this case, software receives the
completion on the CQ associated with the QP, with the request residing in the
SRQ. Software should process the request; it is allowed to generate RDMA
operations (reads, writes, sends) on the relevant QP in order to properly
terminate the transaction.
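
A sketch of that path is shown below. ibv_poll_cq() and struct ibv_wc are standard verbs; handle_capsule_in_sw() is a hypothetical helper, and how the capsule is located from wr_id depends on how receive buffers were posted to the SRQ.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical helper: parse the non-offloaded capsule and complete it with
 * RDMA read/write/send work requests on the originating QP. */
void handle_capsule_in_sw(uint64_t wr_id, uint32_t qp_num);

/* Sketch: drain non-offloaded commands from the CQ associated with the QP. */
static void poll_non_offloaded(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* Transport error: tear down the connection. */
            continue;
        }
        if (wc.opcode == IBV_WC_RECV) {
            /* wc.wr_id identifies the SRQ receive buffer holding the command
             * capsule that was not offloaded. */
            handle_capsule_in_sw(wc.wr_id, wc.qp_num);
        }
    }
}
```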

Handle async events with QP type

Software should listen for async events using ibv_get_async_event(). If an
unrecoverable transport error happens on one of the offloaded QPs, the QP will
move to the error state and flush its queue. Since in normal operation software
may not post to such a QP or expect completions on it, the HCA will report an
async event indicating that this QP has moved to the error state. Software
should treat this as any other QP in error, i.e. close the connection and
release all its resources.

Handle async events with NVME_CTRL type

In case of an unrecoverable error in HCA communication with an NVMe device, the
HCA will report an async event indicating an error with the NVME_CTRL. Software
is expected to remove this NVMe object and its related mappings.
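
A sketch of the async event loop covering both cases; ibv_get_async_event() and ibv_ack_async_event() are standard verbs calls, and only IBV_EVENT_QP_FATAL below is a standard event type. The controller-level event would carry whatever type the proposed extension defines.

```c
#include <infiniband/verbs.h>

/* Sketch: handle async events for offloaded QPs and NVMe backends. */
static void handle_async_events(struct ibv_context *ctx)
{
    struct ibv_async_event ev;

    while (ibv_get_async_event(ctx, &ev) == 0) {
        switch (ev.event_type) {
        case IBV_EVENT_QP_FATAL:
            /* An offloaded QP moved to the error state: close the connection
             * (ev.element.qp) and free its resources, as for any other QP in
             * error. */
            break;
        default:
            /* An NVME_CTRL-type event (assumed name) would mean the HCA lost
             * communication with an NVMe backend: remove the ibv_nvme_ctrl
             * and all namespace mappings that reference it. */
            break;
        }
        ibv_ack_async_event(&ev);
    }
}
```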

@@ -456,6 +456,28 @@ struct spdk_nvmf_host *spdk_nvmf_subsystem_get_next_host(struct spdk_nvmf_subsys
*/
const char *spdk_nvmf_host_get_nqn(struct spdk_nvmf_host *host);

/**


Add Mellanox's copyright at the top of the file

lib/nvmf/ctrlr.c Outdated
@@ -407,6 +407,14 @@ spdk_nvmf_ctrlr_connect(struct spdk_nvmf_request *req)
return SPDK_NVMF_REQUEST_EXEC_STATUS_ASYNCHRONOUS;
}
} else {
#ifdef SPDK_CONFIG_NVMF_OFFLOAD
if (spdk_nvmf_subsystem_get_offload(subsystem)) {
if (!qpair->transport->qpair_enable_offload(qpair, subsystem)) {


Should we add a special function to "transport" that checks whether the transport supports offload?

lib/nvmf/rdma.c Outdated
rc = ibv_query_device(device->context, &device->attr);
#else
rc = ibv_query_device_ex(device->context, NULL, &device->attr_ex);


The rc value from the previous call should be checked before making this call


@sashakot sashakot left a comment


We need to check that an "offloaded" subsystem has only a single namespace

Shuhei Matsumoto and others added 3 commits October 10, 2018 17:19
Large read I/O will be typical in some use cases such as
web stream services. On the other hand, large write I/O
may not be typical but will be sufficiently probable.

Currently, when a large I/O is submitted to the RAID bdev,
the I/O is divided by its strip size and the divided
I/Os are submitted sequentially.

This patch tries to improve the performance of the RAID bdev
in large I/Os. Besides, when the RAID bdev supports higher
levels of RAID (such as RAID5), it should issue multiple
I/Os to multiple base bdevs in a batched fashion in the parity
update. Having experience with batched I/O will be helpful
in future cases too.

In this patch, submit split I/Os by batch until all child IOVs
are consumed or all data are submitted. If all child IOVs are
consumed before all data are submitted, wait until all batched
split I/Os complete and then submit again.

In this patch, test code is added too.

Change-Id: If6cd81cc0c306e3875a93c39dbe4288723b78937
Signed-off-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
Reviewed-on: https://review.gerrithub.io/424770
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
spdk_mem_map_translate() dereferences a uint64_t * to get an
8-byte integer, but nvme_rdma_build_sgl_request() just passes
a 4-byte integer as the last parameter, which causes a
stack-buffer-overflow error.

Reported in https://ci.spdk.io/spdk/builds/review/3ba5ea908781fc5ad311d81bae0b7022ad7b5c51.1539172863/fedora-05/build.log

Change-Id: Id1cda22114fef466dbb930b502e3a68310331f0e
Signed-off-by: wuzhouhui <wuzhouhui@kingsoft.com>
Reviewed-on: https://review.gerrithub.io/428693
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Purpose: We need to get the port info in other applications
(e.g., NVMe-oF TCP/IP transport)

Change-Id: I3a4636e764e44425436bb064cb0062c6f3e44035
Signed-off-by: Ziye Yang <optimistyzy@gmail.com>
Reviewed-on: https://review.gerrithub.io/428313
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Tomasz Kulasek <tomaszx.kulasek@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
@mike-dubman

@yshestakov, @sashakot - when do you plan to connect it to CI?

@sashakot

CI is not ready yet. @yshestakov will update when it's ready

Seth5141 and others added 17 commits October 11, 2018 18:50
The way they were previously being checked was triggering the trap and
printing out a "Configuration failed" message even though the
configuration was successful.

Change-Id: I9de4f390c603631ebf5af5555ea7164aae2b6213
Signed-off-by: Seth Howell <seth.howell@intel.com>
Reviewed-on: https://review.gerrithub.io/428663
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
In spdk_mem_map_alloc() we only do the memory walk when
notify_cb is provided, but spdk_mem_map_free() does the
memory walk unconditionally. Not anymore.

Change-Id: Ic8dfdc5cb2c99dc58e62ab0523cf5a18ba8691cc
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428722
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Newly registered mem maps weren't notified at vaddr 0
as we used the value "0" to denote an uninitialized, unset
variable. Since we *do* register vaddr 0 in our memory
unit tests, let's switch the uninitialized vaddr value to
something different - like UINT64_MAX.

Change-Id: I9c902165e76155e068642abb9a656f3ae8ca1105
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428713
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Moved them from test/env/vtophys to test/env/memory
because that's where they should land in the first
place. Our memory ut already tests allocating a mem_map,
so just extend it with an extra test case now. Since we're
a unit test rather than a fully-fledged SPDK app, we can
simplify the code a lot now - we no longer have any memory
(hugepages) registered at the beginning of our test case,
hence we no longer need to alloc multiple dummy mem maps to
iterate through all registrations - we can simply hardcode
and predict which registrations are there.

Change-Id: I82cd00ea2ad2370bdc63846874885f8c55e11d53
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428714
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Prevents us from unregistering two or more separately
registered regions with a single notification.

This fixes an ibv_mr leak in RDMA. When multiple registrations
were unregistered with a single notification, only the first
ibv_mr one would be freed and the remaining memory would
possibly still remain DMA-able.

As of now, unregistering multiple complete regions with
a single unregister call is not possible. It will be
implemented later, after the rest of the code is cleaned up.

Change-Id: I7d61867fa61fd7a4a8a644ff45cab17125d63e1b
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/425555
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Removed the reference count from the registrations map.

Although technically supported, registering a single memory
region more than once had a lot of unhandled cases and could
easily lead to a segfault.

RDMA maps require all memory to be unregistered in the same
chunks the memory was registered, which is often impossible
to achieve if a region was registered more than once:

1. register region    0x0 - 0x3 -> it gets mapped to
                                   a single ibv_mr
2. register region    0x1 - 0x2 -> nothing happens, this region
                                   is already registered
3. unregister region  0x0 - 0x3 -> 0x0-0x1 gets unregistered as
                                   one region. 0x2-0x3 gets
                                   unregistered as another
                                   (leading to a segfault in
                                   the current RDMA implementation)

The problem is that the last two regions share the same ibv_mr,
which SPDK tries to free twice. The second free causes a segfault.
vtophys map handles this case by registering each 2MB chunk
separately, but this solution cannot be applied for RDMA, as
NICs put a limitation (~2048) on the number of regions registered.

Another option is to keep a refcount of each ibv_mr allocated,
and free it only when the entire region was unregistered from the
SPDK mem map. This is however very tricky and RDMAmojo mentions
that freeing a memory buffer before unregistering its ibv_mr
may lead to a segfault.

Change-Id: I545c56e24ffa55bda211dea22aeb8a55d9631fe5
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/426085
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Added sanity checks to prevent unregistering a memory
range that wasn't registered as a one, complete region.

Change-Id: I819b57560b2e48b0802113ffff9f72949d7a148a
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/425556
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
All mem_maps will still receive separate unregister
notification for each registered region, but the
public memory unregister API is more flexible now.

This follows the VFIO_TYPE1v2_IOMMU interface, which
allows the same.

Change-Id: Ifc008afdc6bff39d9b3b4c892c379ade10c3098e
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428715
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
The problem with registering the entire hotplugged memory
region is that it won't necessarily be unregistered in one
go. Registering each hugepage separately solves that
problem.

This puts a limitation on the number of pages that can
be allocated when using RDMA. We'll hopefully lift this
limitation sometime in the future - probably by leveraging
ibv_rereg_mr, but for now we'll have to resort to either:

 a) using 1GB hugepages
 b) preallocating memory (with [-s|--mem-size <size>] app
    param) as it will be registered as just one region no
    matter what size it is. This memory won't be returned
    to the system until the SPDK app exits.

Change-Id: I6de997fb4901b772730ba6fe995dcc0640b85749
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428716
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
The ENV_LINKER_ARGS was employing both the linker --[start|end]-group
and --[whole|no-whole]-archive options around the DPDK_LIBs. With
the use of whole/no-whole, the start/end bracketing is unnecessary.

Change-Id: I97a2ac22df8c6b48ba674b9b292f5eea01823901
Signed-off-by: Lance Hartmann <lance.hartmann@oracle.com>
Reviewed-on: https://review.gerrithub.io/428737
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
It's a C library for clients to call RPC methods.

Change-Id: I5378747bd9dab83a41801225ba794b3910d1f5a5
Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com>
Reviewed-on: https://review.gerrithub.io/424061
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I9e0fc92e422de3fc65c5048a63f4c7dcc46f7324
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428727
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Also add some comments.

Change-Id: I97c3a44f97aa3dadc114005c10bec83ae75994cf
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428728
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
While more verbose, this makes it much more obvious that
an array of SGL elements is being filled out.

Change-Id: I98b8e5d46af32c5d7dbb990e267fdfd594942081
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428729
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
This makes this particular function consistent
with all of the other functions in this file, and
I feel it is slightly more readable.

Change-Id: I99ace5b9eb45b0f706ca85a64b155444f45c9815
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428730
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
If the ipsec submodule is registered to spdk, an empty intel-ipsec-mb
directory will be created. We could potentially try to run make inside
of this empty directory, so instead do a preemptive submodule update.

Change-Id: I367fdef468bf21ef91b8354155d199cea97c3daa
Signed-off-by: Seth Howell <seth.howell@intel.com>
Reviewed-on: https://review.gerrithub.io/428404
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Due to the change of the default Python interpreter to Python 3
we need to decode the bytes object from check_output()
to utf-8, otherwise there is an error.

Change-Id: I83e2d79ec8c3934c5c6d00768288fbb4c5a50914
Signed-off-by: Karol Latecki <karol.latecki@intel.com>
Reviewed-on: https://review.gerrithub.io/428172
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
ShaharSalzman-K and others added 21 commits October 12, 2018 22:50
Initialize fields later assumed to be NULL

Change-Id: I61e054dd275c6c04fb3f826adc445e56f0add331
Signed-off-by: shahar salzman <shahar.salzman@kaminario.com>
Reviewed-on: https://review.gerrithub.io/428304
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Detected by ASAN


Change-Id: I49f160ddc20334a147f39c39015cb340d29f722b
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-on: https://review.gerrithub.io/429227
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Change-Id: Idcdaeb5603c5fbe369884ced52e569cc3149be39
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-on: https://review.gerrithub.io/429228
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This variable will be used for something more than
just building SPDK, namely installing the freebsd
contigmem kernel module.

Note: Installing a module requires root privileges
and can't be done as a part of autobuild.

Change-Id: I45cc797493cc4ff22c1f8d0dd5e4e56642d54d11
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/429186
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
While here, change spdk_lib_list_to_files to
spdk_lib_list_to_static_libs to differentiate it from
the new spdk_lib_list_to_shared_libs.

Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I6e5913addfbdd556fae2451d4e2b2c43feaf33ab

Reviewed-on: https://review.gerrithub.io/429286
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Lance Hartmann <lance.hartmann@oracle.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
This function doesn't return an error code

Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Change-Id: I67a8fa7393990470e509baa8934e78bc6f6a6c9e

Reviewed-on: https://review.gerrithub.io/429441
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Change-Id: I875cc9d6a6bd1e9e9ac25ca9103a2070226ac236

Reviewed-on: https://review.gerrithub.io/428877
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This patch sets optimal_io_boundary to cluster size, so that splitting
happens in bdev layer rather than blobstore layer.

Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Change-Id: I0230cb4a188d605845a709e9c3c9061e822ef0f5

Reviewed-on: https://review.gerrithub.io/428065
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-by: Maciej Szwed <maciej.szwed@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Change-Id: I499f54b025080ad1916acc0cf265a58c806da002
Signed-off-by: Pawel Kaminski <pawelx.kaminski@intel.com>
Reviewed-on: https://review.gerrithub.io/428494
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Change-Id: Idba8ad8afbf92c493d84271fd34443877993997a
Signed-off-by: shahar salzman <shahar.salzman@kaminario.com>
Reviewed-on: https://review.gerrithub.io/428305
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Calling spdk_clear_all_transfer_task alone cannot solve all
the hotplug issues. The iSCSI task may successfully return
while still owning the bdev buffer, so we need to
call this flush PDU function as well.

Change-Id: I255173d0880334e8acccc980a4ce04c380f64435
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Reviewed-on: https://review.gerrithub.io/428801
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reason: For connect, we use non-blocking mode on
the initiator side, but we do not do it for
the accepted fd on the server side, which can
cause writev to never return. This patch fixes that.

PS: SPDK uses non-blocking mode by default.

Change-Id: I709574573a089c2e63ca079829945e864d9f20c2
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Reviewed-on: https://review.gerrithub.io/428654
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: GangCao <gang.cao@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Change-Id: I189ad8889c74937bf43bcf2c3029416ddb94976d
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-on: https://review.gerrithub.io/425705
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Xiaodong Liu <xiaodong.liu@intel.com>
Reviewed-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-by: GangCao <gang.cao@intel.com>
With Identify Namespace Identification Descriptors now executed
asynchronously, most of the functions in the controller
initialization can be executed asynchronously; for
hosts with multiple controllers this can save some time during
initialization.

Change-Id: I70e3c6c2c691134d2ae4c5969288cced1538c6cc
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-on: https://review.gerrithub.io/428585
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: GangCao <gang.cao@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This leaves files created by the root user in the directory and
makes future calls to make clean fail.

Change-Id: Ie33d0d33e8c01a2d17f6991284c5118b5bd545ff
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/429282
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Previously, the difference between configuring and offline was
unclear; this patch fixes that. The key difference is whether
the raid bdev has ever been registered: offline means it was registered
before but is unregistered now, while configuring means it has never been registered.

According to the above, we should never set a configuring raid bdev to
offline because it never got registered.

Change-Id: Id44ef6654e032993ffb8444e7e7ae3e43a9b0f16
Signed-off-by: wuzhouhui <wuzhouhui@kingsoft.com>
Reviewed-on: https://review.gerrithub.io/428321
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
If raid bdev creation failed, the bdev is still configuring and not
registered. raid_bdev_remove_base_bdev() should clean up those
raid bdevs as well.

Change-Id: If2eda8ec80e7fdeb5e551fafe57a43a27ae0f9e6
Signed-off-by: wuzhouhui <wuzhouhui@kingsoft.com>
Reviewed-on: https://review.gerrithub.io/427331
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
This will resolve out-of-space errors that have cropped
up as SPDK continues to grow.  There's no need to copy
*.o files to the mounted filesystem - we 'make clean'
right after the rsync anyways.

Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I6844183c527953fd4b3329f04171f05e503b04dc

Reviewed-on: https://review.gerrithub.io/429517
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Fix variable name added in patch:
https://review.gerrithub.io/#/c/spdk/spdk/+/429049/

Change-Id: I0349dfd16f784a0cc92ff64beae3389c1de8b55c
Signed-off-by: Pawel Niedzwiecki <pawelx.niedzwiecki@intel.com>
Reviewed-on: https://review.gerrithub.io/429485
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This is a new feature for the NVMe-oF RDMA target, intended to save resource allocation
(by sharing resources) and utilize locality (completions and memory) to get the best
performance with Shared Receive Queues (SRQs). We'll create one SRQ per core (poll group)
per device and associate each created QP/CQ with an appropriate SRQ.

Our testing environment has 2 hosts.
Host 1:
  CPU: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz dual socket (8 cores total)
  Network: ConnectX-5, ConnectX-5 VPI , 100GbE, single-port QSFP28, PCIe3.0 x16
  Disk: Intel Optane SSD 900P Series
  OS: Fedora 27 x86_64
Host 2:
  CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz dual-socket (24 cores total)
  Network: ConnectX-4 VPI , 100GbE, dual-port QSFP28
  Disk: Intel Optane SSD 900P Series
  OS : CentOS 7.5.1804 x86_64
Hosts are connected via Spectrum switch.
Host 1 is running SPDK NVMeoF target. Host 2 is used as initiator running fio with SPDK plugin.

Configuration:
- SPDK NVMeoF target: cpu mask 0x0F (4 cores), max queue depth 128, max SRQ depth 1024, max QPs per controller 1024
- Single NVMf subsystem with single namespace backed by physical SSD disk
- fio with SPDK plugin: randread pattern, 1-256 jobs, block size 4k, IO depth 16, cpu_mask 0xFFF0, IO rate 10k, rate process “poisson”

Here is a full fio command line:
fio  --name=Job --stats=1 --group_reporting=1 --idle-prof=percpu --loops=1 --numjobs=1 --thread=1 --time_based=1 --runtime=30s --ramp_time=5s --bs=4k --size=4G --iodepth=16 --readwrite=randread --rwmixread=75 --randrepeat=1 --ioengine=spdk --direct=1 --gtod_reduce=0 --cpumask=0xFFF0 --rate_iops=10k --rate_process=poisson --filename=trtype=RDMA adrfam=IPv4 traddr=1.1.79.1 trsvcid=4420 ns=1

SPDK allocates the following entities for every work request in receive queue (shared or not): reqs (1024 bytes), recvs (96 bytes), cmds (64 bytes), cpls (16 bytes), in_capsule_buffer. All except the last one are fixed size. In capsule data size is configured to 4096.
Memory consumption calculation (target):
- Multiple SRQ: core_num * ib_devs_num * SRQ_depth * (1200 + in_capsule_data_size)
- Multiple RQ: queue_num * RQ_depth * (1200 + in_capsule_data_size)
We ignore admin queues in calculations for simplicity.

Cases:
1. Multiple SRQ with 1024 entries:
    - Mem = 4 * 1 * 1024 * (1200 + 4096) = 20.7 MiB (Constant number – does not depend on initiators number)
2. RQ with 128 entries for 64 initiators:
    - Mem = 64 * 128 * (1200 + 4096) = 41.4 MiB

Results:
FIO_JOBS       kIOPS       Bandwidth, MiB/s     AvgLatency, us     MaxResidentSize, kiB
           RQ       SRQ      RQ       SRQ       RQ         SRQ        RQ        SRQ
1          8.623    8.623    33.7     33.7      13.89      14.03      144376    155624
2          17.3     17.3     67.4     67.4      14.03      14.1       145776    155700
4          34.5     34.5     135      135       14.15      14.23      146540    156184
8          69.1     69.1     270      270       14.64      14.49      148116    156960
16         138      138      540      540       14.84      15.38      151216    158668
32         276      276      1079     1079      16.5       16.61      157560    161936
64         513      502      2005     1960      1673.31    1612.38    170408    168440
128        535      526      2092     2054      3329.79    3344.03    195796    181524
256        571      571      2232     2233      6854.57    6873.37    246484    207856

We can see the benefit in memory consumption.

The drawback of using SRQ is a risk of RNR errors when multiple clients initiate a large number of IOs simultaneously.
In "RQ per QP" mode this is handled by Submission Queue flow control, and RNR is not possible.
This patch does not contain any changes to solve RNR issue but we see at least two options here:
- try to increase the RNR retry count to more than 0 which is now hardcoded and make it a configurable parameter.
- implement some mechanism for dynamic SRQ extension. It may scale with number of IO queues or when it reaches some threshold.

Change-Id: I40c70f6ccbad7754918bcc6cb397e955b09d1033
Signed-off-by: Evgeniy Kochetov <evgeniik@mellanox.com>
AlekseyMarchuk pushed a commit that referenced this pull request Jul 2, 2019
For VMD driver we'll need to introduce some way of
iterating over all spdk pci device objects and we would
like to achieve that with simple spdk_pci_get_first_dev()/get_next_dev()
APIs. To make it thread safe though, we would have to
expose some public pci mutex to be locked around the
iteration and we don't want to do that, so we'll make
PCI APIs usable from only a single thread - this will
prevent any pci devices from being removed in between
subsequent get_first/get_next calls.

We currently have the following players accessing pci
device state:
 1) public APIs, obviously (on any thread right now)
 2) VFIO hotremove callback (dpdk interrupt thread)
 3) rte_eal_alarm for detaching rte_pci_devices (dpdk
    interrupt thread)
 4) DPDK hotplug IPC (dpdk interrupt thread)

There is g_pci_mutex providing the thread safety, but
even today it doesn't protect #3 and #4, making the
entire pci layer prone to data corruption.

To make #3 and #4 safe, we would have to lock inside
device init/fini callbacks (spdk_pci_device_init/fini),
but those are called directly inside the public device
attach/detach functions which already lock.

So now, with the decision to drop thread safety from
public pci APIs, we narrow down the locks inside public
functions and introduce locks inside those lower-level
init/fini callbacks.

Change-Id: I5dcbc9cdcbab65ee76cd3c42890f596069ec9a8a
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/c/spdk/spdk/+/458930
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
@EugeneKochetov EugeneKochetov deleted the dev/nvmf_offload branch December 12, 2019 07:57
allen-mlnx pushed a commit that referenced this pull request Aug 18, 2020
Not all JSON methods require the 'params' field to be supplied.
Verification of the JSON is done on the server side in
parse_single_request().

We should not attempt to process garbage values for a correct
JSON config file during app start.

Segfault can be observed if following valid JSON config is supplied:
{
	"method": "framework_wait_init"
}
Resulting in:
json_config.c:388:13: runtime error: applying non-zero offset 18446744073709551600 to null pointer
AddressSanitizer:DEADLYSIGNAL
=================================================================
==3386067==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0000007260ff bp 0x7ffe6ea06890 sp 0x7ffe6ea067e0 T0)
==3386067==The signal is caused by a READ memory access.
==3386067==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
    #0 0x7260ff in app_json_config_load_subsystem_config_entry /home/tzawadzk/spdk/lib/event/json_config.c:391
    #1 0x7cbb13 in msg_queue_run_batch /home/tzawadzk/spdk/lib/thread/thread.c:505
    #2 0x7cd00a in thread_poll /home/tzawadzk/spdk/lib/thread/thread.c:581
    #3 0x7cfe18 in spdk_thread_poll /home/tzawadzk/spdk/lib/thread/thread.c:689
    #4 0x71d6ef in _reactor_run /home/tzawadzk/spdk/lib/event/reactor.c:326
    #5 0x71eb00 in reactor_run /home/tzawadzk/spdk/lib/event/reactor.c:382
    #6 0x71f911 in spdk_reactors_start /home/tzawadzk/spdk/lib/event/reactor.c:477
    #7 0x718237 in spdk_app_start /home/tzawadzk/spdk/lib/event/app.c:691
    #8 0x407e94 in main /home/tzawadzk/spdk/app/spdk_tgt/spdk_tgt.c:120
    #9 0x7f0f2eef2041 in __libc_start_main ../csu/libc-start.c:308
    #10 0x4079ad in _start (/home/tzawadzk/spdk/build/bin/spdk_tgt+0x4079ad)

Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I7ef1a764467817ad788fdf5dbe17eaeb99dcc22e
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3256
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
AlekseyMarchuk pushed a commit that referenced this pull request Oct 6, 2021
The bdev layer doesn't call the destruct callback until
all channels have been released, but because the channel
delete callback passes message to the main thread, we can
end up with a complicated race condition.  Currently we
have a deferred_free code path to handle this race, but
we can handle this a bit more cleanly by doing the
construct operation on the main_td as well.

This also simplifies the next patch which will
asynchronously destruct the bdev to fix an RPC bug.

Here's the race:

1) first channel was created on thread A, so disk->main_td = thread A
2) second channel was created on thread B
3) first channel is freed (but disk->main_td is still thread A)
4) spdk_bdev_unregister is called on thread C
5) bdev layer gives callback on thread B to upper layer
6) upper layer on thread B frees channel
7) bdev_rbd_destroy_cb runs on thread B and has to send msg to thread A
   for processing
8) bdev layer calls bdev_rbd_destruct on thread C (since step #4 was on
   thread C)

Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I25ede2dc56e24dac0919aed05b9def2560823ee7
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/9158
Community-CI: Broadcom CI <spdk-ci.pdl@broadcom.com>
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Ziye Yang <ziye.yang@intel.com>
EugeneKochetov pushed a commit that referenced this pull request May 29, 2022
The controller data structure may be freed before the subsystem resume done
callback; we can take the endpoint as the input parameter to avoid this issue.

AddressSanitizer: heap-use-after-free on address 0x625000046100 at pc 0x00000082818f bp 0x7fff7b09bd10 sp 0x7fff7b09bd00
READ of size 8 at 0x625000046100 thread T0 (reactor_0)
    #0 0x82818e in vfio_user_dev_quiesce_resume_done /spdk/lib/nvmf/vfio_user.c:2147
    #1 0x782cc0 in subsystem_state_change_done /spdk/lib/nvmf/subsystem.c:634
    #2 0xad047b in _call_completion /spdk/lib/thread/thread.c:2344
    #3 0xabc48d in msg_queue_run_batch /spdk/lib/thread/thread.c:710
    #4 0xac0670 in thread_poll /spdk/lib/thread/thread.c:926
    #5 0xac0ead in spdk_thread_poll /spdk/lib/thread/thread.c:986
    #6 0x9a5b4f in _reactor_run /spdk/lib/event/reactor.c:920
    #7 0x9a6442 in reactor_run /spdk/lib/event/reactor.c:958
    #8 0x9a717c in spdk_reactors_start /spdk/lib/event/reactor.c:1060
    #9 0x99884a in spdk_app_start /spdk/lib/event/app.c:643
    #10 0x407e82 in main /spdk/app/nvmf_tgt/nvmf_main.c:75
    #11 0x7f822095ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)
    #12 0x407abd in _start (/spdk/build/bin/nvmf_tgt+0x407abd)

0x625000046100 is located 0 bytes inside of 8320-byte region [0x625000046100,0x625000048180)
freed by thread T0 (reactor_0) here:
    #0 0x7f82219ff91f in __interceptor_free (/lib64/libasan.so.5+0x10d91f)
    #1 0x837059 in _free_ctrlr /spdk/lib/nvmf/vfio_user.c:2976
    #2 0x837327 in free_ctrlr /spdk/lib/nvmf/vfio_user.c:2996
    #3 0x843541 in nvmf_vfio_user_close_qpair /spdk/lib/nvmf/vfio_user.c:3742
    #4 0x7d1d91 in nvmf_transport_qpair_fini /spdk/lib/nvmf/transport.c:604
    #5 0x7ad922 in _nvmf_qpair_destroy /spdk/lib/nvmf/nvmf.c:1055
    #6 0x761362 in nvmf_qpair_request_cleanup /spdk/lib/nvmf/ctrlr.c:4026
    #7 0x761906 in spdk_nvmf_request_free /spdk/lib/nvmf/ctrlr.c:4041
    #8 0x75a931 in nvmf_qpair_free_aer /spdk/lib/nvmf/ctrlr.c:3576
    #9 0x7ae626 in spdk_nvmf_qpair_disconnect /spdk/lib/nvmf/nvmf.c:1127
    #10 0x83db36 in _vfio_user_qpair_disconnect /spdk/lib/nvmf/vfio_user.c:3433
    #11 0xabc48d in msg_queue_run_batch /spdk/lib/thread/thread.c:710
    #12 0xac0670 in thread_poll /spdk/lib/thread/thread.c:926
    #13 0xac0ead in spdk_thread_poll /spdk/lib/thread/thread.c:986
    #14 0x9a5b4f in _reactor_run /spdk/lib/event/reactor.c:920
    #15 0x9a6442 in reactor_run /spdk/lib/event/reactor.c:958
    #16 0x9a717c in spdk_reactors_start /spdk/lib/event/reactor.c:1060
    #17 0x99884a in spdk_app_start /spdk/lib/event/app.c:643
    #18 0x407e82 in main /spdk/app/nvmf_tgt/nvmf_main.c:75
    #19 0x7f822095ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

previously allocated by thread T0 (reactor_0) here:
    #0 0x7f82219fff16 in __interceptor_calloc (/lib64/libasan.so.5+0x10df16)
    #1 0x837413 in nvmf_vfio_user_create_ctrlr /spdk/lib/nvmf/vfio_user.c:3010
    #2 0x83bc68 in nvmf_vfio_user_accept /spdk/lib/nvmf/vfio_user.c:3313
    #3 0xabfbd8 in thread_execute_timed_poller /spdk/lib/thread/thread.c:872
    #4 0xac0c75 in thread_poll /spdk/lib/thread/thread.c:960
    #5 0xac0ead in spdk_thread_poll /spdk/lib/thread/thread.c:986
    #6 0x9a5b4f in _reactor_run /spdk/lib/event/reactor.c:920
    #7 0x9a6442 in reactor_run /spdk/lib/event/reactor.c:958
    #8 0x9a717c in spdk_reactors_start /spdk/lib/event/reactor.c:1060
    #9 0x99884a in spdk_app_start /spdk/lib/event/app.c:643
    #10 0x407e82 in main /spdk/app/nvmf_tgt/nvmf_main.c:75
    #11 0x7f822095ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

SUMMARY: AddressSanitizer: heap-use-after-free /spdk/lib/nvmf/vfio_user.c:2147 in vfio_user_dev_quiesce_resume_done

Change-Id: Icf5e5b360b9107a3c5eb960ae59b7fe10ace1c66
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11420
Community-CI: Broadcom CI <spdk-ci.pdl@broadcom.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Dong Yi <dongx.yi@intel.com>
Reviewed-by: John Levon <levon@movementarian.org>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
yshestakov pushed a commit that referenced this pull request Sep 19, 2023
Ubsan with clang complains when using spdk_ioviter with more iters than
declared in the array:

  iov.c:69:9: runtime error: index 3 out of bounds for type 'struct spdk_single_ioviter[2]'
  #0 0x5df709 in spdk_ioviter_firstv /home/vagrant/spdk_repo/spdk/lib/util/iov.c:69:9
  #1 0x53780b in raid5f_xor_stripe /home/vagrant/spdk_repo/spdk/module/bdev/raid/raid5f.c:270:24
  #2 0x531bd8 in raid5f_submit_write_request /home/vagrant/spdk_repo/spdk/module/bdev/raid/raid5f.c:520:2
  #3 0x52a03a in raid5f_submit_rw_request /home/vagrant/spdk_repo/spdk/module/bdev/raid/raid5f.c:596:9
  #4 0x548c17 in test_raid5f_write_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:550:2
  #5 0x544e18 in test_raid5f_submit_rw_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:714:3
  #6 0x553e61 in __test_raid5f_submit_full_stripe_write_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:878:3
  #7 0x543f84 in run_for_each_raid5f_config /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:748:3
  #8 0x527ac1 in test_raid5f_submit_full_stripe_write_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:885:2
  #9 0x7f4a71a0960a  (/usr/lib64/libcunit.so.1+0x460a) (BuildId: 9c82dd336cbccd99721651ac0a04435e746e0fc0)
  #10 0x7f4a71a09937  (/usr/lib64/libcunit.so.1+0x4937) (BuildId: 9c82dd336cbccd99721651ac0a04435e746e0fc0)
  #11 0x7f4a71a0a897 in CU_run_all_tests (/usr/lib64/libcunit.so.1+0x5897) (BuildId: 9c82dd336cbccd99721651ac0a04435e746e0fc0)
  #12 0x524fe8 in main /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:1006:2
  #13 0x7f4a711d750f in __libc_start_call_main (/usr/lib64/libc.so.6+0x2750f) (BuildId: 81daba31ee66dbd63efdc4252a872949d874d136)
  #14 0x7f4a711d75c8 in __libc_start_main@GLIBC_2.2.5 (/usr/lib64/libc.so.6+0x275c8) (BuildId: 81daba31ee66dbd63efdc4252a872949d874d136)
  #15 0x4235b4 in _start (/home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut+0x4235b4) (BuildId: 028d075edd1a7cd17881fd678ef076adfdbac13d)

Fix this by making iters a zero-length array and putting it in a union with a
two-element array to keep the default size for compatibility.

Change-Id: I8573b015755e9986cdadbfa1705d269d51a7c2b7
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/18402
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
EugeneKochetov pushed a commit that referenced this pull request Apr 9, 2024
As per typedef in nvme.h the spdk_nvme_cpl argument should be a
pointer to a const struct.

This fixes a runtime error under clang >= 17.x which now makes the
-fsanitize=function available for C and which on our end is being
enabled via -fsanitize=undefined under UBSAN.

Error in question:

 Test: test_spdk_nvme_detach ...passed
  Test: test_nvme_completion_poll_cb ...passed
  Test: test_nvme_user_copy_cmd_complete
.../root/spdk/lib/nvme/nvme.c:417:2: runtime error: call to function
dummy_cb through pointer to incorrect function type 'void (*)(void *,
const struct spdk_nvme_cpl *)'
/root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut.c:584: note: dummy_cb
defined here
    #0 0x5098e0 in nvme_user_copy_cmd_complete
       /root/spdk/lib/nvme/nvme.c:417:2
    #1 0x532161 in test_nvme_user_copy_cmd_complete
       /root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut.c:604:2
    #2 0x7f08c952266a  (/usr/lib64/libcunit.so.1+0x466a) (BuildId:
       d99e3b60795f2ce01ada820b4b7e3cd84d8150fe)
    #3 0x7f08c95229c7  (/usr/lib64/libcunit.so.1+0x49c7) (BuildId:
       d99e3b60795f2ce01ada820b4b7e3cd84d8150fe)
    #4 0x7f08c9523a9f in CU_run_all_tests
       (/usr/lib64/libcunit.so.1+0x5a9f) (BuildId:
d99e3b60795f2ce01ada820b4b7e3cd84d8150fe)
    #5 0x55555e in run_tests /root/spdk/lib/ut/ut.c:169:3
    #6 0x552aec in spdk_ut_run_tests /root/spdk/lib/ut/ut.c:225:8
    #7 0x522d52 in main
       /root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut.c:1664:17
    #8 0x7f08c8c28149 in __libc_start_call_main
       (/usr/lib64/libc.so.6+0x28149) (BuildId:
7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #9 0x7f08c8c2820a in __libc_start_main@GLIBC_2.2.5
       (/usr/lib64/libc.so.6+0x2820a) (BuildId:
7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #10 0x42b6a4 in _start
	(/root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut+0x42b6a4)
(BuildId: 6fc2caaf777030becad2d0f660ec68443f3380b4)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
/root/spdk/lib/nvme/nvme.c:417:2 in
./test/unit/unittest.sh: line 85: 75349 Aborted                 (core
dumped) $valgrind $testdir/lib/nvme/nvme.c/nvme_ut

Change-Id: Iddbd5fc0dee0ef6a6cc1f032e079f6119e76aed9
Signed-off-by: Michal Berger <michal.berger@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/22025
Reviewed-by: Jim Harris <jim.harris@samsung.com>
Community-CI: Mellanox Build Bot
Reviewed-by: Konrad Sztyber <konrad.sztyber@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
EugeneKochetov pushed a commit that referenced this pull request Aug 12, 2024
This is a cumulative patch that addresses the comments on the mainline
SPDK CR ([1]) and specifically - the naming related one ([2]).

[1] https://review.spdk.io/gerrit/c/spdk/spdk/+/22511/
[2] https://review.spdk.io/gerrit/c/spdk/spdk/+/22511/comment/296bf45a_eabb58af/

Change-Id: I78ab7b8409ff7be513c1fb18af32c8637500a848
Signed-off-by: Anton Nayshtut <anayshtut@nvidia.com>