
NVMf offload #4

Closed
wants to merge 52 commits into from

Conversation

EugeneKochetov

The first version of NVMf offload implementation in SPDK. Not buildable yet and breaks some abstractions.

@@ -363,6 +392,9 @@ struct spdk_nvmf_rdma_poll_group {
/* Assuming rdma_cm uses just one protection domain per ibv_context. */
struct spdk_nvmf_rdma_device {
struct ibv_device_attr attr;
#ifdef SPDK_CONFIG_NVMF_OFFLOAD


I think ibv_device_attr_ex shouldn't depend on SPDK_CONFIG_NVMF_OFFLOAD. We can reuse it for other features.
It's better to add it under something like SPDK_EXTEND_OFED.

@sashakot sashakot added the WIP Work in progress label Oct 10, 2018
@sashakot

Hardware NVMe-oF target offload

Introduction

NVMe-over-Fabrics target offload allows an HCA to offload the complete NVMe-oF
protocol datapath at the target (storage server) side, when the backend storage
devices are locally attached NVMe PCI devices.

After correctly setting up the offload and connections to clients, every
read/write/flush operation may be completely processed in the HCA at the target
side. No software runs on the CPU to process those offloaded IO operations.
The HCA uses the PCIe peer-to-peer capability to "talk" directly to the NVMe
drives over PCI, so the system architecture must allow such peer-to-peer
communication.

The software at the server side is in charge of configuring the feature and
managing the NVMe-oF control communication with the clients (via the NVMe-oF admin
queue). In response to these communications, connected QPs are created with each
client (by means of RDMA-CM, as defined in the NVMe-oF standard), which
represent NVMe-oF SQ and CQ pairs. Once a connection is created, the QP is
handed to the device to start offloading all IO commands.

Software is also required to handle any error cases, and IO commands that were
not configured to be offloaded.

NVMe-oF target offload datapath

Once properly configured and connections are established, the HCA will:

  • Parse the RECVed NVMe-oF command capsule and determine whether it is a READ
    / WRITE / FLUSH operation that should be offloaded.
  • If this is a WRITE, the HCA will RDMA_READ the data from the client to local
    memory (unless it was inline with the command).
  • The HCA will strip the NVMe command from the capsule, place it in an NVMe
    submit queue, and write to the submit queue doorbell.
  • The HCA will poll the NVMe completion queue, and write to the completion queue
    doorbell.
  • If this is a READ, the HCA will RDMA_WRITE the data from the local memory to
    the client.
  • The HCA will SEND the NVMe completion in a response capsule back to the
    client.

NVMe-oF target offload configuration

Setting up NVMe-oF target offload requires a few steps:

  1. Identify NVMe-oF offload capabilities of the device
  2. Creating a SRQ with NVMe-oF offload attributes, to represent a single NVMe-oF
    subsystem
  3. Creating NVMe backend device objects to represent locally attached NVMe
    subsystems
  4. Setting up mappings between front-end facing namespace ids to a specific
    backend NVMe objects and namespace ids
  5. Creating QPs connected with clients (using RDMA-CM, not in the scope of this
    document), bound to an SRQ with NVMe-oF offload
  6. Modifying QP to enable NVMe-oF offload

Identify NVMe-oF offload capabilities

Software should call ibv_query_device_ex() and test the returned
ibv_device_attr_ex.comp_mask for the availability of NVMe-oF offload. If
available, the ibv_device_attr_ex.nvmf_caps struct holds the exact offload
capabilities and parameters of the device. These should be considered later
during the configuration.
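
As a rough illustration, a capability check could look like the sketch below. ibv_query_device_ex() and ibv_device_attr_ex are standard verbs; the IBV_DEVICE_ATTR_NVMF comp_mask bit and the nvmf_caps member are assumed names for the proposed extension, not taken from a released header.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Sketch of the capability check for the proposed NVMe-oF offload extension.
 * IBV_DEVICE_ATTR_NVMF and attr_ex.nvmf_caps are placeholders. */
static int nvmf_offload_supported(struct ibv_context *ctx)
{
    struct ibv_device_attr_ex attr_ex = {};

    if (ibv_query_device_ex(ctx, NULL, &attr_ex)) {
        perror("ibv_query_device_ex");
        return 0;
    }

    if (!(attr_ex.comp_mask & IBV_DEVICE_ATTR_NVMF)) {  /* assumed bit name */
        fprintf(stderr, "NVMe-oF target offload is not available\n");
        return 0;
    }

    /* attr_ex.nvmf_caps (assumed member) now holds the offload limits that
     * must be respected in the later configuration steps. */
    return 1;
}
```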

Creating a SRQ with NVMe-oF offload attributes

An SRQ with NVMe-oF target offload represents a single NVMe-oF subsystem (a
storage target) to the fabric. Software should call ibv_create_srq_ex() with
ibv_srq_init_attr_ex.srq_type set to nvmf_target_offload and
ibv_srq_init_attr_ex.nvmf_attr set to the specific offload parameters
requested for this SRQ. Parameters should be within the boundaries of the
respective capabilities. Along with the parameters, a staging buffer is provided
for the device to use during the offload. This is a piece of memory allocated,
registered and provided via {mr, addr, len}. Software should not modify this
memory after creating the SRQ: the device manages it by itself and uses it to
store data that is in transit between the network and the NVMe device.

Note that this SRQ still has a receive queue. The HCA will deliver to software
received commands that are not offloaded, as well as commands received on QPs
attached to the SRQ that do not have NVMF_OFFLOAD enabled.
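
A minimal sketch of the SRQ creation follows, assuming an IBV_SRQT_NVMF srq_type value and illustrative nvmf_attr field names; the document only specifies that the staging buffer is passed as {mr, addr, len}, so the exact names are assumptions.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: creating the NVMe-oF offload SRQ. IBV_SRQT_NVMF and the nvmf_attr
 * field names are assumptions; ibv_reg_mr(), ibv_create_srq_ex() and the
 * IBV_SRQ_INIT_ATTR_* comp_mask bits are standard verbs. */
static struct ibv_srq *create_nvmf_srq(struct ibv_context *ctx,
                                       struct ibv_pd *pd,
                                       void *staging, size_t staging_len)
{
    struct ibv_mr *staging_mr;
    struct ibv_srq_init_attr_ex attr = {};

    /* The staging buffer must be registered before the SRQ is created and
     * must not be touched by software afterwards. */
    staging_mr = ibv_reg_mr(pd, staging, staging_len, IBV_ACCESS_LOCAL_WRITE);
    if (!staging_mr) {
        return NULL;
    }

    attr.attr.max_wr = 1024;
    attr.attr.max_sge = 1;
    attr.comp_mask = IBV_SRQ_INIT_ATTR_TYPE | IBV_SRQ_INIT_ATTR_PD;
    attr.srq_type = IBV_SRQT_NVMF;                   /* assumed enum value */
    attr.pd = pd;
    /* A dedicated comp_mask bit for nvmf_attr would presumably be needed
     * as well; the field names below are illustrative only. */
    attr.nvmf_attr.staging_buf_mr = staging_mr;
    attr.nvmf_attr.staging_buf_addr = (uintptr_t)staging;
    attr.nvmf_attr.staging_buf_len = staging_len;

    return ibv_create_srq_ex(ctx, &attr);
}
```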

Creating NVMe backend device objects

For the SRQ with NVMe-oF target offload feature to be able to submit work to
attached NVMe devices, software must provide the details of where to find the
NVMe submit queue, completion queue and their respective doorbells. How these
NVMe SQ, CQ and DBs are created is out of scope for this document. Normally
there should be an NVMe driver that owns the NVMe Admin Queue. By submitting
commands to this Admin Queue, SQs, CQs and DBs are generated. Software should
call ibv_srq_create_nvme_ctrl() with a set of NVMe {SQ, CQ, SQDB, CQDB} to
create an ibv_nvme_ctrl instance representing a specific NVMe backend
controller. These {SQ, CQ, SQDB, CQDB} should have been created exclusively for
this NVMe backend controller object, and this NVMe backend controller can be
used exclusively with the SRQ it was created for. SQ, CQ, SQDB and CQDB are all
provided by means of an MR, an address and possibly a length (doorbells don't need
a length as they have a fixed 32-bit size). This means that those structures need to
be registered using ibv_reg_mr() before the ibv_nvme_ctrl can be created.

Additionally, SQDB and CQDB initial values are provided.

Having NVMe objects created on the SRQ does not yet allow servicing NVMe-oF IOs to
clients. Namespace mappings that use these NVMe objects must be added first.

Setting up namespace mappings

When a client connects to an NVMe-oF subsystem, it will ask for the list of
namespaces on that subsystem. Each namespace is identified by a namespace id
(nsid), which is then part of every IO request. The SRQ with the NVMe-oF target
offload feature enabled will look at this nsid and map it to a specific nsid in
one of the NVMe backend objects created with it. Software should call
ibv_map_nvmf_nsid() to add such mappings to an SRQ. Each mapping consists of a
fabric-facing nsid and a set of {nvme_ctrl, nvme_nsid}, so IO operations
arriving from the network for that nsid will be submitted to nvme_ctrl, possibly
with a different nvme_nsid. Software may create as many front-facing namespaces
as needed, and map them to different namespaces within the same nvme_ctrl or to
namespaces in different nvme_ctrls. However, as noted before, an nvme_ctrl may
only be used in mappings for the same SRQ it was created for.

After adding at least one namespace mapping, the SRQ, acting as an NVMe-oF target
subsystem, is ready to service IOs.
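
A sketch of the mapping step; the prototype of ibv_map_nvmf_nsid() is an assumption, since the document only states that each mapping ties a fabric-facing nsid to {nvme_ctrl, nvme_nsid}.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Sketch: expose backend namespace 3 of an ibv_nvme_ctrl as fabric nsid 1.
 * The argument order of ibv_map_nvmf_nsid() is assumed. */
static int map_namespace(struct ibv_srq *srq, struct ibv_nvme_ctrl *nvme_ctrl)
{
    uint32_t fabric_nsid = 1;   /* nsid the NVMe-oF client will see */
    uint32_t backend_nsid = 3;  /* nsid inside the local NVMe controller */

    /* After at least one successful mapping, the subsystem represented by
     * this SRQ can service offloaded IOs for that namespace. */
    return ibv_map_nvmf_nsid(srq, fabric_nsid, nvme_ctrl, backend_nsid);
}
```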

Creating QPs

This stage is no different from any other QP creation and association with an
SRQ. The NVMe-oF standard requires that the first command on a QP (which
represents an NVMe SQ) be the CONNECT command capsule, and that any other
command be responded to with an error. To meet the standard, software should not
enable the QP's NVMe-oF offload (see the next section) until after seeing the
CONNECT command. If a command other than CONNECT is received, software should
respond with an error.

Modifying QP to enable NVMe-oF offload

Once a CONNECT command has been received, software can modify the QP to enable
its NVMe-oF offload using ibv_modify_qp_nvmf(). From this point on, the HCA
takes ownership of the QP and inspects each command capsule received by the
SRQ; if the command should be offloaded, the flow described above is followed.

Note that enabling NVMe-oF offload on the QP at creation time exposes the
solution to a possible standard violation: if an IO command capsule arrives
before the CONNECT request, the device will service it.
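
A hedged sketch of that sequencing; ibv_modify_qp_nvmf() is the proposed call, and whether it takes a simple enable flag or an attribute structure is an assumption.

```c
#include <infiniband/verbs.h>

/* Sketch: enable the offload only after the CONNECT capsule has been handled
 * in software, to stay within the NVMe-oF standard. */
static int enable_qp_offload(struct ibv_qp *qp)
{
    /* From this point on the HCA owns the QP and parses every command capsule
     * that lands in the NVMe-oF offload SRQ the QP is bound to. */
    return ibv_modify_qp_nvmf(qp, 1 /* enable; assumed second argument */);
}
```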

Errors and exceptions

Software should properly handle the following errors and exceptions:

  1. Handle a non-offloaded IO request
  2. Handle async events with QP type
  3. Handle async events with NVME_CTRL type

Handle a non-offloaded IO request

This should be considered a normal exception when the SRQ was configured to
offload only part of the IO requests. In this case, software receives the
completion on the CQ associated with the QP, with the request residing in the
SRQ. Software should process the request; it is allowed to generate RDMA
operations (reads, writes, sends) on the relevant QP in order to properly
terminate the transaction.
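
A sketch of that path is shown below. ibv_poll_cq() and struct ibv_wc are standard verbs; handle_capsule_in_sw() is a hypothetical helper, and how the capsule is located from wr_id depends on how receive buffers were posted to the SRQ.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical helper: parse the non-offloaded capsule and complete it with
 * RDMA read/write/send work requests on the originating QP. */
void handle_capsule_in_sw(uint64_t wr_id, uint32_t qp_num);

/* Sketch: drain non-offloaded commands from the CQ associated with the QP. */
static void poll_non_offloaded(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* Transport error: tear down the connection. */
            continue;
        }
        if (wc.opcode == IBV_WC_RECV) {
            /* wc.wr_id identifies the SRQ receive buffer holding the command
             * capsule that was not offloaded. */
            handle_capsule_in_sw(wc.wr_id, wc.qp_num);
        }
    }
}
```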

Handle async events with QP type

Software should listen for async events using ibv_get_async_event(). If an
unrecoverable transport error happens on one of the offloaded QPs, the QP will
move to the error state and flush its queue. Since in normal operation software
may not post to such a QP or expect completions on it, the HCA will report an
async event indicating that this QP has moved to the error state. Software
should treat this as any other QP in error, i.e. close the connection and
release all its resources.

Handle async events with NVME_CTRL type

In case of an unrecoverable error in HCA communication with an NVMe device, the
HCA will report an async event indicating an error with the NVME_CTRL. Software
is expected to remove this NVMe object and its related mappings.
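
A sketch of the async event loop covering both cases; ibv_get_async_event() and ibv_ack_async_event() are standard verbs calls, and only IBV_EVENT_QP_FATAL below is a standard event type. The controller-level event would carry whatever type the proposed extension defines.

```c
#include <infiniband/verbs.h>

/* Sketch: handle async events for offloaded QPs and NVMe backends. */
static void handle_async_events(struct ibv_context *ctx)
{
    struct ibv_async_event ev;

    while (ibv_get_async_event(ctx, &ev) == 0) {
        switch (ev.event_type) {
        case IBV_EVENT_QP_FATAL:
            /* An offloaded QP moved to the error state: close the connection
             * (ev.element.qp) and free its resources, as for any other QP in
             * error. */
            break;
        default:
            /* An NVME_CTRL-type event (assumed name) would mean the HCA lost
             * communication with an NVMe backend: remove the ibv_nvme_ctrl
             * and all namespace mappings that reference it. */
            break;
        }
        ibv_ack_async_event(&ev);
    }
}
```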

@@ -456,6 +456,28 @@ struct spdk_nvmf_host *spdk_nvmf_subsystem_get_next_host(struct spdk_nvmf_subsys
*/
const char *spdk_nvmf_host_get_nqn(struct spdk_nvmf_host *host);

/**


Add Mellanox's copyright at the top of the file

lib/nvmf/ctrlr.c Outdated
@@ -407,6 +407,14 @@ spdk_nvmf_ctrlr_connect(struct spdk_nvmf_request *req)
return SPDK_NVMF_REQUEST_EXEC_STATUS_ASYNCHRONOUS;
}
} else {
#ifdef SPDK_CONFIG_NVMF_OFFLOAD
if (spdk_nvmf_subsystem_get_offload(subsystem)) {
if (!qpair->transport->qpair_enable_offload(qpair, subsystem)) {


Should we add a special function to "transport" that checks whether the transport supports offload?

lib/nvmf/rdma.c Outdated
rc = ibv_query_device(device->context, &device->attr);
#else
rc = ibv_query_device_ex(device->context, NULL, &device->attr_ex);


The rc value from the previous call should be checked before making this call


@sashakot sashakot left a comment


We need to check that an "offloaded" subsystem has only a single namespace

Shuhei Matsumoto and others added 3 commits October 10, 2018 17:19
Large read I/O will be typical in some use cases such as
web stream services. On the other hand, large write I/O
may not be typical but will be sufficiently probable.

Currently, when a large I/O is submitted to the RAID bdev,
the I/O is divided by its strip size and the divided
I/Os are submitted sequentially.

This patch tries to improve the performance of the RAID bdev
in large I/Os. Besides, when the RAID bdev supports higher
levels of RAID (such as RAID5), it should issue multiple
I/Os to multiple base bdevs in a batched fashion in the parity
update. Having experience with batched I/O will be helpful
in future cases too.

In this patch, submit split I/Os by batch until all child IOVs
are consumed or all data are submitted. If all child IOVs are
consumed before all data are submitted, wait until all batched
split I/Os complete and then submit again.

In this patch, test code is added too.

Change-Id: If6cd81cc0c306e3875a93c39dbe4288723b78937
Signed-off-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
Reviewed-on: https://review.gerrithub.io/424770
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
spdk_mem_map_translate() dereferences a uint64_t * to get an
8-byte integer, but nvme_rdma_build_sgl_request() just passes
a 4-byte integer as the last parameter, which causes a
stack-buffer-overflow error.

Reported in https://ci.spdk.io/spdk/builds/review/3ba5ea908781fc5ad311d81bae0b7022ad7b5c51.1539172863/fedora-05/build.log

Change-Id: Id1cda22114fef466dbb930b502e3a68310331f0e
Signed-off-by: wuzhouhui <wuzhouhui@kingsoft.com>
Reviewed-on: https://review.gerrithub.io/428693
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Purpose: We need to get the port info in other applications
(e.g., NVMe-oF TCP/IP transport)

Change-Id: I3a4636e764e44425436bb064cb0062c6f3e44035
Signed-off-by: Ziye Yang <optimistyzy@gmail.com>
Reviewed-on: https://review.gerrithub.io/428313
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Tomasz Kulasek <tomaszx.kulasek@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
@mike-dubman

@yshestakov, @sashakot - when do you plan to connect it to CI?

@sashakot

CI is not ready yet. @yshestakov will update when it's ready

Seth5141 and others added 17 commits October 11, 2018 18:50
The way they were previously being checked was triggering the trap and
printing out a "Configuration failed" message even though the
configuration was successful.

Change-Id: I9de4f390c603631ebf5af5555ea7164aae2b6213
Signed-off-by: Seth Howell <seth.howell@intel.com>
Reviewed-on: https://review.gerrithub.io/428663
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
In spdk_mem_map_alloc() we only do the memory walk when
notify_cb is provided, but spdk_mem_map_free() does the
memory walk unconditionally. Not anymore.

Change-Id: Ic8dfdc5cb2c99dc58e62ab0523cf5a18ba8691cc
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428722
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Newly registered mem maps weren't notified at vaddr 0
as we used the value "0" to denote an uninitialized, unset
variable. Since we *do* register vaddr 0 in our memory
unit tests, let's switch the uninitialized vaddr value to
something different - like UINT64_MAX.

Change-Id: I9c902165e76155e068642abb9a656f3ae8ca1105
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428713
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Moved them from test/env/vtophys to test/env/memory
because that's where they should land in the first
place. Our memory ut already tests allocating a mem_map,
so just extend it with an extra test case now. Since we're
a unit test rather than a fully-fledged SPDK app, we can
simplify the code a lot now - we no longer have any memory
(hugepages) registered at the beginning of our test case,
hence we no longer need to alloc multiple dummy mem maps to
iterate through all registrations - we can simply hardcode
and predict which registrations are there.

Change-Id: I82cd00ea2ad2370bdc63846874885f8c55e11d53
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428714
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Prevents us from unregistering two or more separately
registered regions with a single notification.

This fixes an ibv_mr leak in RDMA. When multiple registrations
were unregistered with a single notification, only the first
ibv_mr one would be freed and the remaining memory would
possibly still remain DMA-able.

As of now, unregistering multiple complete regions with
a single unregister call is not possible. It will be
implemented later, after the rest of the code is cleaned up.

Change-Id: I7d61867fa61fd7a4a8a644ff45cab17125d63e1b
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/425555
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Removed the reference count from the registrations map.

Although technically supported, registering a single memory
region more than once had a lot of unhandled cases and could
easily lead to a segfault.

RDMA maps require all memory to be unregistered in the same
chunks the memory was registered, which is often impossible
to achieve if a region was registered more than once:

1. register region    0x0 - 0x3 -> it gets mapped to
                                   a single ibv_mr
2. register region    0x1 - 0x2 -> nothing happens, this region
                                   is already registered
3. unregister region  0x0 - 0x3 -> 0x0-0x1 gets unregistered as
                                   one region. 0x2-0x3 gets
                                   unregistered as another
                                   (leading to a segfault in
                                   the current RDMA implementation)

The problem is that the last two regions share the same ibv_mr,
which SPDK tries to free twice. The second free causes a segfault.
vtophys map handles this case by registering each 2MB chunk
separately, but this solution cannot be applied for RDMA, as
NICs put a limitation (~2048) on the number of regions registered.

Another option is to keep a refcount of each ibv_mr allocated,
and free it only when the entire region was unregistered from the
SPDK mem map. This is however very tricky and RDMAmojo mentions
that freeing a memory buffer before unregistering its ibv_mr
may lead to a segfault.

Change-Id: I545c56e24ffa55bda211dea22aeb8a55d9631fe5
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/426085
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Added sanity checks to prevent unregistering a memory
range that wasn't registered as a one, complete region.

Change-Id: I819b57560b2e48b0802113ffff9f72949d7a148a
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/425556
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
All mem_maps will still receive separate unregister
notification for each registered region, but the
public memory unregister API is more flexible now.

This follows the VFIO_TYPE1v2_IOMMU interface, which
allows the same.

Change-Id: Ifc008afdc6bff39d9b3b4c892c379ade10c3098e
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428715
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
The problem with registering the entire hotplugged memory
region is that it won't necessarily be unregistered in one
go. Registering each hugepage separately solves that
problem.

This puts a limitation on the number of pages that can
be allocated when using RDMA. We'll hopefully lift this
limitation sometime in the future - probably by leveraging
ibv_rereg_mr, but for now we'll have to resort to either:

 a) using 1GB hugepages
 b) preallocating memory (with [-s|--mem-size <size>] app
    param) as it will be registered as just one region no
    matter what size it is. This memory won't be returned
    to the system until the SPDK app exits.

Change-Id: I6de997fb4901b772730ba6fe995dcc0640b85749
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/428716
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
The ENV_LINKER_ARGS was employing both the linker --[start|end]-group
and --[whole|no-whole]-archive options around the DPDK_LIBs. With
the use of whole/no-whole, the start/end bracketing is unnecessary.

Change-Id: I97a2ac22df8c6b48ba674b9b292f5eea01823901
Signed-off-by: Lance Hartmann <lance.hartmann@oracle.com>
Reviewed-on: https://review.gerrithub.io/428737
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
It's a C library for clients to call RPC methods.

Change-Id: I5378747bd9dab83a41801225ba794b3910d1f5a5
Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com>
Reviewed-on: https://review.gerrithub.io/424061
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I9e0fc92e422de3fc65c5048a63f4c7dcc46f7324
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428727
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Also add some comments.

Change-Id: I97c3a44f97aa3dadc114005c10bec83ae75994cf
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428728
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
While more verbose, this makes it much more obvious that
an array of SGL elements is being filled out.

Change-Id: I98b8e5d46af32c5d7dbb990e267fdfd594942081
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428729
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
This makes this particular function consistent
with all of the other functions in this file, and
I feel it is slightly more readable.

Change-Id: I99ace5b9eb45b0f706ca85a64b155444f45c9815
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/428730
Reviewed-by: Seth Howell <seth.howell5141@gmail.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
If the ipsec submodule is registered to spdk, an empty intel-ipsec-mb
directory will be created. We could potentially try to run make inside
of this empty directory, so instead do a preemptive submodule update.

Change-Id: I367fdef468bf21ef91b8354155d199cea97c3daa
Signed-off-by: Seth Howell <seth.howell@intel.com>
Reviewed-on: https://review.gerrithub.io/428404
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Due to the change of the default Python interpreter to Python 3
we need to decode the bytes object from check_output()
to utf-8, otherwise there is an error.

Change-Id: I83e2d79ec8c3934c5c6d00768288fbb4c5a50914
Signed-off-by: Karol Latecki <karol.latecki@intel.com>
Reviewed-on: https://review.gerrithub.io/428172
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
ShaharSalzman-K and others added 21 commits October 12, 2018 22:50
Initialize fields later assumed to be NULL

Change-Id: I61e054dd275c6c04fb3f826adc445e56f0add331
Signed-off-by: shahar salzman <shahar.salzman@kaminario.com>
Reviewed-on: https://review.gerrithub.io/428304
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Detected by ASAN


Change-Id: I49f160ddc20334a147f39c39015cb340d29f722b
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-on: https://review.gerrithub.io/429227
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Change-Id: Idcdaeb5603c5fbe369884ced52e569cc3149be39
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-on: https://review.gerrithub.io/429228
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This variable will be used for something more than
just building SPDK, namely installing the freebsd
contigmem kernel module.

Note: Installing a module requires root privileges
and can't be done as a part of autobuild.

Change-Id: I45cc797493cc4ff22c1f8d0dd5e4e56642d54d11
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/429186
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
While here, change spdk_lib_list_to_files to
spdk_lib_list_to_static_libs to differentiate it from
the new spdk_lib_list_to_shared_libs.

Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I6e5913addfbdd556fae2451d4e2b2c43feaf33ab

Reviewed-on: https://review.gerrithub.io/429286
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Lance Hartmann <lance.hartmann@oracle.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
This function doesn't return an error code

Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Change-Id: I67a8fa7393990470e509baa8934e78bc6f6a6c9e

Reviewed-on: https://review.gerrithub.io/429441
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Change-Id: I875cc9d6a6bd1e9e9ac25ca9103a2070226ac236

Reviewed-on: https://review.gerrithub.io/428877
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This patch sets optimal_io_boundary to cluster size, so that splitting
happens in bdev layer rather than blobstore layer.

Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Change-Id: I0230cb4a188d605845a709e9c3c9061e822ef0f5

Reviewed-on: https://review.gerrithub.io/428065
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-by: Maciej Szwed <maciej.szwed@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Change-Id: I499f54b025080ad1916acc0cf265a58c806da002
Signed-off-by: Pawel Kaminski <pawelx.kaminski@intel.com>
Reviewed-on: https://review.gerrithub.io/428494
Reviewed-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Change-Id: Idba8ad8afbf92c493d84271fd34443877993997a
Signed-off-by: shahar salzman <shahar.salzman@kaminario.com>
Reviewed-on: https://review.gerrithub.io/428305
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Calling spdk_clear_all_transfer_task alone cannot solve all
the hotplug issues. The iSCSI task may successfully return
while still owning the bdev buffer, so we need to
call this flush PDU function as well.

Change-Id: I255173d0880334e8acccc980a4ce04c380f64435
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Reviewed-on: https://review.gerrithub.io/428801
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reason: For connect, we use non-blocking mode on
the initiator side, but we do not do it for
the accepted fd on the server side, which can
cause writev to never return. This patch fixes that.

PS: SPDK uses non-blocking mode by default.

Change-Id: I709574573a089c2e63ca079829945e864d9f20c2
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Reviewed-on: https://review.gerrithub.io/428654
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: GangCao <gang.cao@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Change-Id: I189ad8889c74937bf43bcf2c3029416ddb94976d
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-on: https://review.gerrithub.io/425705
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Xiaodong Liu <xiaodong.liu@intel.com>
Reviewed-by: Paul Luse <paul.e.luse@intel.com>
Reviewed-by: GangCao <gang.cao@intel.com>
With Identify Namespace Identification Descriptors now executed
asynchronously, most of the functions in the controller
initialization can be executed asynchronously; for
hosts with multiple controllers this can save some time during
initialization.

Change-Id: I70e3c6c2c691134d2ae4c5969288cced1538c6cc
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-on: https://review.gerrithub.io/428585
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: GangCao <gang.cao@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This leaves files created by the root user in the directory and
makes future calls to make clean fail.

Change-Id: Ie33d0d33e8c01a2d17f6991284c5118b5bd545ff
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/429282
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Previously, the difference between configuring and offline was
unclear; this patch fixes that. The key difference is whether
the raid bdev has ever been registered: offline means it was registered
before but is unregistered now, while configuring means it has never been registered.

According to the above, we should never set a configuring raid bdev to
offline because it never got registered.

Change-Id: Id44ef6654e032993ffb8444e7e7ae3e43a9b0f16
Signed-off-by: wuzhouhui <wuzhouhui@kingsoft.com>
Reviewed-on: https://review.gerrithub.io/428321
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
If raid bdev creation failed, the bdev is still configuring and not
registered. raid_bdev_remove_base_bdev() should clean up those
raid bdevs as well.

Change-Id: If2eda8ec80e7fdeb5e551fafe57a43a27ae0f9e6
Signed-off-by: wuzhouhui <wuzhouhui@kingsoft.com>
Reviewed-on: https://review.gerrithub.io/427331
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
This will resolve out-of-space errors that have cropped
up as SPDK continues to grow.  There's no need to copy
*.o files to the mounted filesystem - we 'make clean'
right after the rsync anyways.

Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I6844183c527953fd4b3329f04171f05e503b04dc

Reviewed-on: https://review.gerrithub.io/429517
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Fix variable name added in patch:
https://review.gerrithub.io/#/c/spdk/spdk/+/429049/

Change-Id: I0349dfd16f784a0cc92ff64beae3389c1de8b55c
Signed-off-by: Pawel Niedzwiecki <pawelx.niedzwiecki@intel.com>
Reviewed-on: https://review.gerrithub.io/429485
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Chandler-Test-Pool: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
This is a new feature for the NVMe-oF RDMA target, intended to save resource allocation
(by sharing resources) and utilize locality (completions and memory) to get the best
performance with Shared Receive Queues (SRQs). We'll create one SRQ per core (poll group)
per device and associate each created QP/CQ with an appropriate SRQ.

Our testing environment has 2 hosts.
Host 1:
  CPU: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz dual socket (8 cores total)
  Network: ConnectX-5, ConnectX-5 VPI , 100GbE, single-port QSFP28, PCIe3.0 x16
  Disk: Intel Optane SSD 900P Series
  OS: Fedora 27 x86_64
Host 2:
  CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz dual-socket (24 cores total)
  Network: ConnectX-4 VPI , 100GbE, dual-port QSFP28
  Disk: Intel Optane SSD 900P Series
  OS : CentOS 7.5.1804 x86_64
Hosts are connected via Spectrum switch.
Host 1 is running SPDK NVMeoF target. Host 2 is used as initiator running fio with SPDK plugin.

Configuration:
- SPDK NVMeoF target: cpu mask 0x0F (4 cores), max queue depth 128, max SRQ depth 1024, max QPs per controller 1024
- Single NVMf subsystem with single namespace backed by physical SSD disk
- fio with SPDK plugin: randread pattern, 1-256 jobs, block size 4k, IO depth 16, cpu_mask 0xFFF0, IO rate 10k, rate process “poisson”

Here is a full fio command line:
fio  --name=Job --stats=1 --group_reporting=1 --idle-prof=percpu --loops=1 --numjobs=1 --thread=1 --time_based=1 --runtime=30s --ramp_time=5s --bs=4k --size=4G --iodepth=16 --readwrite=randread --rwmixread=75 --randrepeat=1 --ioengine=spdk --direct=1 --gtod_reduce=0 --cpumask=0xFFF0 --rate_iops=10k --rate_process=poisson --filename=trtype=RDMA adrfam=IPv4 traddr=1.1.79.1 trsvcid=4420 ns=1

SPDK allocates the following entities for every work request in receive queue (shared or not): reqs (1024 bytes), recvs (96 bytes), cmds (64 bytes), cpls (16 bytes), in_capsule_buffer. All except the last one are fixed size. In capsule data size is configured to 4096.
Memory consumption calculation (target):
- Multiple SRQ: core_num * ib_devs_num * SRQ_depth * (1200 + in_capsule_data_size)
- Multiple RQ: queue_num * RQ_depth * (1200 + in_capsule_data_size)
We ignore admin queues in calculations for simplicity.

Cases:
1. Multiple SRQ with 1024 entries:
    - Mem = 4 * 1 * 1024 * (1200 + 4096) = 20.7 MiB (Constant number – does not depend on initiators number)
2. RQ with 128 entries for 64 initiators:
    - Mem = 64 * 128 * (1200 + 4096) = 41.4 MiB

Results:
FIO_JOBS       kIOPS       Bandwidth, MiB/s     AvgLatency, us     MaxResidentSize, kiB
           RQ       SRQ      RQ       SRQ       RQ         SRQ        RQ        SRQ
1          8.623    8.623    33.7     33.7      13.89      14.03      144376    155624
2          17.3     17.3     67.4     67.4      14.03      14.1       145776    155700
4          34.5     34.5     135      135       14.15      14.23      146540    156184
8          69.1     69.1     270      270       14.64      14.49      148116    156960
16         138      138      540      540       14.84      15.38      151216    158668
32         276      276      1079     1079      16.5       16.61      157560    161936
64         513      502      2005     1960      1673.31    1612.38    170408    168440
128        535      526      2092     2054      3329.79    3344.03    195796    181524
256        571      571      2232     2233      6854.57    6873.37    246484    207856

We can see the benefit in memory consumption.

The drawback of using SRQ is a risk of RNR errors when multiple clients initiate a large number of IOs simultaneously.
In "RQ per QP" mode this is handled by Submission Queue flow control, and RNR is not possible.
This patch does not contain any changes to solve RNR issue but we see at least two options here:
- try to increase the RNR retry count to more than 0 which is now hardcoded and make it a configurable parameter.
- implement some mechanism for dynamic SRQ extension. It may scale with number of IO queues or when it reaches some threshold.

Change-Id: I40c70f6ccbad7754918bcc6cb397e955b09d1033
Signed-off-by: Evgeniy Kochetov <evgeniik@mellanox.com>
AlekseyMarchuk pushed a commit that referenced this pull request Jul 2, 2019
For VMD driver we'll need to introduce some way of
iterating over all spdk pci device objects and we would
like to achieve that with simple spdk_pci_get_first_dev()/get_next_dev()
APIs. To make it thread safe though, we would have to
expose some public pci mutex to be locked around the
iteration and we don't want to do that, so we'll make
PCI APIs usable from only a single thread - this will
prevent any pci devices from being removed in between
subsequent get_first/get_next calls.

We currently have the following players accessing pci
device state:
 1) public APIs, obviously (on any thread right now)
 2) VFIO hotremove callback (dpdk interrupt thread)
 3) rte_eal_alarm for detaching rte_pci_devices (dpdk
    interrupt thread)
 4) DPDK hotplug IPC (dpdk interrupt thread)

There is g_pci_mutex providing the thread safety, but
even today it doesn't protect #3 and #4, making the
entire pci layer prone to data corruption.

To make #3 and #4 safe, we would have to lock inside
device init/fini callbacks (spdk_pci_device_init/fini),
but those are called directly inside the public device
attach/detach functions which already lock.

So now, with the decision to drop thread safety from
public pci APIs, we narrow down the locks inside public
functions and introduce locks inside those lower-level
init/fini callbacks.

Change-Id: I5dcbc9cdcbab65ee76cd3c42890f596069ec9a8a
Signed-off-by: Darek Stojaczyk <dariusz.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/c/spdk/spdk/+/458930
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
@EugeneKochetov EugeneKochetov deleted the dev/nvmf_offload branch December 12, 2019 07:57
allen-mlnx pushed a commit that referenced this pull request Aug 18, 2020
Not all JSON methods require the 'params' field to be supplied.
Verification of the JSON is done on the server side in
parse_single_request().

We should not attempt to process garbage values for a correct
JSON config file during app start.

Segfault can be observed if following valid JSON config is supplied:
{
	"method": "framework_wait_init"
}
Resulting in:
json_config.c:388:13: runtime error: applying non-zero offset 18446744073709551600 to null pointer
AddressSanitizer:DEADLYSIGNAL
=================================================================
==3386067==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0000007260ff bp 0x7ffe6ea06890 sp 0x7ffe6ea067e0 T0)
==3386067==The signal is caused by a READ memory access.
==3386067==Hint: this fault was caused by a dereference of a high value address (see register values below).  Dissassemble the provided pc to learn which register was used.
    #0 0x7260ff in app_json_config_load_subsystem_config_entry /home/tzawadzk/spdk/lib/event/json_config.c:391
    #1 0x7cbb13 in msg_queue_run_batch /home/tzawadzk/spdk/lib/thread/thread.c:505
    #2 0x7cd00a in thread_poll /home/tzawadzk/spdk/lib/thread/thread.c:581
    #3 0x7cfe18 in spdk_thread_poll /home/tzawadzk/spdk/lib/thread/thread.c:689
    #4 0x71d6ef in _reactor_run /home/tzawadzk/spdk/lib/event/reactor.c:326
    #5 0x71eb00 in reactor_run /home/tzawadzk/spdk/lib/event/reactor.c:382
    #6 0x71f911 in spdk_reactors_start /home/tzawadzk/spdk/lib/event/reactor.c:477
    #7 0x718237 in spdk_app_start /home/tzawadzk/spdk/lib/event/app.c:691
    #8 0x407e94 in main /home/tzawadzk/spdk/app/spdk_tgt/spdk_tgt.c:120
    #9 0x7f0f2eef2041 in __libc_start_main ../csu/libc-start.c:308
    #10 0x4079ad in _start (/home/tzawadzk/spdk/build/bin/spdk_tgt+0x4079ad)

Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I7ef1a764467817ad788fdf5dbe17eaeb99dcc22e
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3256
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com>
AlekseyMarchuk pushed a commit that referenced this pull request Oct 6, 2021
The bdev layer doesn't call the destruct callback until
all channels have been released, but because the channel
delete callback passes message to the main thread, we can
end up with a complicated race condition.  Currently we
have a deferred_free code path to handle this race, but
we can handle this a bit more cleanly by doing the
construct operation on the main_td as well.

This also simplifies the next patch which will
asynchronously destruct the bdev to fix an RPC bug.

Here's the race:

1) first channel was created on thread A, so disk->main_td = thread A
2) second channel was created on thread B
3) first channel is freed (but disk->main_td is still thread A)
4) spdk_bdev_unregister is called on thread C
5) bdev layer gives callback on thread B to upper layer
6) upper layer on thread B frees channel
7) bdev_rbd_destroy_cb runs on thread B and has to send msg to thread A
   for processing
8) bdev layer calls bdev_rbd_destruct on thread C (since step #4 was on
   thread C)

Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I25ede2dc56e24dac0919aed05b9def2560823ee7
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/9158
Community-CI: Broadcom CI <spdk-ci.pdl@broadcom.com>
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Ziye Yang <ziye.yang@intel.com>
EugeneKochetov pushed a commit that referenced this pull request May 29, 2022
The controller data structure may be freed before the subsystem resume done
callback; we can take the endpoint as the input parameter to avoid this issue.

AddressSanitizer: heap-use-after-free on address 0x625000046100 at pc 0x00000082818f bp 0x7fff7b09bd10 sp 0x7fff7b09bd00
READ of size 8 at 0x625000046100 thread T0 (reactor_0)
    #0 0x82818e in vfio_user_dev_quiesce_resume_done /spdk/lib/nvmf/vfio_user.c:2147
    #1 0x782cc0 in subsystem_state_change_done /spdk/lib/nvmf/subsystem.c:634
    #2 0xad047b in _call_completion /spdk/lib/thread/thread.c:2344
    #3 0xabc48d in msg_queue_run_batch /spdk/lib/thread/thread.c:710
    #4 0xac0670 in thread_poll /spdk/lib/thread/thread.c:926
    #5 0xac0ead in spdk_thread_poll /spdk/lib/thread/thread.c:986
    #6 0x9a5b4f in _reactor_run /spdk/lib/event/reactor.c:920
    #7 0x9a6442 in reactor_run /spdk/lib/event/reactor.c:958
    #8 0x9a717c in spdk_reactors_start /spdk/lib/event/reactor.c:1060
    #9 0x99884a in spdk_app_start /spdk/lib/event/app.c:643
    #10 0x407e82 in main /spdk/app/nvmf_tgt/nvmf_main.c:75
    #11 0x7f822095ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)
    #12 0x407abd in _start (/spdk/build/bin/nvmf_tgt+0x407abd)

0x625000046100 is located 0 bytes inside of 8320-byte region [0x625000046100,0x625000048180)
freed by thread T0 (reactor_0) here:
    #0 0x7f82219ff91f in __interceptor_free (/lib64/libasan.so.5+0x10d91f)
    #1 0x837059 in _free_ctrlr /spdk/lib/nvmf/vfio_user.c:2976
    #2 0x837327 in free_ctrlr /spdk/lib/nvmf/vfio_user.c:2996
    #3 0x843541 in nvmf_vfio_user_close_qpair /spdk/lib/nvmf/vfio_user.c:3742
    #4 0x7d1d91 in nvmf_transport_qpair_fini /spdk/lib/nvmf/transport.c:604
    #5 0x7ad922 in _nvmf_qpair_destroy /spdk/lib/nvmf/nvmf.c:1055
    #6 0x761362 in nvmf_qpair_request_cleanup /spdk/lib/nvmf/ctrlr.c:4026
    #7 0x761906 in spdk_nvmf_request_free /spdk/lib/nvmf/ctrlr.c:4041
    #8 0x75a931 in nvmf_qpair_free_aer /spdk/lib/nvmf/ctrlr.c:3576
    #9 0x7ae626 in spdk_nvmf_qpair_disconnect /spdk/lib/nvmf/nvmf.c:1127
    #10 0x83db36 in _vfio_user_qpair_disconnect /spdk/lib/nvmf/vfio_user.c:3433
    #11 0xabc48d in msg_queue_run_batch /spdk/lib/thread/thread.c:710
    #12 0xac0670 in thread_poll /spdk/lib/thread/thread.c:926
    #13 0xac0ead in spdk_thread_poll /spdk/lib/thread/thread.c:986
    #14 0x9a5b4f in _reactor_run /spdk/lib/event/reactor.c:920
    #15 0x9a6442 in reactor_run /spdk/lib/event/reactor.c:958
    #16 0x9a717c in spdk_reactors_start /spdk/lib/event/reactor.c:1060
    #17 0x99884a in spdk_app_start /spdk/lib/event/app.c:643
    #18 0x407e82 in main /spdk/app/nvmf_tgt/nvmf_main.c:75
    #19 0x7f822095ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

previously allocated by thread T0 (reactor_0) here:
    #0 0x7f82219fff16 in __interceptor_calloc (/lib64/libasan.so.5+0x10df16)
    #1 0x837413 in nvmf_vfio_user_create_ctrlr /spdk/lib/nvmf/vfio_user.c:3010
    #2 0x83bc68 in nvmf_vfio_user_accept /spdk/lib/nvmf/vfio_user.c:3313
    #3 0xabfbd8 in thread_execute_timed_poller /spdk/lib/thread/thread.c:872
    #4 0xac0c75 in thread_poll /spdk/lib/thread/thread.c:960
    #5 0xac0ead in spdk_thread_poll /spdk/lib/thread/thread.c:986
    #6 0x9a5b4f in _reactor_run /spdk/lib/event/reactor.c:920
    #7 0x9a6442 in reactor_run /spdk/lib/event/reactor.c:958
    #8 0x9a717c in spdk_reactors_start /spdk/lib/event/reactor.c:1060
    #9 0x99884a in spdk_app_start /spdk/lib/event/app.c:643
    #10 0x407e82 in main /spdk/app/nvmf_tgt/nvmf_main.c:75
    #11 0x7f822095ff42 in __libc_start_main (/lib64/libc.so.6+0x23f42)

SUMMARY: AddressSanitizer: heap-use-after-free /spdk/lib/nvmf/vfio_user.c:2147 in vfio_user_dev_quiesce_resume_done

Change-Id: Icf5e5b360b9107a3c5eb960ae59b7fe10ace1c66
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/11420
Community-CI: Broadcom CI <spdk-ci.pdl@broadcom.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Dong Yi <dongx.yi@intel.com>
Reviewed-by: John Levon <levon@movementarian.org>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
yshestakov pushed a commit that referenced this pull request Sep 19, 2023
Ubsan with clang complains when using spdk_ioviter with more iters than
declared in the array:

  iov.c:69:9: runtime error: index 3 out of bounds for type 'struct spdk_single_ioviter[2]'
  #0 0x5df709 in spdk_ioviter_firstv /home/vagrant/spdk_repo/spdk/lib/util/iov.c:69:9
  #1 0x53780b in raid5f_xor_stripe /home/vagrant/spdk_repo/spdk/module/bdev/raid/raid5f.c:270:24
  #2 0x531bd8 in raid5f_submit_write_request /home/vagrant/spdk_repo/spdk/module/bdev/raid/raid5f.c:520:2
  #3 0x52a03a in raid5f_submit_rw_request /home/vagrant/spdk_repo/spdk/module/bdev/raid/raid5f.c:596:9
  #4 0x548c17 in test_raid5f_write_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:550:2
  #5 0x544e18 in test_raid5f_submit_rw_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:714:3
  #6 0x553e61 in __test_raid5f_submit_full_stripe_write_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:878:3
  #7 0x543f84 in run_for_each_raid5f_config /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:748:3
  #8 0x527ac1 in test_raid5f_submit_full_stripe_write_request /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:885:2
  #9 0x7f4a71a0960a  (/usr/lib64/libcunit.so.1+0x460a) (BuildId: 9c82dd336cbccd99721651ac0a04435e746e0fc0)
  #10 0x7f4a71a09937  (/usr/lib64/libcunit.so.1+0x4937) (BuildId: 9c82dd336cbccd99721651ac0a04435e746e0fc0)
  #11 0x7f4a71a0a897 in CU_run_all_tests (/usr/lib64/libcunit.so.1+0x5897) (BuildId: 9c82dd336cbccd99721651ac0a04435e746e0fc0)
  #12 0x524fe8 in main /home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut.c:1006:2
  #13 0x7f4a711d750f in __libc_start_call_main (/usr/lib64/libc.so.6+0x2750f) (BuildId: 81daba31ee66dbd63efdc4252a872949d874d136)
  #14 0x7f4a711d75c8 in __libc_start_main@GLIBC_2.2.5 (/usr/lib64/libc.so.6+0x275c8) (BuildId: 81daba31ee66dbd63efdc4252a872949d874d136)
  #15 0x4235b4 in _start (/home/vagrant/spdk_repo/spdk/test/unit/lib/bdev/raid/raid5f.c/raid5f_ut+0x4235b4) (BuildId: 028d075edd1a7cd17881fd678ef076adfdbac13d)

Fix this by making iters a zero-length array and putting it in a union with a
two-element array to keep the default size for compatibility.

Change-Id: I8573b015755e9986cdadbfa1705d269d51a7c2b7
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/18402
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
EugeneKochetov pushed a commit that referenced this pull request Apr 9, 2024
As per typedef in nvme.h the spdk_nvme_cpl argument should be a
pointer to a const struct.

This fixes a runtime error under clang >= 17.x which now makes the
-fsanitize=function available for C and which on our end is being
enabled via -fsanitize=undefined under UBSAN.

Error in question:

 Test: test_spdk_nvme_detach ...passed
  Test: test_nvme_completion_poll_cb ...passed
  Test: test_nvme_user_copy_cmd_complete
.../root/spdk/lib/nvme/nvme.c:417:2: runtime error: call to function
dummy_cb through pointer to incorrect function type 'void (*)(void *,
const struct spdk_nvme_cpl *)'
/root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut.c:584: note: dummy_cb
defined here
    #0 0x5098e0 in nvme_user_copy_cmd_complete
       /root/spdk/lib/nvme/nvme.c:417:2
    #1 0x532161 in test_nvme_user_copy_cmd_complete
       /root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut.c:604:2
    #2 0x7f08c952266a  (/usr/lib64/libcunit.so.1+0x466a) (BuildId:
       d99e3b60795f2ce01ada820b4b7e3cd84d8150fe)
    #3 0x7f08c95229c7  (/usr/lib64/libcunit.so.1+0x49c7) (BuildId:
       d99e3b60795f2ce01ada820b4b7e3cd84d8150fe)
    #4 0x7f08c9523a9f in CU_run_all_tests
       (/usr/lib64/libcunit.so.1+0x5a9f) (BuildId:
d99e3b60795f2ce01ada820b4b7e3cd84d8150fe)
    #5 0x55555e in run_tests /root/spdk/lib/ut/ut.c:169:3
    #6 0x552aec in spdk_ut_run_tests /root/spdk/lib/ut/ut.c:225:8
    #7 0x522d52 in main
       /root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut.c:1664:17
    #8 0x7f08c8c28149 in __libc_start_call_main
       (/usr/lib64/libc.so.6+0x28149) (BuildId:
7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #9 0x7f08c8c2820a in __libc_start_main@GLIBC_2.2.5
       (/usr/lib64/libc.so.6+0x2820a) (BuildId:
7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #10 0x42b6a4 in _start
	(/root/spdk/test/unit/lib/nvme/nvme.c/nvme_ut+0x42b6a4)
(BuildId: 6fc2caaf777030becad2d0f660ec68443f3380b4)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
/root/spdk/lib/nvme/nvme.c:417:2 in
./test/unit/unittest.sh: line 85: 75349 Aborted                 (core
dumped) $valgrind $testdir/lib/nvme/nvme.c/nvme_ut

Change-Id: Iddbd5fc0dee0ef6a6cc1f032e079f6119e76aed9
Signed-off-by: Michal Berger <michal.berger@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/22025
Reviewed-by: Jim Harris <jim.harris@samsung.com>
Community-CI: Mellanox Build Bot
Reviewed-by: Konrad Sztyber <konrad.sztyber@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
EugeneKochetov pushed a commit that referenced this pull request Aug 12, 2024
This is a cumulative patch that addresses the comments on the mainline
SPDK CR ([1]) and specifically - the naming related one ([2]).

[1] https://review.spdk.io/gerrit/c/spdk/spdk/+/22511/
[2] https://review.spdk.io/gerrit/c/spdk/spdk/+/22511/comment/296bf45a_eabb58af/

Change-Id: I78ab7b8409ff7be513c1fb18af32c8637500a848
Signed-off-by: Anton Nayshtut <anayshtut@nvidia.com>