librdmacm/cmtime: Add multi-thread support #1451
Merged
+767 −388
Conversation
Add a command line option that suspends the executing thread for a given number of microseconds instead of calling a QP operation. This provides a rough simulation of the HW interactions to create and modify a QP without allocating the HW resources. The purpose of this change is to allow testing the CM behavior at scales that are impractical or impossible given system constraints. Signed-off-by: Sean Hefty <shefty@nvidia.com>
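The commit above describes replacing a QP operation with a timed sleep. A minimal sketch of that idea, assuming illustrative names (`mimic_qp_delay_us`, `modify_qp` are not the actual cmtime symbols):

```c
/* Sketch: substitute a timed sleep for the real QP operation so the CM
 * path can be exercised at scales where HW resources are unavailable.
 * Names here are illustrative, not the actual cmtime implementation. */
#include <stdint.h>
#include <unistd.h>

static uint32_t mimic_qp_delay_us;  /* set from the command line; 0 = real QP op */

static int modify_qp(void)
{
	if (mimic_qp_delay_us) {
		/* Roughly simulate the HW interaction without allocating
		 * HW resources. */
		usleep(mimic_qp_delay_us);
		return 0;
	}
	/* ... the real QP modify path would go here ... */
	return 0;
}

static int run_once(void)
{
	mimic_qp_delay_us = 5;	/* pretend the QP modify takes ~5 us */
	return modify_qp();
}
```

This keeps the event and work-queue machinery identical whether real or simulated QPs are used; only the innermost operation changes.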
Restructure the code to move the client and server code from main into isolated functions. A subsequent patch will add more operations to both. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Add a simple socket connection to sync the client and server during testing. When scaling up tests, CM issues are exposed in unexpected ways. :) Yes, the problems are being exposed, but earlier and in ways that are more difficult to analyze. Example: A client finishes connecting by sending an RTU message. It then immediately starts disconnecting, which sends DREQs. The problem is that the number of messages seen by the server is now twice the number of connections. The result is that DREQs get dropped, which causes retries, and the client and server fall out of sync. So, although the test shows scaling for connecting, it starts to break when disconnecting. The actual synchronization during the test is added as a separate change. This patch just lays the foundation. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Replace the started != completed check with a simpler check that completed != iterations to perform. This removes the started variable. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Instead of using a hidden global variable that's set to the iteration parameter, use the iter parameter directly and pass it through to sub-functions. This makes the code easier to read. Note that the global connections variable is still needed after this patch in order to track asynchronous events that need to be processed. However, its use is reduced. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Update the server, so that it processes events using a separate thread, similar to the client. This allows the main thread to continue as a control thread, which is then used to synchronize the testing with the client. By synchronizing the tests, the server can now know what events it expects to receive. This enables the server to track the start and end of the connect and disconnect phases. Prior to adding synchronization, disconnect events could have been generated before all connections completed. (For example, the RTU from the client was lost, requiring that it be retried. During that window, the client could start the disconnect process. This made the number of events that the server would actually see unknown in the case packets were lost.) Update the code to track the start and end of the phases. That data feeds into the control thread determining when to sync. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Isolate the work queue abstraction from the cmtime. Define a work_item structure to allow individual work items to invoke separate callbacks. Add support for multiple threads. Move the code from the cmtime source into the common source for future reusability. This change will enable cmtime to use a single work queue to process different steps in the connection process. Having a single work queue will limit the number of threads that the test will need to spawn. Signed-off-by: Sean Hefty <shefty@nvidia.com>
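The generalized work queue described in this commit can be sketched roughly as follows; the structure and function names are illustrative, not the exact rdma-core `common` code:

```c
/* Rough sketch of a work queue where each work_item carries its own
 * callback and multiple threads can drain a shared list. Names are
 * hypothetical, not the actual rdma-core abstraction. */
#include <pthread.h>
#include <stddef.h>

struct work_item {
	struct work_item *next;
	void (*work)(struct work_item *item);	/* per-item callback */
};

struct work_queue {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	struct work_item *head, **tail;
};

static void wq_init(struct work_queue *wq)
{
	pthread_mutex_init(&wq->lock, NULL);
	pthread_cond_init(&wq->cond, NULL);
	wq->head = NULL;
	wq->tail = &wq->head;
}

static void wq_insert(struct work_queue *wq, struct work_item *item,
		      void (*work)(struct work_item *))
{
	item->next = NULL;
	item->work = work;
	pthread_mutex_lock(&wq->lock);
	*wq->tail = item;
	wq->tail = &item->next;
	pthread_cond_signal(&wq->cond);	/* wake one worker thread */
	pthread_mutex_unlock(&wq->lock);
}

static int wq_hits;

static void wq_count_cb(struct work_item *item)
{
	(void) item;
	wq_hits++;
}

/* Single-threaded demo: insert two items, then drain the list and
 * invoke each item's callback, as a worker thread would. */
static int wq_demo(void)
{
	struct work_queue wq;
	struct work_item a, b;
	struct work_item *item;

	wq_init(&wq);
	wq_insert(&wq, &a, wq_count_cb);
	wq_insert(&wq, &b, wq_count_cb);
	while ((item = wq.head)) {
		wq.head = item->next;
		item->work(item);
	}
	return wq_hits;
}
```

Giving each item its own callback is what lets a single queue carry heterogeneous steps (QP create, address resolve, route resolve, QP modify), so the test needs only one thread pool instead of one per step.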
To simplify reading the code, move the trivial event handler functions into the main event handler call. This helps make it clear that the event thread should just process the event directly, rather than queuing to a worker thread. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Have the client use a work queue to multi-thread the QP creation, address resolution, route resolution, and QP modify operations. Signed-off-by: Sean Hefty <shefty@nvidia.com>
The test tries to check for connection errors and continue if one occurs. However, the client and server must be in sync regarding the number of connections. Plus, since the goal of the test is to provide accurate timings of CM steps, if an error occurs, the timings are thrown off. Finally, the error handling is insufficient for what it attempts to do. Just exit the test on error after printing a message, and avoid giving a reader the idea that errors are handled properly. Signed-off-by: Sean Hefty <shefty@nvidia.com>
After the client sends a disconnect request to the server, have it wait for the server to sync using the OOB mechanism. As the number of connections to test gets close to 1000, it is common for the DREP to fail to make it back in response to the DREQ. The result is that the DREQ must time out completely before the client can proceed. Note that this appears to be exposing undesirable behavior from the kernel CM regarding duplicate DREQ handling. However, because the timeouts are so long, it impacts the ability to execute the test to collect connection setup timings. Signed-off-by: Sean Hefty <shefty@nvidia.com>
A subsequent patch will enable multiple threads to update the completed counters in parallel. Switch to using atomics. Signed-off-by: Sean Hefty <shefty@nvidia.com>
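The counter change can be illustrated with C11 atomics; the names below are hypothetical stand-ins for the cmtime counters:

```c
/* Sketch: replace a plain counter with a C11 atomic so multiple worker
 * threads can update it in parallel. Names are illustrative. */
#include <stdatomic.h>

static atomic_int completed;

static void complete_one(void)
{
	/* Safe to call from any worker thread concurrently. */
	atomic_fetch_add(&completed, 1);
}

static int all_done(int iter)
{
	return atomic_load(&completed) >= iter;
}

/* Single-threaded demo of the counter reaching the iteration target. */
static int atomic_demo(void)
{
	for (int i = 0; i < 4; i++)
		complete_one();
	return all_done(4);
}
```

`atomic_fetch_add` gives the read-modify-write atomicity that a plain (or `volatile`) increment lacks once several threads touch the same counter.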
Once multi-thread support is added to cmtime, the use_qpn and disc_events variables will be accessed by multiple threads. Use atomic variables instead of volatiles. Signed-off-by: Sean Hefty <shefty@nvidia.com>
Add a command line argument to set the number of threads assigned to a work queue. Signed-off-by: Sean Hefty <shefty@nvidia.com>
When multiple work items are added to the work queue, it's possible that only a single thread is signaled to process the work. The result is that the other threads continue to wait on the condition. Since the test inserts work using a loop, this occurs frequently, and the behavior appears as a single-threaded app, even if multiple threads were requested. After removing a work item, if other work items remain, signal the condition to wake up another thread. Signed-off-by: Sean Hefty <shefty@nvidia.com>
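The chained wake-up fix can be shown with a minimal, self-contained producer/consumer sketch (names hypothetical): the consumer re-signals the condition whenever work remains after its own dequeue, so a burst of inserts eventually wakes every waiting thread.

```c
/* Self-contained sketch of the chained wake-up: after taking one unit
 * of work, re-signal the condition if work remains so another sleeping
 * thread also runs. Names are illustrative. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int pending;

static void produce_burst(int n)
{
	pthread_mutex_lock(&lock);
	pending += n;
	/* A producer signal may wake only one of several sleepers. */
	pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

static void consume_one(void)
{
	pthread_mutex_lock(&lock);
	while (!pending)
		pthread_cond_wait(&cond, &lock);
	pending--;
	if (pending)			/* the fix: wake the next worker too */
		pthread_cond_signal(&cond);
	pthread_mutex_unlock(&lock);
}

/* Single-threaded demo: a burst of 3 is fully drained. */
static int burst_demo(void)
{
	produce_burst(3);
	consume_one();
	consume_one();
	consume_one();
	return pending;
}
```

An alternative would be `pthread_cond_broadcast()` on insert, but re-signaling on dequeue wakes threads only while work actually remains, avoiding a thundering herd.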
Remove the ':' separators to make it easier to import the data into Excel. Move the avg/iter value forward, since it's the more useful value to examine. Include headers with some test details. Signed-off-by: Sean Hefty <shefty@nvidia.com>
To provide a reasonable baseline, add an option to connect over TCP sockets instead of RDMA. (The author was unable to find a suitable connection rate test for sockets. However, since this is just to provide something to compare against, the code is fairly simple.) Signed-off-by: Sean Hefty <shefty@nvidia.com>
Added a patch that tests TCP socket connection rate, which can be used as a baseline.
The primary objective of this series is to add multi-thread support to the cmtime app. Multi-threading is added by generalizing the work queue abstraction used by cmtime. (If there's a usable version of a work queue suitable for rdma-core use, I can switch to that if needed.) The series also adds:
The final version of cmtime could use a review to verify that it is capturing the data as intended.