Replies: 7 comments 8 replies
-
Forgot to mention: I am using liburing 2.5.
-
I did take a quick look at this yesterday, mostly from the perspective of MSG_RING not being particularly performant. But I don't think that's really the case here; it's more of a scheduler benchmark than anything else due to the local sockets. E.g. both would perform much better if just pinned to a single core.

There seems to be a bit of confusion here too in terms of context switches vs syscalls. A syscall is not necessarily a context switch; that only happens if the task ends up sleeping. For example, when you submit your pending IO and then wait for some, the task sleeps ever so briefly before being woken to run the task_work and then ultimately return the 1 event you waited for. I also suspect that's why the change to io_uring_submit_and_get_events() makes a difference: you're no longer sleeping waiting on a new event. So you'd probably end up doing more syscalls, but fewer context switches.
-
Can you put your updated test app somewhere in a repo? Just so we're both looking at the same thing.

One other odd thing I noticed is that you set POLLFIRST for sends. That's probably not a good idea in general, as we expect data to be there. Just a side remark; it's not (really) related to any slowdowns, it's just more overhead for nothing. I strongly suspect it's some kind of scheduling side effect, but I'll find some time next week to really dive into this one.

For reducing context switches, it's not related to the timeout, it's related to the number you wait for. With DEFER_TASKRUN, if you wait for e.g. 8 completions, then it won't wake the task at all to process the completions until you have 8 of them. That can greatly reduce the context switches. Without DEFER_TASKRUN this isn't really possible, and you end up getting woken for each completion coming in, regardless of how many you are waiting for.
-
Sorry for the super late reply here, but I think I have some insights that'll help dramatically speed things up.

```cpp
for (std::size_t i = 0; i != num_messages; ++i)
{
    auto* sqe = get_sqe(&dispatch_ring);
    auto msg_data = encode_userdata(operation_type::ring_msg, &msg);
    ::io_uring_prep_msg_ring(sqe, event_ring.ring_fd, dispatch_ring.ring_fd, msg_data, 0);
    sqe->flags = IOSQE_CQE_SKIP_SUCCESS;
    ::io_uring_submit_and_wait(&dispatch_ring, 1);
    handle_events(&dispatch_ring);
}
```

Submission and reaping should be in decoupled loops. What happens here is that we submit-then-wait for every message, which isn't fast. Fill the submission queue until it is full, then submit everything in one go. The CQEs can then be reaped later in a separate loop.

It also seems like you're using relatively shallow CQ sizes. I'd suggest increasing the size of your CQ; maybe try 32k or higher and see if it has tangible benefits.
-
Rather than submitting and waiting unconditionally, try something like:

```
io_uring_submit(ring)
cq_ready = io_uring_cq_ready(ring)
if cq_ready:
    # do stuff
else:  # wait
    io_uring_wait_cqe_nr(ring, cqe, 1)
```

This way you are only waiting when there is no activity.
-
So I spent a little more time with this tiny example ... I applied the same technique to the io_uring implementation:
What I found interesting here is that changing the send part of the code alone to the 'edge-triggered' approach did not help. I guess this is the way forward for me now ... and I hope the problem is mitigated that way once the system is under load and the recv/send calls would block.
-
Hi,
I am working on writing a backend for communicating over Unix Domain Sockets and wanted to implement it using liburing, as the features presented sound very appealing.
Nevertheless, I am starting to pull my hair out because I cannot achieve the performance I was hoping for, both in terms of single-client throughput and scalability.
I've gone through multiple iterations already ... nevertheless, I wanted to start out fresh and follow all best practices (fixed files, multishot requests, etc.) from https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023.
So here is my general idea:
- The server accepts connections (via io_uring_prep_multishot_accept_direct) and then receives data (through io_uring_prep_recv_multishot). Once a single message is complete, a response is sent. This should later be used to invoke an RPC.
- The client uses io_uring_prep_msg_ring to notify the event handling loop, and then initiates the send and receive part. Since the "client" is persistent, the receive is again handled with a multishot request.
- All file descriptors are registered as fixed files and accessed via IOSQE_FIXED_FILE.

I hope this makes sense so far ... Attached is the code to reproduce the issue.
It can be compiled as follows:

```
g++ -std=c++20 uring_repro.cpp -I$HOME/.local/include $HOME/.local/lib/liburing.a -g -O3
```

uring_repro.txt
Now, one of the most interesting things for me is throughput and latency, especially for small-ish requests (say under 1KB).
The test itself exchanges 128 bytes of data and is able to switch between synchronous send/recv for the client and dispatching it to io_uring (via the command line parameter client_sync).
I am running the following kernel:
Here is the performance result:
The non-io_uring client is almost twice as fast as when using io_uring. I did not expect that!
Do you have any explanation for this behavior?
When profiling the example, I can see that in the client_sync case, io_uring_submit_and_wait spends around 13% inside schedule (in the kernel), while the full io_uring version spends around 25% inside this kernel function. Is this a red herring?
Any further input is highly appreciated!
Regards,
Thomas