Replies: 7 comments 8 replies
-
Forgot to mention: I am using liburing 2.5.
-
I did take a quick look at this yesterday, mostly from the perspective of MSG_RING not being particularly performant. But I don't think that's really the case here; it's more of a scheduler benchmark than anything else due to the local sockets. E.g. both would perform much better if just pinned to a single core.

There seems to be a bit of confusion here too in terms of context switches vs syscalls. A syscall is not necessarily a context switch; that only happens if the task ends up sleeping. For example, when you submit your pending IO and then wait for some, the task sleeps ever so briefly before being woken to run the task_work and then ultimately return the 1 event you waited for. I also suspect that's why the change to io_uring_submit_and_get_events() makes a difference: you're no longer sleeping waiting on a new event. So you'd probably end up doing more syscalls, but fewer context switches.
-
Can you put your updated test app somewhere in a repo? Just so we're both looking at the same thing.

One other odd thing I noticed is that you set POLLFIRST for sends. That's probably not a good idea in general, as we expect data to be there. Just a side remark; it's not (really) related to any slowdowns, it's just more overhead for nothing. I strongly suspect it's some kind of scheduling side effect, but I'll find some time next week to really dive into this one.

For reducing context switches, it's not related to the timeout, it's related to the number you wait for. With DEFER_TASKRUN, if you wait for e.g. 8 completions, then it won't wake the task at all to process the completions until you have 8 of them. That can greatly reduce the context switches. Without DEFER_TASKRUN this isn't really possible, and you end up getting woken for each completion coming in, regardless of how many you are waiting for.
-
Sorry for the super late reply here, but I think I have some insights that'll help dramatically speed things up.

```cpp
for (std::size_t i = 0; i != num_messages; ++i)
{
    auto* sqe = get_sqe(&dispatch_ring);
    auto msg_data = encode_userdata(operation_type::ring_msg, &msg);
    ::io_uring_prep_msg_ring(sqe, event_ring.ring_fd, dispatch_ring.ring_fd, msg_data, 0);
    sqe->flags = IOSQE_CQE_SKIP_SUCCESS;
    ::io_uring_submit_and_wait(&dispatch_ring, 1);
    handle_events(&dispatch_ring);
}
```

Submission and reaping should be in decoupled loops. What happens here is that we submit-then-wait for every message, which isn't fast. Fill the submission queue until it is full, then submit everything in one go. The CQEs can then be reaped later in a separate loop.

It also seems like you're using relatively shallow CQ sizes. I'd suggest increasing the size of your CQ; maybe try 32k or higher and see if it has tangible benefits.
-
Rather than submitting and waiting unconditionally, try something like:

```
io_uring_submit(ring)
cq_ready = io_uring_cq_ready(ring)
if cq_ready:
    # do stuff
else:  # wait
    io_uring_wait_cqe_nr(ring, cqe, 1)
```

This way you are only waiting when there is no activity.
-
So I spent a little more time with this tiny example ... I applied the same technique to the io_uring implementation:
What I found interesting here is that changing the send part of the code alone to the 'edge-triggered' approach did not help. I guess this is the way forward for me now ... and I hope the problem is mitigated that way once the system is under load and the recv/send calls would block.
-
Hi,
I am working on writing a backend for communicating over Unix Domain Sockets and wanted to implement it using liburing, as the features presented sound very appealing.
Nevertheless, I am starting to pull my hair out because I cannot achieve the performance I was hoping for, both in terms of single-client throughput and scalability.
I've gone through multiple iterations already ... nevertheless, I wanted to start out fresh and follow all best practices (fixed files, multishot requests, etc.) from https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023.
So here is my general idea:
- The server accepts connections (via io_uring_prep_multishot_accept_direct) and then receives data (through io_uring_prep_recv_multishot). Once a single message is complete, a response is sent. This should later be used to invoke an RPC.
- The client uses io_uring_prep_msg_ring to notify the event handling loop, and then initiates the send and receive part. Since the "client" is persistent, the receive is again handled with a multishot request.
- All file descriptors are registered as fixed files and accessed via IOSQE_FIXED_FILE.

I hope this makes sense so far ... Attached is the code to reproduce the issue.
It can be compiled as follows:

```
g++ -std=c++20 uring_repro.cpp -I$HOME/.local/include $HOME/.local/lib/liburing.a -g -O3
```

uring_repro.txt
Now, one of the most interesting things for me is throughput and latency, especially for small-ish requests (say under 1KB).
The test itself exchanges 128 bytes of data and is able to switch between synchronous send/recv for the client and dispatching it to io_uring (via the command line parameter client_sync).
I am running the following kernel:
Here is the performance result:
The non-io_uring client is almost twice as fast as when using io_uring. I did not expect that!
Do you have any explanation for this behavior?
When profiling the example, I can see that in the client_sync case, io_uring_submit_and_wait spends around 13% inside schedule (in the kernel), while the full io_uring version spends around 25% inside this kernel function. Is this a red herring?
Any further input is highly appreciated!
Regards,
Thomas