io_uring benchmarks don't max CPU #39
Comments
Sure, I can help with the kernel upgrades. Just hit me up and we can do this. |
Kernel 5.5.0-050500-generic has been installed on asp-perf-lin, so we can now run benchmarks against it. I also installed today's master of liburing, just in case. |
That's great! I need to make a change similar to #37, and then we can run some benchmarks. |
@adamsitnik is this the machine you can perfcollect? |
Yes :D @tmds what benchmarks do you want me to run now? |
Done in #43. Let's see how these configurations perform:
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
|
@tmds I've re-run the benchmarks with tracing enabled. Files uploaded to g drive as usual.
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
|
Thanks Adam!
|
Can you also run it for |
Shouldn't we also try |
Setting We haven't done much benchmarking with |
By adding this logging logic, I get the following printed with
|
The first request is synchronous because data is already available. Subsequent requests will wait for data, causing receives to move to the epoll/io_uring thread. Check what happens when you use |
As usual, traces uploaded to g drive ;)
--path "/json" --arg "-e=iouring" --arg "-t=1" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=1" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=2" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=2" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=5" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=5" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=6" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=6" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=7" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=7" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=8" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=8" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
|
Putting these results in a table:
We're unable to max out the CPU with io_uring. I hope we can identify the cause in the trace files. @adamsitnik can you share the trace folder also with @antonfirsov, and with @lpereira in case he also wants to take a look? I'll look at the trace files next week. |
Anton already has access. I'll try to take a look as well (around Monday). |
@adamsitnik something went wrong with some of the trace files. Usually they are 60MB in size; some are <10MB and don't show anything useful in PerfView. I've opened up the 03-05-19-29-23.RPS-421K trace. In our code we set the submission queue length to 512:
Most operations take 2 submission entries (1 to poll for readiness, and 1 to do the operation). So with 256 connections handled by 1 io_uring thread, a single batch can need 256 × 2 = 512 entries, which could make the queue a bottleneck if all submissions are made together. With multiple threads it should not be a bottleneck. We can increase the value just in case. Let me know what you make of this... |
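For illustration, a minimal liburing sketch of the poll-then-operate pattern described above (an assumption-laden sketch, not the repo's actual code; it uses `readv`, since `IORING_OP_RECV` only arrived in kernel 5.6). Each receive costs two SQEs, so one batch for 256 connections fills a 512-entry submission queue exactly:

```c
#include <liburing.h>
#include <poll.h>
#include <sys/uio.h>

/* Queue a "poll for readable, then read" pair for one connection.
 * Two SQEs per operation: 256 connections * 2 = 512 entries. */
static int queue_recv(struct io_uring *ring, int fd, struct iovec *iov)
{
    struct io_uring_sqe *sqe;

    /* SQE 1: wait until the socket has data. */
    sqe = io_uring_get_sqe(ring);
    if (sqe == NULL)
        return -1;               /* submission queue is full */
    io_uring_prep_poll_add(sqe, fd, POLLIN);
    sqe->flags |= IOSQE_IO_LINK; /* chain to the next SQE */

    /* SQE 2: the read itself, started only after the poll completes. */
    sqe = io_uring_get_sqe(ring);
    if (sqe == NULL)
        return -1;
    io_uring_prep_readv(sqe, fd, iov, 1, 0);
    return 0;
}
```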
It would be interesting to benchmark using https://github.com/tkp1n/IoUring.Transport and see if it can fully load the benchmark machine (100% cpu). We can add it as an option to the test/web app. |
Might be a good idea to try a newer kernel, too. Although io_uring has been in the kernel since 5.1, there have been numerous changes and improvements since then (either adding new opcodes, or performance enhancements). I'd try 5.6-rc5. |
@tmds good idea, I will give it a try |
With https://github.com/tmds/Tmds.LinuxAsync/pull/56/files we can benchmark with IoUring.Transport. Let's see if it also shows a bottleneck.
|
Yes, it's something we need to explore as well. If you check CPU usage on the thread pool threads (which are performing the sends), you should see they are not 100% loaded, so this is not what causes the bottleneck we see with io_uring. I'm trying to focus on benchmarks that tell us something specific at this point. We also need to do overall benchmarks that combine multiple options and find out the best we can achieve. |
I will try running a few new benchmarks; I think I have all the information I need to do so now. |
I managed to run the requested benchmarks; results and traces are available here. Summary:
Looks like IoUring.Transport outperforms our io_uring implementation, but it is still not as good as our best epoll results. I also tried moving sends to the uring thread by naively setting |
Thank you for running the benchmarks, Anton.
From the benchmark results on SharePoint, I see |
@tkp1n you might be also interested in the results. |
A value of 1 indicates a program that, on average, consumes all the CPU of a single processor; 3 means we are using 100% of 3 CPUs. |
@tmds @adamsitnik when running the benchmarks, on We should probably increase |
Had a quick chat with @sebastienros, we should utilize BenchmarkDriver2 and the Json-Ulib scenario to get this right. Will look into it tomorrow. /CC @alnikola @adamsitnik. |
We want the load machine not to be at full CPU load while running the benchmarks; that means it can provide the required load without itself being the bottleneck.
|
I had another look at the trace files. I noticed the same things as mentioned in #39 (comment). |
As mentioned in axboe/liburing#97 (comment), we should not use IOSQE_IO_LINK. I'll address this in IoUring.Transport (as it's easier for me to change) and make it available for re-testing. If we then see full CPU usage in the benchmarks, we should adapt the io_uring implementation here as well. @tmds your thoughts on this? |
Let's verify first if this now maxes CPU. |
@tkp1n please let us know when you're done; I'm happy to redo the benchmarks. Do you plan to expose a parameter on |
A very unstable first draft of IoUring.Transport was already written not to use LINK. I'll revert to that logic (minus the bugs, hopefully) on master, without adding an option at first. Adding an option would mean major code duplication, or a delay while I come up with a proper abstraction to avoid the duplication... Please let me know if an option would make things significantly easier on your side, though! |
@antonfirsov I just pushed a commit to master to remove all instances of IOSQE_IO_LINK. Let's hope for full CPU utilization :) If the results are non-obvious, I'll add the discussed option to let us switch between the two approaches quickly. |
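For context, a hypothetical sketch of the unlinked alternative: the poll is submitted on its own, and the read is only queued once the poll completion arrives, so no SQE chain gets punted to async workers. The `struct conn` type and function names are made up for the example:

```c
#include <liburing.h>
#include <poll.h>
#include <sys/uio.h>

struct conn {                   /* hypothetical per-connection state */
    int fd;
    struct iovec iov;
};

/* Step 1: arm a bare poll; the connection travels in user_data. */
static void arm_poll(struct io_uring *ring, struct conn *c)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, c->fd, POLLIN);
    io_uring_sqe_set_data(sqe, c);
}

/* Step 2: on the poll completion, queue the read as a standalone
 * SQE -- no IOSQE_IO_LINK involved. */
static void on_poll_cqe(struct io_uring *ring, struct io_uring_cqe *cqe)
{
    struct conn *c = io_uring_cqe_get_data(cqe);
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_readv(sqe, c->fd, &c->iov, 1, 0);
    io_uring_sqe_set_data(sqe, c);
}
```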
I have executed the PlatformBenchmarks JSON benchmark in all configurations, and it looks like the CPU is now fully utilized:
|
Probably a switch that disables linking. Another option is to test for the IORING_FEAT_FAST_POLL feature flag, so that we can choose the best method
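As a sketch of that second option (assuming liburing headers recent enough to define the 5.7 flag): `io_uring_queue_init` fills in the `features` field of `struct io_uring`, so the choice can be made once at startup:

```c
#include <liburing.h>
#include <stdbool.h>

/* Returns true when the running kernel (5.7+) polls non-blocking
 * sockets internally, so ops can be submitted without a poll first. */
static bool has_fast_poll(const struct io_uring *ring)
{
#ifdef IORING_FEAT_FAST_POLL
    return ring->features & IORING_FEAT_FAST_POLL;
#else
    return false; /* headers predate the flag */
#endif
}
```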
@antonfirsov did you use the Sockets-based Transport from this repo? Can we see how the Sockets-based Transport from this repo, using AIO, compares to IoUring master?
|
Given the above test results from @antonfirsov I think a switch to enable A switch to disable Kernel version 5.7 seems to provide a much easier programming model for network developers, and also better perf. I could also live with a decision not to use |
These are the different ways we are polling:
@adamsitnik I never explicitly mentioned the difference between |
@adamsitnik @antonfirsov I'd like to run the benchmarks from #39 (comment) so we have an idea of the potential gain of io_uring compared to AIO on a 5.5 kernel. @antonfirsov do you want to update the implementation so poll+operation are no longer batched? It should be possible to localize the changes to |
Please run them for a range of t=1..ProcessorCount. IoUringTransport especially needs a higher t, because the sends happen on the io_uring thread. |
Yes, I've been running "raw" PlatformBenchmarks to make the comparison fair. I will run all requested benchmarks from this repo in 1-2 hours.
Yes, I'll do that, hopefully by the end of the day. |
@tmds the last one is very surprising to me. When running PlatformBenchmarks with the default setup (t=12), setting |
Can you run these benchmarks for t=1..ProcessorCount? You can also do benchmarks for both |
I was not sure if you meant physical or logical ProcessorCount, so I did it for t=1..12.
RPS values:

t   epoll    epoll+i  iourt    iourt+i
1   323,062  120,800  123,420  121,355
2   395,237  172,156  205,454  224,303
3   444,085  228,824  308,281  330,000
4   436,653  257,696  434,057  394,257
5   438,405  308,799  467,698  450,207
6   452,817  317,180  469,067  518,186
7   423,527  319,580  459,151  500,609
8   428,509  320,677  458,801  492,271
9   401,790  334,822  456,361  481,208
10  410,495  342,311  453,251  476,992
11  391,220  331,554  453,505  486,638
12  416,274  338,851  446,474  483,669

Latency values:

t   epoll  epoll+i  iourt  iourt+i
1   0.36   2.12     2.24   2.22
2   0.79   1.49     1.33   1.18
3   1.03   1.14     0.92   0.81
4   1.09   1.02     0.77   0.68
5   1.29   1.03     1      0.59
6   1.12   1.46     1.31   0.83
7   1.53   1.87     1.55   0.76
8   1.53   2.07     1.61   0.77
9   1.66   2.26     1.7    0.7
10  1.72   2.41     1.74   0.74
11  1.72   2.67     1.52   0.77
12  1.24   2.59     1.44   0.7
|
Ooh, these numbers are starting to look interesting.
|
Looks like it isn't worth going above the number of physical cores... at least on this HW. |
Yes, and we shouldn't abandon io_uring on the 5.5 kernel. Let's see how it behaves when we use io_uring in the Socket implementation.
|
See #67 (comment). epoll+AIO and io_uring show similar performance. The iouring transport has better performance. Note that the transport batches sends, while the Socket-based implementations do not. It would be nice if we could make |
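Illustrating the batching point: with io_uring, many send SQEs can be queued up and flushed with a single `io_uring_submit` syscall, while an unbatched implementation pays one syscall per send. A hedged sketch (using `writev`-style SQEs, since `IORING_OP_SEND` only arrived in 5.6; the parallel arrays are an example-only structure):

```c
#include <liburing.h>
#include <sys/uio.h>

/* Queue one send per ready connection, then flush the whole batch
 * with a single syscall. fds/iovs are parallel arrays describing
 * the connections that have data to write. */
static int submit_send_batch(struct io_uring *ring,
                             const int *fds, struct iovec *iovs, int n)
{
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (sqe == NULL)
            break;              /* queue full: submit what we have */
        io_uring_prep_writev(sqe, fds[i], &iovs[i], 1, 0);
    }
    return io_uring_submit(ring); /* one syscall for all queued sends */
}
```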
… / SMT Benchmarks have shown that it is not profitable to go beyond the number of physical cores: tmds/Tmds.LinuxAsync#39 (comment)
@tmds I guess you could close axboe/liburing#97 as well... even though we've only seen 96% utilization so far and never 99% 😉 |
The io_uring based implementation needs kernel 5.5, which isn't available on the benchmark infrastructure.
Is it feasible to install kernel 5.5? That would be my preference, because it is also what end-users should have for io_uring usage.
If it is easier to use kernel 5.4 on the benchmark machines, we can make some modifications to the implementation. This is mainly to deal with the lack of IORING_FEAT_SUBMIT_STABLE, which allows some memory to be re-used as soon as the request is submitted, instead of keeping it pinned until the operation completes. (Lack of IORING_FEAT_NODROP means the implementation must not be used in production, but that is not a blocker for benchmarking.)
@sebastienros, what is doable?
@lpereira and @antonfirsov may be able to assist with kernel upgrades.
cc @adamsitnik
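A minimal sketch of how the two feature flags mentioned above can be detected at startup (a hypothetical check, not the repo's actual code):

```c
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(512, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return EXIT_FAILURE;
    }

    /* 5.4 lacks IORING_FEAT_SUBMIT_STABLE: SQE data must then stay
     * pinned until completion, not just until submission. */
    if (!(ring.features & IORING_FEAT_SUBMIT_STABLE))
        fprintf(stderr, "no SUBMIT_STABLE: keep buffers pinned\n");

    /* Without IORING_FEAT_NODROP completions can be dropped under
     * overload -- acceptable for benchmarking, not for production. */
    if (!(ring.features & IORING_FEAT_NODROP))
        fprintf(stderr, "no NODROP: benchmark use only\n");

    io_uring_queue_exit(&ring);
    return EXIT_SUCCESS;
}
```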