Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io_uring benchmarks don't max CPU #39

Closed
tmds opened this issue Feb 27, 2020 · 60 comments · Fixed by #67
Closed

io_uring benchmarks don't max CPU #39

tmds opened this issue Feb 27, 2020 · 60 comments · Fixed by #67

Comments

@tmds
Copy link
Owner

tmds commented Feb 27, 2020

The io_uring based implementation needs a kernel 5.5, which isn't available on benchmark infrastructure.

Is it feasible to install kernel 5.5? It would be my preference because that is also what end-users should have for io_uring usage.

If it is more easy to use kernel 5.4 on the benchmark machines, we can make some modifications to the implementation. This is mainly to deal with the lack of IORING_FEAT_SUBMIT_STABLE which allows some memory to be re-used as soon as the request is submitted, and not have to keep it pinned until the operation completes.
(Lack of IORING_FEAT_NODROP means the implementation must not be used in production, but that is not a stopper for benchmarking.)

@sebastienros, what is doable?
@lpereira and @antonfirsov may be able to assist with kernel upgrades.

cc @adamsitnik

@lpereira
Copy link

Sure, I can help with the kernel upgrades. Just hit me up and we can do this.

@antonfirsov
Copy link
Collaborator

Kernel 5.5.0-050500-generic has been installed on asp-perf-lin. We can do benchmarks against IOUringAsyncEngine now.

I also installed today's master of liburing, just in case.

@tmds
Copy link
Owner Author

tmds commented Mar 2, 2020

That's great!

I need to make a change similar to #37, and then we can run some benchmarks.

@tmds
Copy link
Owner Author

tmds commented Mar 2, 2020

asp-perf-lin

@adamsitnik is this the machine you can perfcollect?
If not, @antonfirsov can you update that machine too?

@adamsitnik
Copy link
Collaborator

is this the machine you can perfcollect?

Yes :D

@tmds what benchmarks do you want me to run now?

@tmds
Copy link
Owner Author

tmds commented Mar 2, 2020

I need to make a change similar to #37, and then we can run some benchmarks.

done in #43

Let's see how AIO and io_uring compare.
The following moves ProcessReceives to the epoll/io_uring thread.
It includes a w=true (wait-for-data-available, not batcheable by AIO), and a w=false (no wait-for-data-available).
I've included a t=4 and a t=3.

--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256

@adamsitnik
Copy link
Collaborator

--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           449,132
Max CPU (%):                 99
WorkingSet (MB):             179
Avg. Latency (ms):           1.21
Startup (ms):                2
First Request (ms):          86.23
Latency (ms):                1.13
Total Requests:              6,781,866
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             7,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           415,036
Max CPU (%):                 82
WorkingSet (MB):             194
Avg. Latency (ms):           0.94
Startup (ms):                2
First Request (ms):          92.97
Latency (ms):                0.59
Total Requests:              6,258,530
Duration: (ms)               15,080
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           380,911
Max CPU (%):                 99
WorkingSet (MB):             185
Avg. Latency (ms):           1.71
Startup (ms):                2
First Request (ms):          94.69
Latency (ms):                0.54
Total Requests:              5,751,787
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           354,067
Max CPU (%):                 84
WorkingSet (MB):             206
Avg. Latency (ms):           1.21
Startup (ms):                2
First Request (ms):          95.47
Latency (ms):                0.42
Total Requests:              5,346,124
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           453,778
Max CPU (%):                 99
WorkingSet (MB):             178
Avg. Latency (ms):           0.98
Startup (ms):                2
First Request (ms):          94.86
Latency (ms):                0.24
Total Requests:              6,850,137
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           421,350
Max CPU (%):                 85
WorkingSet (MB):             179
Avg. Latency (ms):           0.87
Startup (ms):                2
First Request (ms):          92.67
Latency (ms):                0.47
Total Requests:              6,362,188
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           442,319
Max CPU (%):                 99
WorkingSet (MB):             178
Avg. Latency (ms):           1.22
Startup (ms):                2
First Request (ms):          90.57
Latency (ms):                0.46
Total Requests:              6,678,806
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           361,560
Max CPU (%):                 87
WorkingSet (MB):             213
Avg. Latency (ms):           0.87
Startup (ms):                2
First Request (ms):          91.88
Latency (ms):                0.7
Total Requests:              5,459,224
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,001
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1

@adamsitnik adamsitnik mentioned this issue Mar 5, 2020
@adamsitnik
Copy link
Collaborator

@tmds I've re-run the benchmarks with tracing enabled. Files uploaded to g drive as usuall.

--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           456,252
Max CPU (%):                 99
WorkingSet (MB):             187
Avg. Latency (ms):           1.14
Startup (ms):                2
First Request (ms):          100.81
Latency (ms):                0.36
Total Requests:              6,887,495
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             6,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           423,139
Max CPU (%):                 82
WorkingSet (MB):             197
Avg. Latency (ms):           0.92
Startup (ms):                2
First Request (ms):          111.49
Latency (ms):                0.29
Total Requests:              6,389,048
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
The specified server url 'http://asp-perf-lin:5001' is invalid or not responsive.
RequestsPerSecond:           444,370
Max CPU (%):                 99
WorkingSet (MB):             207
Avg. Latency (ms):           1.68
Startup (ms):                2
First Request (ms):          108.78
Latency (ms):                0.37
Total Requests:              6,700,780
Duration: (ms)               15,080
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
The specified server url 'http://asp-perf-lin:5001' is invalid or not responsive.
RequestsPerSecond:           351,740
Max CPU (%):                 85
WorkingSet (MB):             190
Avg. Latency (ms):           1.15
Startup (ms):                2
First Request (ms):          117.08
Latency (ms):                0.39
Total Requests:              5,311,294
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,001
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           446,739
Max CPU (%):                 99
WorkingSet (MB):             202
Avg. Latency (ms):           0.89
Startup (ms):                2
First Request (ms):          115.15
Latency (ms):                0.24
Total Requests:              6,745,718
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,001
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           420,723
Max CPU (%):                 84
WorkingSet (MB):             205
Avg. Latency (ms):           0.88
Startup (ms):                2
First Request (ms):          120.45
Latency (ms):                0.34
Total Requests:              6,352,910
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,001
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=epoll" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           447,583
Max CPU (%):                 99
WorkingSet (MB):             191
Avg. Latency (ms):           1.12
Startup (ms):                2
First Request (ms):          111.27
Latency (ms):                0.41
Total Requests:              6,758,385
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           357,530
Max CPU (%):                 87
WorkingSet (MB):             206
Avg. Latency (ms):           0.87
Startup (ms):                2
First Request (ms):          118.6
Latency (ms):                0.61
Total Requests:              5,398,674
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,001
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1

@tmds
Copy link
Owner Author

tmds commented Mar 5, 2020

Thanks Adam!
It seems we're not maxing CPU with iouring, which is surprising.
Let's increase t, CPU should go up, and RPS also (until we're at 100%).

--path "/json" --arg "-e=iouring" --arg "-t=5" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=5" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=6" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=6" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=7" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=7" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=8" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=8" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256

@tmds
Copy link
Owner Author

tmds commented Mar 5, 2020

Can you also run it for t=2 and t=1?

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 5, 2020

Shouldn't we also try -s=true -r=true with io_uring to eliminate the extra syscalls when send() or recv() returns EWOULDBLOCK?

@tmds
Copy link
Owner Author

tmds commented Mar 5, 2020

Setting c=false is enough to move the receive calls to the epoll/io_uring thread and cause them to be batched. This gives us an idea of how io_uring and AIO compare for batching.

We haven't done much benchmarking with s=true and I'm not sure how well it behaves.

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 6, 2020

By adding this logging logic, I get the following printed with -c false:

Trying to run Accept synchronously.
Trying to run Accept synchronously.
Trying to run Accept synchronously.
Trying to run Accept synchronously.
Trying to run Receive synchronously.
Calling recv
Trying to run Receive synchronously.
Calling recv
Trying to run Receive synchronously.
Calling recv
Trying to run Accept synchronously.
Trying to run Send synchronously.
Calling send

-s true -r true disables the synchronous attempts.

@tmds
Copy link
Owner Author

tmds commented Mar 6, 2020

The first request is synchronous because data is already available. Next requests will wait for data and cause receives to move to the epoll/io_uring thread. Check what happens when you use wrk.

@adamsitnik
Copy link
Collaborator

as usual, traces uploaded to g drive ;)

--path "/json" --arg "-e=iouring" --arg "-t=1" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           259,132
Max CPU (%):                 76
WorkingSet (MB):             186
Avg. Latency (ms):           1
Startup (ms):                17
First Request (ms):          140.84
Latency (ms):                0.28
Total Requests:              3,912,789
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             8,504
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=1" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           159,883
Max CPU (%):                 73
WorkingSet (MB):             203
Avg. Latency (ms):           1.6
Startup (ms):                2
First Request (ms):          98.68
Latency (ms):                0.54
Total Requests:              2,414,115
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=2" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           422,083
Max CPU (%):                 87
WorkingSet (MB):             199
Avg. Latency (ms):           0.91
Startup (ms):                2
First Request (ms):          118.14
Latency (ms):                0.59
Total Requests:              6,373,081
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=2" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           307,507
Max CPU (%):                 89
WorkingSet (MB):             202
Avg. Latency (ms):           0.87
Startup (ms):                2
First Request (ms):          115.13
Latency (ms):                0.47
Total Requests:              4,643,100
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             5,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=5" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           381,922
Max CPU (%):                 80
WorkingSet (MB):             207
Avg. Latency (ms):           1.14
Startup (ms):                1
First Request (ms):          118.84
Latency (ms):                0.45
Total Requests:              5,767,051
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=5" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           342,477
Max CPU (%):                 84
WorkingSet (MB):             200
Avg. Latency (ms):           1.22
Startup (ms):                2
First Request (ms):          117.69
Latency (ms):                0.52
Total Requests:              5,171,343
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=6" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           386,634
Max CPU (%):                 78
WorkingSet (MB):             193
Avg. Latency (ms):           0.92
Startup (ms):                2
First Request (ms):          112.25
Latency (ms):                0.36
Total Requests:              5,837,899
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=6" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           354,809
Max CPU (%):                 83
WorkingSet (MB):             193
Avg. Latency (ms):           0.96
Startup (ms):                2
First Request (ms):          116.29
Latency (ms):                0.46
Total Requests:              5,357,356
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=7" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           300,031
Max CPU (%):                 75
WorkingSet (MB):             202
Avg. Latency (ms):           1.04
Startup (ms):                2
First Request (ms):          116.05
Latency (ms):                0.41
Total Requests:              4,530,453
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=7" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
RequestsPerSecond:           336,500
Max CPU (%):                 82
WorkingSet (MB):             202
Avg. Latency (ms):           0.99
Startup (ms):                2
First Request (ms):          118.31
Latency (ms):                0.47
Total Requests:              5,080,931
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=8" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
RequestsPerSecond:           359,407
Max CPU (%):                 75
WorkingSet (MB):             186
Avg. Latency (ms):           0.98
Startup (ms):                2
First Request (ms):          108.09
Latency (ms):                0.31
Total Requests:              5,427,057
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,002
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1
--path "/json" --arg "-e=iouring" --arg "-t=8" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
The specified server url 'http://asp-perf-lin:5001' is invalid or not responsive.
RequestsPerSecond:           339,171
Max CPU (%):                 82
WorkingSet (MB):             203
Avg. Latency (ms):           0.97
Startup (ms):                2
First Request (ms):          114.03
Latency (ms):                0.49
Total Requests:              5,121,465
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,502
Published Size (KB):         1,062
SDK:                         5.0.100-preview.1.20106.1
Runtime:                     5.0.0-preview.2.20154.1
ASP.NET Core:                5.0.0-preview.2.20155.1

@tmds
Copy link
Owner Author

tmds commented Mar 6, 2020

Putting these results in a table:

RPS aio   io_uring   io_uring (cpu)
t wait no wait wait no wait no wait
1     159883 259132 76
2     307507 422083 87
3 447583 446739 357530 420723 82
4 444370 456252 351740 423139 84
5     342477 381922 80
6     386634 381922 78
7     336500 300031 75
8     339171 359407 75

We're unable to max CPU with io_uring. I hope we can identify the cause in the trace files.
We do see io_uring uses less cpu per request.

@adamsitnik can you share the trace folder also with @antonfirsov , and with @lpereira in case he also wants to take a look?

I'll look at the trace files next week.

@adamsitnik
Copy link
Collaborator

Anton already has the access. I'll try to take a look as well (around Monday).

@tmds
Copy link
Owner Author

tmds commented Mar 9, 2020

@adamsitnik something went wrong with some of the trace files. Usually they are 60MB in size. Some are <10MB and don't show anything useful in PerfView.

I've opened up the 03-05-19-29-23.RPS-421K trace.
.NET io_uring threads have a Metric/Interval of 0.50.
.NET ThreadPool thread of 0.62.
There is also a io_wqe_worker-1 thread, I think this is the kernel work queue used by io_uring. That one has a Metric/Interval of 2.78. What does a value > 1 mean? Maybe this represents multiple threads? Only 1/4 of samples involve doing work (io_worker_handle_work). A lot of samples show idle handling (do_idle).

In our code we set submission queue length to 512:

private const int SubmissionQueueRequestedLength = 512; // TODO

Most operations take 2 submission entries (1 to poll for ready, and 1 to do the operation). So with 256 connections handled by 1 io_uring thread it could be a bottleneck if all submissions are made together. With multiple threads, it should not be a bottleneck. We can increase the value just in case.

Let me know what you make of this...

@tmds
Copy link
Owner Author

tmds commented Mar 9, 2020

It would be interesting to benchmark using https://github.com/tkp1n/IoUring.Transport and see if it can fully load the benchmark machine (100% cpu). We can add it as an option to the test/web app.

@lpereira
Copy link

lpereira commented Mar 9, 2020

Might be a good idea to try a newer kernel, too. Although io_uring was released with 5.5.0, there has been numerous changes and improvements since then (either adding new opcodes, or performance enhancements). I'd try with 5.6-rc5.

@antonfirsov
Copy link
Collaborator

@tmds good idea, I will give it a try

@tmds
Copy link
Owner Author

tmds commented Mar 10, 2020

With https://github.com/tmds/Tmds.LinuxAsync/pull/56/files we can benchmark with IoUring.Transport also.

Let's see if it also shows a bottleneck.
@adamsitnik can you run these benchmarks?

--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--path "/json" --arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256
--path "/json" --arg "-e=iouringtransport" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256

@antonfirsov
Copy link
Collaborator

With trace.03-05-19-29-23.RPS-421K.trace.zip, which has been collected with the following parameters, as far as I understand:

--path "/json" --arg "-e=iouring" --arg "-t=3" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256

I see a significant amount of time being sent in send:
image

Shouldn't we try batching sends on the io_uring thread?

@tmds
Copy link
Owner Author

tmds commented Mar 11, 2020

Shouldn't we try batching sends on the io_uring thread?

Yes, it's something we need to explore also.

If you check cpu on the thread pool threads (which are performing the sends), you should see they are not 100% loaded. So this is not causing the bottleneck we see with io_uring.

I try to focus on benchmarks that tell us something specific at this point. We also need to do overall benchmarks that combine multiple options and find out what is the best we can achieve.

@antonfirsov
Copy link
Collaborator

I will try running a few new benchmarks, I think I have all the information now to do so.

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 11, 2020

I managed to run the requested benchmarks, results and traces are available here:
https://microsoft-my.sharepoint.com/:f:/p/anfirszo/EkqER0svzQlJsYVH9EXVkosBElKc2q1K2wFI4TtC4L1mpg?e=hBMq37

Summary:

ARGS RPS
--arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256 413,995
--arg "-e=iouring" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=true" --connections 256 348,040
--arg "-e=iouring" --arg "-t=4" --arg "-s=true" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256 386,336
--arg "-e=iouringtransport" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256 435,823

Looks like IoUring.Transport outperforms our io_uring implementation, but still not as good as our best epoll results. I also tried moving sends to the uring thread by naively setting -s=true, but that introduced extra blocking on the thread pool:

image

@tmds
Copy link
Owner Author

tmds commented Mar 11, 2020

Thank you for running benchmarks Anton.

Looks like IoUring.Transport outperforms our io_uring implementation, but still not as good as our best epoll results. I also tried moving sends to the uring thread by naively setting -s=true, but that introduced extra blocking on the thread pool:

A Transport has the advantage it is a simpler abstraction than Socket, so it can always perform a bit better.
To batch sends, I think using the io_uring thread as PipeScheduler will probably give best results, see #49 (comment).

From the benchmark results on sharepoint, I see iouringtransport has Max CPU of 81% so it also encounters a bottleneck that prevents it from using all CPU.

@antonfirsov
Copy link
Collaborator

@tkp1n you might be also interested in the results.

@adamsitnik
Copy link
Collaborator

What does a value > 1 mean

A value of 1 indicates a program that on average consumes all the CPU from a single processor. 3 means that we are using 100% of the 3 CPUs

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 12, 2020

@tmds @adamsitnik when running the benchmarks, on asp-perf-load wrk tops around 750% on a 12 core machine (should be 1200% I guess), so load generation is likely a bottleneck.

We should probably increase --client-threads and --connections.

@antonfirsov
Copy link
Collaborator

Had a quick chat with @sebastienros, we should utilize BenchmarkDriver2 and the Json-Ulib scenario to get this right.

Will look into it tomorrow. /CC @alnikola @adamsitnik.

@tmds
Copy link
Owner Author

tmds commented Mar 13, 2020

@tmds @adamsitnik when running the benchmarks, on asp-perf-load wrk tops around 750% on a 12 core machine (should be 1200% I guess), so load generation is likely a bottleneck.

We want the load machine to not have a full CPU load while running the benchmarks. This means it is able to provide the required load without being a bottleneck.

750% could be a bottleneck if --client-threads 7 or 8. If --client-threads is 12, the machine is probably not the bottleneck we're looking for. If you run htop during the benchmark you can visually verify whether the load spreads across all CPUs, and uses 60-80% per CPU.

@tmds
Copy link
Owner Author

tmds commented Mar 16, 2020

I had another look at the trace files. I noticed the same things as mentioned in #39 (comment).
I then looked at the 414k trace file from #39 (comment).
The difference between these files should be t=3 vs t=4. The latter trace doesn't have a io_wqe_worker-1, and idle handling is in the 'swapper' instead. I don't know why changing t has this effect.

@tkp1n
Copy link
Contributor

tkp1n commented Mar 17, 2020

As mentioned in axboe/liburing#97 (comment) we should not use IOSQE_IO_LINK.

I'll address this in IoUring.Transport (as it's easier for me to change) and make it available for re-testing. If we see full CPU usage in the benchmarks, we should adapt the io_uring implementation here as well.

@tmds your thoughts on this?

@tmds
Copy link
Owner Author

tmds commented Mar 17, 2020

Let's verify first if this now maxes CPU.

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 17, 2020

@tkp1n please let us know when you're done, I'm happy to redo the benchmarks. Do you plan to expose a parameter on IoUringOptions for linking behavior?

@tkp1n
Copy link
Contributor

tkp1n commented Mar 17, 2020

A very unstable first draft of the IoUring.Transport was already written not to use LINK. I'll revert to this logic (minus the bugs hopefully) on master without adding an option at first.

Adding an option would mean major code duplication or delay for the time I need to come up with a proper abstraction to avoid the duplication...

Please let me know if an option would make things significantly easier on your side though!

@tkp1n
Copy link
Contributor

tkp1n commented Mar 17, 2020

@antonfirsov I just pushed a commit to master to remove all intsances of IOSQE_IO_LINK from IoUring.Transport. Feel free to use the latest version on MyGet to re-run the benchmarks.

Let's hope for full CPU utilization :)

If the results are non-obvious, I'll add the discussed option to let us switch between the two approaches quickly.

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 18, 2020

I have executed PlatformBenchmarks json benchmark in all configurations, and it looks like CPU is now utilized:

RPS CPU % Latency
Sockets 483,047 98 1.21
LinuxTransport 612,310 98 1.22
IoUring old 507,624 64 0.92
IoUring master 573,762 96 1.14

@CarterLi
Copy link

If the results are non-obvious, I'll add the discussed option to let us switch between the two approaches quickly.

Probably a switch that disable POLL, because in Linux 5.7+ kernel will handle it internally.

See: https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.7/io_uring&id=d7718a9d25a61442da8ee8aeeff6a0097f0ccfd6

Another option is to test IORING_FEAT_FAST_POLL feature flag, so that we can choose the best method

@tmds
Copy link
Owner Author

tmds commented Mar 18, 2020

I have executed PlatformBenchmarks json benchmark in all configurations

@antonfirsov is the IoUring old higher than previous results because you are running a different benchmark (i.e. platform vs middleware)?

Did you use Sockets-based Transport from this repo?

Can we see how Sockets-based Transport from this repo using AIO, compares to IoUring master?

--arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--arg "-e=iouringtransport" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256

@tkp1n
Copy link
Contributor

tkp1n commented Mar 18, 2020

Given above test results from @antonfirsov I think a switch to enable LINK is useless as perf is clearly better without it.

A switch to disable POLL as propoesed by @CarterLi will come in handy, when we start comparing the current approach to a POLL-less approach once 5.7 is stable.

Kernel version 5.7 seems to provide a much easier programming model for network developers and also better perf. I could also live with a decision not to use io_uring prior to 5.7 and have one clean and fast implementation based on the 5.7 features. This doesn't mean we should stop experimenting here, but as we have stable and well-performing alternatives to io_uring there is no immediate need to make it work at peak performance with versions prior to 5.7.

@tmds
Copy link
Owner Author

tmds commented Mar 18, 2020

These are the different ways we are polling:

  • Sockets: uses EPOLLET this means each operation has to be tried and fail with EAGAIN before it can assume epoll will raise an event. This had the advantage no explicit calls have to be made to re-arm poll (epoll_ctl), and in case the attempt is successful, there is no need to poll.
  • LinuxTransport: EPOLLONESHOT re-arms epoll each time. So for each Socket receive there is an epoll_ctl call that is not batcheable.
  • Tmds.LinuxAsync Sockets: uses EPOLLET also. The attempted receives may be batched under specifc conditions (not zero-byte receives, c=false).
  • IoUring old: poll and operations are linked and batched
  • IoUring: poll and operations are batched but no longer linked

@adamsitnik I never mentioned explicitly the difference between EPOLLONESHOT and EPOLLET usage. I decided to use EPOLLET here also (unlike LinuxTransport) due to the unbatcheable epoll_ctl calls.

@tmds
Copy link
Owner Author

tmds commented Mar 18, 2020

@adamsitnik @antonfirsov I'd like to run benchmarks from #39 (comment) so we have an idea of the potential gain of io_uring compared to AIO at 5.5 kernel.

@antonfirsov do you want to update the implementation so poll+operation are no longer batched? It should be possible to localize changes to IOUringAsyncEngine class.

@tmds
Copy link
Owner Author

tmds commented Mar 18, 2020

I'd like to run benchmarks from #39 (comment)

Please run them for a range of t=1..ProcessorCount. Especially IoUringTransport needs a higher t because the sends are on io_uring thread.

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 18, 2020

@antonfirsov Anton Firszov FTE is the IoUring old higher than previous results because you are running a different benchmark (i.e. platform vs middleware)?

Yes I've been running "raw" PlatformBenchmarks in to make the comparison fair. I will run all requested benchmarks from this repo in 1-2 hours.

do you want to update the implementation so poll+operation are no longer batched? It should be possible to localize changes to IOUringAsyncEngine class.

Yes, I'll do that, hopefully by the end of the day.

@antonfirsov
Copy link
Collaborator

Can we see how Sockets-based Transport from this repo using AIO, compares to IoUring master?

> -e=epoll:

RequestsPerSecond:           448,373
Max CPU (%):                 99
WorkingSet (MB):             196
Avg. Latency (ms):           1.11

> -e=iouringtransport

RequestsPerSecond:           419,183
Max CPU (%):                 96
WorkingSet (MB):             177
Avg. Latency (ms):           0.79

> -e=iouringtransport -i=true

RequestsPerSecond:           272,128
WorkingSet (MB):             209
Avg. Latency (ms):           0.97

@tmds the last one is very surprising to me. When running PlatformBenchmarks with the default setup (t=12), setting options.ApplicationSchedulingMode=PipeScheduler.Inline improves perf a lot.

@tmds
Copy link
Owner Author

tmds commented Mar 18, 2020

@tmds the last one is very surprising to me. When running PlatformBenchmarks with the default setup (t=12), setting options.ApplicationSchedulingMode=PipeScheduler.Inline improves perf a lot.

--arg "-e=epoll" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256
--arg "-e=iouringtransport" --arg "-t=4" --arg "-s=false" --arg "-r=false" --arg "-c=false" --arg "-a=true" --arg "-w=false" --connections 256

Can you run these benchmarks for t=1..ProcessorCount?

You can also do benchmarks for both i=true and i=false if you want. If you do those for iouringtransport, run them against epoll also.

@antonfirsov
Copy link
Collaborator

antonfirsov commented Mar 18, 2020

I was unsure if you meant physical or logical ProcessorCount, so did it for t=1..12.

RPS values:

t epoll epoll+i iourt iourt+i
1 323,062 120,800 123,420 121,355
2 395,237 172,156 205,454 224,303
3 444,085 228,824 308,281 330,000
4 436,653 257,696 434,057 394,257
5 438,405 308,799 467,698 450,207
6 452,817 317,180 469,067 518,186
7 423,527 319,580 459,151 500,609
8 428,509 320,677 458,801 492,271
9 401,790 334,822 456,361 481,208
10 410,495 342,311 453,251 476,992
11 391,220 331,554 453,505 486,638
12 416,274 338,851 446,474 483,669

image

Latency values:

t epoll epoll+i iourt iourt+i
1 0.36 2.12 2.24 2.22
2 0.79 1.49 1.33 1.18
3 1.03 1.14 0.92 0.81
4 1.09 1.02 0.77 0.68
5 1.29 1.03 1 0.59
6 1.12 1.46 1.31 0.83
7 1.53 1.87 1.55 0.76
8 1.53 2.07 1.61 0.77
9 1.66 2.26 1.7 0.7
10 1.72 2.41 1.74 0.74
11 1.72 2.67 1.52 0.77
12 1.24 2.59 1.44 0.7

image

@lpereira
Copy link

lpereira commented Mar 19, 2020 via email

@antonfirsov
Copy link
Collaborator

Looks like it doesn't worth to go above the number of physical cores ... at least on this HW.

@tmds
Copy link
Owner Author

tmds commented Mar 19, 2020

Ooh, these numbers are starting to look interesting.

Yes, we shouldn't abandon io_uring at 5.5 kernel.

Let's see how it behaves when we use io_uring in Socket implementation.

You can also do benchmarks for both i=true and i=false if you want. If you do those for iouringtransport, run them against epoll also.

i=true controls a PipeScheduler. We don't understand how i=true affects performance. This option was added to improve performance, and then measurements showed it decreased performance significantly with epoll: #41.
With io_uring it has the intended effect. Based on previous measurements epoll with i=true may be unable to Max CPU even at higher values of t. It would be useful to understand what is happening underneath.

@tmds
Copy link
Owner Author

tmds commented Mar 20, 2020

See #67 (comment). epoll+AIO and io_uring show similar performance. The iouring transport has better performance. Note that the transport is batching sends, while the Socket based implementations are not.

It would be nice if we can make i=true a gain as we've seen here for iouring transport. We'll look at that in #41.

tkp1n added a commit to tkp1n/IoUring.Transport that referenced this issue Mar 21, 2020
… / SMT

Benchmarks have shown that it is not profitable to go beyond the number of physical cores:
tmds/Tmds.LinuxAsync#39 (comment)
@tkp1n
Copy link
Contributor

tkp1n commented Mar 21, 2020

@tmds I guess you could close axboe/liburing#97 as well... even though we only saw 96% utilization yet and never 99% 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants