Benchmark parameters for Citrine runs #78

Open
antonfirsov opened this issue Mar 26, 2020 · 15 comments

@antonfirsov
Collaborator

antonfirsov commented Mar 26, 2020

Let's define the parameters we want to use for extensive Citrine runs (or at least for the first one).

The machines are mine for Friday (big thanks to Sebastien!), but my naive approach of trying all combinations of the major parameters defines too many jobs, even for a whole-day run. Are some of these combinations not worth including, even in a comprehensive analysis?

I'm using the syntax of my new tool for #74 to define the benchmarks.

DefaultTransport to get a baseline:

e=DefaultTransport
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

LinuxTransport for our information:

e=LinuxTransport
i=true
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

epoll combinations

Normally I would run epoll with all possible combinations of the important parameters, but the following definition would mean ~500 benchmark executions. It would be nice to reduce that.

e=epoll
s=false
r=false
w=false
c=true,false
i=false,true
a=false,true
o=inline,iothread,ioqueue,threadpool
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
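The ~500 figure is just the Cartesian product of the multi-valued parameters (the fixed s/r/w values don't multiply the count). A quick sanity check, with the values copied from the definition above:

```python
from itertools import product

# Multi-valued parameters from the epoll definition above.
params = {
    "c": ["true", "false"],
    "i": ["false", "true"],
    "a": ["false", "true"],
    "o": ["inline", "iothread", "ioqueue", "threadpool"],
    "t": [4, 6, 8, 10, 12, 13, 14, 15, 16, 17, 18, 20, 22, 24, 26, 28, 30],
}

combos = list(product(*params.values()))
print(len(combos))  # 2 * 2 * 2 * 4 * 17 = 544 benchmark executions
```

Cutting any one binary parameter halves the matrix, and trimming the t list has the biggest absolute effect since it contributes a factor of 17.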

ThreadPool with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0

I would set c=true here:

e=epoll
s=false
r=false
w=false
c=true
i=true
a=true
o=threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30

io_uring combinations

e=iouring
s=false
r=false
w=false
c=true,false
i=false,true
o=inline,iothread,ioqueue,threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30
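The same expansion applied to the io_uring definition above gives the job count, and can also spell out each job as a parameter string (the string formatting here is my own illustration, not necessarily the tool's actual output):

```python
from itertools import product

# All parameters from the io_uring definition above; single-valued
# entries are included so each generated job is a complete definition.
params = {
    "e": ["iouring"],
    "s": ["false"],
    "r": ["false"],
    "w": ["false"],
    "c": ["true", "false"],
    "i": ["false", "true"],
    "o": ["inline", "iothread", "ioqueue", "threadpool"],
    "t": [1, 2, 4, 6, 8, 10, 11, 12, 13, 14, 15, 16,
          17, 18, 19, 20, 22, 24, 26, 28, 30],
}

jobs = [
    " ".join(f"{k}={v}" for k, v in zip(params, combo))
    for combo in product(*params.values())
]
print(len(jobs))   # 2 * 2 * 4 * 21 = 336 executions
print(jobs[0])     # e=iouring s=false r=false w=false c=true i=false o=inline t=1
```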

@tmds @adamsitnik anything missing? Which combos should I cut?

@adamsitnik
Collaborator

I would cut a few of the thread counts to reduce the number of runs.

t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

@tmds
Owner

tmds commented Mar 27, 2020

Baseline

I consider this the baseline:

e=epoll
s=false
r=false
w=false
c=threadpool
i=threadpool
a=false
o=ioqueue

Benchmarks focused on batching

batch by deferring receives

e=epoll/iouring
r=true
a=true

this should perform worse than

batch receives on poll thread

e=epoll/iouring
c=inline
a=true

continue inline without batching

It is also interesting to run the continuations inline, without batching, to differentiate between the effects of batching and of inline continuations.

e=epoll
c=inline
a=false

note: iouring doesn't have a mode that disables batching.

Benchmarks focused on scheduling

e=epoll
c=inline/threadpool
i=inline/threadpool
o=inline,iothread,ioqueue,threadpool
a=true[,false]

I'm not sure if we should include a=false. I assume it can only be better, but better to not assume?
Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0.

For iouring, similar results are expected, so I'd limit it to:

e=iouring
c=inline
i=inline
o=inline,iothread,ioqueue,threadpool

@antonfirsov
Collaborator Author

I'm not sure if we should include a=false. I assume it can only be better, but better to not assume?

I think the point is to have a proper comparison between AIO / no AIO.

Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0

Doesn't this affect only ThreadPool schedulers? Isn't it a waste of time to run all the rest?

@tmds
Owner

tmds commented Mar 27, 2020

DefaultTransport to get a baseline:

e=DefaultTransport

I consider this one interesting too. It's important to note that t controls the IOQueue count.
If we are using a daily ASP.NET Core build, maybe we can also implement and set w=false?

LinuxTransport for our information:

e=LinuxTransport
i=true

For max performance, you should also set s=true.

@tmds
Owner

tmds commented Mar 27, 2020

Doesn't this affect only ThreadPool schedulers? Isn't it waste of time to run all the rest?

IOQueue is also a ThreadPool scheduler.
And I think Kestrel will always use the ThreadPool for dispatching the HTTP handling in its KestrelConnection class.

Maybe these could be left out for COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0:

c=inline
i=inline
o=inline,iothread
a=true[,false]

@tmds
Owner

tmds commented Mar 27, 2020

@adamsitnik @antonfirsov do we want to run middleware json/platform json (#32)? I'm fine using middleware, but maybe you have a specific preference for platform?

Note that the pipelined plaintext benchmark will suffer from o=inline, since every response will be sent separately instead of being batched together, as happens with the other output schedulers.

@adamsitnik
Collaborator

do we want to run middleware json/platform json

I think we should use middleware and compare it with the current middleware implementation, which is around 750k RPS (link to PowerBI)

@antonfirsov
Collaborator Author

antonfirsov commented Mar 31, 2020

@tmds @adamsitnik you can check the results here:
https://microsoft-my.sharepoint.com/:x:/p/anfirszo/ETUPVQ8QN9BGmysfL5uDJswBpZsSrKZtuFaMtaoU7ifGUQ?e=s1H2gY

The grouping should be straightforward, but if it's not, I'm happy to answer questions. In several places there are multiple versions of the same diagram with different series enabled/disabled. Red lines in the table mark missing or outlier data.

@tmds does this help with getting insights? Is there anything unexpected to you? Anything else we should run?

@antonfirsov
Collaborator Author

antonfirsov commented Mar 31, 2020

@tmds as we discussed, I extended the ThreadPool scheduling benchmarks with t=1,2,3, and also added graphs comparing the impact of COMPlus_ThreadPool_UnfairSemaphoreSpinLimit. Its impact is only measurable for small t values.

[image]

@tmds
Owner

tmds commented Apr 1, 2020

Thanks Anton! The effect showing up mostly at lower t is expected. At lower t, more work comes in batches from the epoll thread to the ThreadPool.

tmds mentioned this issue Apr 1, 2020
@tmds
Owner

tmds commented Apr 3, 2020

@antonfirsov this is the combination we discussed that would also be interesting to benchmark on Citrine:

e=epoll
c=inline
i=threadpool
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

@tmds
Owner

tmds commented Apr 3, 2020

Anton, can you also run these benchmarks?

e=epoll
c=threadpool
i=inline
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

@antonfirsov
Collaborator Author

[image]

@tmds
Owner

tmds commented Apr 8, 2020

Thank you Anton!

@tmds
Owner

tmds commented Apr 8, 2020

We're missing

t=1
c=threadpool
i=inline

It's an interesting point on the graph (it should be the best case for c=threadpool,i=inline). I'm going to assume the same value as for t=2.
