Benchmark parameters for Citrine runs #78

Open
antonfirsov opened this issue Mar 26, 2020 · 15 comments

@antonfirsov
Collaborator

antonfirsov commented Mar 26, 2020

Let's define the parameters we want to use for extensive Citrine runs (or at least for the first one).

The machines are mine for Friday (big thanks to Sebastien!), but my naive approach of trying all combinations of the major parameters defines too many jobs, even for a whole-day run. Are some of these combinations not worth including, even in a comprehensive analysis?

I'm using the syntax of my new tool for #74 to define the benchmarks.

DefaultTransport to get a baseline:

e=DefaultTransport
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

LinuxTransport for our information:

e=LinuxTransport
i=true
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

epoll combinations

Normally I would run epoll with all possible combinations of the important parameters, but the following definition would mean ~500 benchmark executions. It would be nice to reduce that.

e=epoll
s=false
r=false
w=false
c=true,false
i=false,true
a=false,true
o=inline,iothread,ioqueue,threadpool
t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30
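The ~500 figure is just the Cartesian product of the multi-valued parameters (the fixed s/r/w values don't multiply the count). A quick sanity check, with the values copied from the definition above:

```python
from itertools import product

# Multi-valued parameters from the epoll definition above.
params = {
    "c": ["true", "false"],
    "i": ["false", "true"],
    "a": ["false", "true"],
    "o": ["inline", "iothread", "ioqueue", "threadpool"],
    "t": [4, 6, 8, 10, 12, 13, 14, 15, 16, 17, 18, 20, 22, 24, 26, 28, 30],
}

combos = list(product(*params.values()))
print(len(combos))  # 2 * 2 * 2 * 4 * 17 = 544 benchmark executions
```

Cutting any one binary parameter halves the matrix, and trimming the t list has the biggest absolute effect since it contributes a factor of 17.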

ThreadPool with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0

I would set c=true here:

e=epoll
s=false
r=false
w=false
c=true
i=true
a=true
o=threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30

io_uring combinations

e=iouring
s=false
r=false
w=false
c=true,false
i=false,true
o=inline,iothread,ioqueue,threadpool
t=1,2,4,6,8,10,11,12,13,14,15,16,17,18,19,20,22,24,26,28,30
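The same expansion applied to the io_uring definition above gives the job count, and can also spell out each job as a parameter string (the string formatting here is my own illustration, not necessarily the tool's actual output):

```python
from itertools import product

# All parameters from the io_uring definition above; single-valued
# entries are included so each generated job is a complete definition.
params = {
    "e": ["iouring"],
    "s": ["false"],
    "r": ["false"],
    "w": ["false"],
    "c": ["true", "false"],
    "i": ["false", "true"],
    "o": ["inline", "iothread", "ioqueue", "threadpool"],
    "t": [1, 2, 4, 6, 8, 10, 11, 12, 13, 14, 15, 16,
          17, 18, 19, 20, 22, 24, 26, 28, 30],
}

jobs = [
    " ".join(f"{k}={v}" for k, v in zip(params, combo))
    for combo in product(*params.values())
]
print(len(jobs))   # 2 * 2 * 4 * 21 = 336 executions
print(jobs[0])     # e=iouring s=false r=false w=false c=true i=false o=inline t=1
```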

@tmds @adamsitnik anything missing? Which combos should I cut?

@adamsitnik
Collaborator

I would cut a few of the thread counts to reduce the number of runs.

t=4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

@tmds
Owner

tmds commented Mar 27, 2020

Baseline

I consider this the baseline:

e=epoll
s=false
r=false
w=false
c=threadpool
i=threadpool
a=false
o=ioqueue

Benchmarks focused on batching

batch by deferring receives

e=epoll/iouring
r=true
a=true

this should perform worse than

batch receives on poll thread

e=epoll/iouring
c=inline
a=true

continue inline without batching

It is also interesting to run the continuations inline, without batching, to differentiate between the effects of batching and of inline continuations.

e=epoll
c=inline
a=false

note: iouring doesn't have a mode that disables batching.

Benchmarks focused on scheduling

e=epoll
c=inline/threadpool
i=inline/threadpool
o=inline,iothread,ioqueue,threadpool
a=true[,false]

I'm not sure if we should include a=false. I assume it can only be better, but better to not assume?
Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0.

For iouring, similar results are expected, so I'd limit it to:

e=iouring
c=inline
i=inline
o=inline,iothread,ioqueue,threadpool

@antonfirsov
Collaborator Author

I'm not sure if we should include a=false. I assume it can only be better, but better to not assume?

I think the point is to have a proper comparison between AIO / no AIO.

Since this is all scheduling, we should consider running all of these also with COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0

Doesn't this affect only ThreadPool schedulers? Isn't it a waste of time to run all the rest?

@tmds
Owner

tmds commented Mar 27, 2020

DefaultTransport to get a baseline:

e=DefaultTransport

I consider this one interesting too. It's important to note that t controls the IOQueue count.
If we are using a daily ASP.NET Core build, maybe we can also implement and set w=false?

LinuxTransport for our information:

e=LinuxTransport
i=true

For max performance, you should also set s=true.

@tmds
Owner

tmds commented Mar 27, 2020

Doesn't this affect only ThreadPool schedulers? Isn't it waste of time to run all the rest?

IOQueue is also a ThreadPool scheduler.
And I think Kestrel will always use the ThreadPool for dispatching the HTTP handling in its KestrelConnection class.

Maybe these could be left out for COMPlus_ThreadPool_UnfairSemaphoreSpinLimit=0:

c=inline
i=inline
o=inline,iothread
a=true[,false]

@tmds
Owner

tmds commented Mar 27, 2020

@adamsitnik @antonfirsov do we want to run middleware json/platform json (#32)? I'm fine using middleware, but maybe you have a specific preference for platform?

Note that the pipelined plaintext benchmark will suffer from o=inline, since every response will be sent separately instead of being batched together, as happens with the other output schedulers.

@adamsitnik
Collaborator

do we want to run middleware json/platform json

I think we should use middleware and compare it with the current middleware implementation, which is around 750k RPS (link to PowerBI)

@antonfirsov
Collaborator Author

antonfirsov commented Mar 31, 2020

@tmds @adamsitnik you can check the results here:
https://microsoft-my.sharepoint.com/:x:/p/anfirszo/ETUPVQ8QN9BGmysfL5uDJswBpZsSrKZtuFaMtaoU7ifGUQ?e=s1H2gY

The grouping should be straightforward, but if it's not, I'm happy to answer questions. In several places there are multiple versions of the same diagram with different series enabled/disabled. Red lines in the table mark missing or outlier data.

@tmds does this help with getting insights? Is there anything unexpected to you? Anything else we should run?

@antonfirsov
Collaborator Author

antonfirsov commented Mar 31, 2020

@tmds as we discussed, I extended the ThreadPool scheduling benchmarks with t=1,2,3, and also added graphs comparing the impact of COMPlus_ThreadPool_UnfairSemaphoreSpinLimit. Its impact is only measurable for small t values.

[image]

@tmds
Owner

tmds commented Apr 1, 2020

Thanks Anton! The effect showing up mostly at lower t is expected. At lower t, more work comes in batches from the epoll thread to the ThreadPool.

tmds mentioned this issue Apr 1, 2020
@tmds
Owner

tmds commented Apr 3, 2020

@antonfirsov this is the combination we discussed that would also be interesting to benchmark on Citrine:

e=epoll
c=inline
i=threadpool
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

@tmds
Owner

tmds commented Apr 3, 2020

Anton, can you also run these benchmarks?

e=epoll
c=threadpool
i=inline
o=inline,iothread,ioqueue,threadpool
a=true
t=1,2,3,4,6,8,10,12,13,14,15,16,17,18,20,22,24,26,28,30

@antonfirsov
Collaborator Author

[image]

@tmds
Owner

tmds commented Apr 8, 2020

Thank you Anton!

@tmds
Owner

tmds commented Apr 8, 2020

We're missing

t=1
c=threadpool
i=inline

It's an interesting point on the graph (it should be the best case for c=threadpool,i=inline). I'm going to assume the same value as for t=2.
