Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Cluster Sharding perf issue #5203

Open
carl-camilleri-uom opened this issue Aug 16, 2021 · 32 comments
Open

[PERF] Cluster Sharding perf issue #5203

carl-camilleri-uom opened this issue Aug 16, 2021 · 32 comments

Comments

@carl-camilleri-uom
Copy link

Version Information
Version of Akka.NET? 1.4.23
Which Akka.NET Modules? Akka.Cluster.Sharding

Describe the performance issue

A minimum viable repo that reproduces the issue is at https://github.com/carlcamilleri/benchmark-akka-cluster

Running two n2-standard-8 nodes in GCP (8 CPU @ 2.80 GHz and 32GB RAM) with Windows Server 2019 in GCP ("instance-1" and "instance-2"), and a third machine to run the benchmarks from

First check:
curl http://instance-1:5000/5
Response: akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2)

Therefore entity id 5 actor is hosted on instance-2 server

wrk -t48 -c400 -d30s http://instance-2:5000/5

Running 30s test @ http://instance-2:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 19.54ms 70.10ms 1.09s 94.22%
Req/Sec 2.41k 429.13 12.12k 86.40%
3360471 requests in 30.10s, 666.60MB read
Socket errors: connect 0, read 0, write 202, timeout 0
Requests/sec: 111645.29
Transfer/sec: 22.15MB

wrk -t48 -c400 -d30s http://instance-1:5000/5


Running 30s test @ http://instance-1:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 254.21ms 286.27ms 1.02s 78.54%
Req/Sec 119.09 127.71 0.94k 85.22%
73325 requests in 30.03s, 14.55MB read
Socket errors: connect 0, read 0, write 216, timeout 0
Requests/sec: 2441.41
Transfer/sec: 495.91KB

For interest I've also repeated test (1) (i.e. workload on the endpoint which requests the local actor) but with serialize-messages = on, and the result is:

Running 30s test @ http://instance-2:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 24.77ms 90.16ms 1.15s 94.33%
Req/Sec 1.89k 339.47 4.94k 90.42%
2579946 requests in 30.08s, 511.77MB read
Socket errors: connect 0, read 0, write 246, timeout 0
Requests/sec: 85760.95
Transfer/sec: 17.01MB

So Hyperion serialisation drops throughput from >111k to >85k, which is probably expected

Data and Specs

ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.

Expected behavior
Cross-Machine communication in Cluster Sharding expected to be faster.

Actual behavior
Cross-Machine communication in Cluster Sharding seems to be extremely slow and unusable for my use case (an OLTP workload)

Environment
.NET 5.0
Windows Server 2019
n2-standard-8 machine in GCP (8 CPU @ 2.80 GHz and 32GB RAM)

@Aaronontheweb
Copy link
Member

ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.

The local actor - is this actor hosted by the sharding system or is just a local actor floating in-memory?

@Aaronontheweb
Copy link
Member

Need to differentiate where the bottleneck is:

  • Sharding system
  • Remoting system

If it's the remoting system, there's things you can do to adjust but ultimately we have to replace our transport system (it's the lion's share of the v1.5 roadmap)

If it's the sharding system we can patch that too - opened a similar issue on the entity spawning side last week: #5190

Can you give us some color here to inform these numbers? We'll take a look at the source too.

@carl-camilleri-uom
Copy link
Author

Thanks, in reply to:

ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.

The local actor - is this actor hosted by the sharding system or is just a local actor floating in-memory?

Correct it's hosted by the sharding system, in both cases the REST endpoint performs this call:

var resPong = await _shardRegion.Ask<MsgPong>(new ShardEnvelope<MsgPing>(entityId, new MsgPing()));

So the "local" test is actually a call to the REST endpoint hosted on the machine where the sharding system has spawned the actor.

Happy to provide further details or tests that can help.

Thanks

@Aaronontheweb
Copy link
Member

My working theory on this is, assuming you're getting both numbers from the same sample:

  1. The hops used to route messages to remote Shards passing through the ShardRegion introduces additional routing latency - especially since you're not pipelining in that code sample;
  2. The actual ShardRegion actor and Shard actors themselves, functioning as routers, are doing so in an inefficient manner and introduces latency in both local + remote scenarios, but it's more visible in the latter for reasons related to the additional routing that has to occur;
  3. There is going to be inherent latency in the Akka.Remote I/O pipeline, which is unavoidable, but shouldn't be anywhere close to this severe at these levels.

@Aaronontheweb
Copy link
Member

For running this test against a remote sharded actor in your scenario, do you just run two processes and target a entity id not hosted on the HTTP target node?

@carl-camilleri-uom
Copy link
Author

Correct, basically the setup is as follows:
instance-1 : Windows VM running an instance of the code in the repo
instance-2 : Windows VM running an instance of the code in the repo
benchmark-machine : Makes calls to either instance-1 or instance-2

From benchmark-machine, curl http://instance-1:5000/5 returns akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2, so we know that in this case entity id = 5 is hosted by the sharding system on instance-2 machine.

So hitting the endpoint http://instance-2:5000/5 constitutes a call to the local actor, whereas http://instance-1:5000/5 constitutes a call to the remote actor

@Aaronontheweb
Copy link
Member

Thanks! I'll take a look at this and at the very least, resolve #3083

@Aaronontheweb
Copy link
Member

We've been able to reproduce the 2.4k msg/s figure exactly in the benchmarks we created on #5209

I'm working on using Phobos to do some end-to-end tracing on the shard routing system to see where the most time is piling up. Did a read of the code last night and nothing obvious jumps out at me, but there is a drastic difference in remote vs. local shard performance that is clearly visible and consistent across different hardware profiles.

@Aaronontheweb
Copy link
Member

First time a shard + entity is allocated, Phobos graph:

image

@Aaronontheweb
Copy link
Member

End to end trace of an already-created entity actor from a remote host:

image

@Aaronontheweb
Copy link
Member

A local shard routed version of that chart, for comparison:

image

image

@Aaronontheweb
Copy link
Member

Another example of slow message handling inside the shard / shard region actors going across node

image

@Aaronontheweb
Copy link
Member

I think I know where to look now - it looks like there's a combination of:

  1. Dispatch / context switching overhead - the gaps between spans on these charts, some of which will be Akka.Remote invocations. This could also be caused by buffering inside the ShardRegion but that won't show up explicitly in the traces here.
  2. Long-running message processing operations - some of those runtimes are going to be skewed by Phobos itself since there's some overhead during tracing but it shouldn't be much.

I think I have enough data to go off of - I'll get started on improving the figures in our benchmark.

@carl-camilleri-uom
Copy link
Author

Having done some checks I believe the perf. bottleneck is coming from the amount of boxing/unboxing that is performed by the ShardMessageExtractor

The attached ZIP contains:
patch_without-unboxing.diff - a patch that avoids unboxing in an extremely hacky way
patch_with-unboxing.diff - a patch that is the same as patch_without-unboxing.diff but simply enables boxing/unboxing
patches.zip

Without unboxing, the benchmark SingleRequestResponseToRemoteEntity results in ~350ms/op:

image

Simply adding boxing/unboxing drops perf to ~4s/op:
image

In comparison, on the same machine, the benchmark SingleRequestResponseToLocalEntity is also at ~350ms/op:
image

Possibly an approach could be a generic version of IMessageExtractor that then avoids the boxing/unboxing overhead in the custom ShardMessageExtractor implementation ? I believe this is the way it is being done in the JVM

Thanks

@Aaronontheweb
Copy link
Member

@carl-camilleri-uom that's great work. Explains why this issue is unique to sharding. I also noticed we're doing some things like converting the IMessageExtractor into a delegate at startup inside the ShardRegion, which likely does not help either.

Adding a generic IMessageExtractor should be doable, but we'll still need to clean up the delegate mess inside ShardRegion. I'll see if I can deal with the former first.

We can also add a benchmark for the HashCodeMessageExtractor if there isn't one already.

@Aaronontheweb
Copy link
Member

Aaronontheweb commented Aug 19, 2021

The changes you included in your patch file - I can't get them to run.

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.                  
 ---> System.ArgumentNullException: The message cannot be null. (Parameter 'message')                                   
   at Akka.Actor.Envelope..ctor(Object message, IActorRef sender, ActorSystem system) in D:\Repositories\olympus\akka.ne
t\src\core\Akka\Actor\Message.cs:line 28                                                                                
   at Akka.Actor.ActorCell.SendMessage(IActorRef sender, Object message) in D:\Repositories\olympus\akka.net\src\core\Ak
ka\Actor\ActorCell.cs:line 418                                                                                          
   at Akka.Actor.Futures.Ask[T](ICanTell self, Func`2 messageFactory, Nullable`1 timeout, CancellationToken cancellation
Token) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 143                                      
   at Akka.Actor.Futures.Ask[T](ICanTell self, Object message, Nullable`1 timeout, CancellationToken cancellationToken) 
in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 105                                             
   at Akka.Cluster.Benchmarks.Sharding.ShardMessageRoutingBenchmarks.SingleRequestResponseToRemoteEntity() in D:\Reposit
ories\olympus\akka.net\src\benchmark\Akka.Cluster.Benchmarks\Sharding\ShardMessageRoutingBenchmarks.cs:line 112         
   at BenchmarkDotNet.Autogenerated.Runnable_1.__Workload()                                                             
   at BenchmarkDotNet.Autogenerated.Runnable_1.WorkloadActionUnroll(Int64 invokeCount)                                  
   at BenchmarkDotNet.Engines.Engine.RunIteration(IterationData data)                                                   
   at BenchmarkDotNet.Engines.EngineFactory.Jit(Engine engine, Int32 jitIndex, Int32 invokeCount, Int32 unrollFactor)   
   at BenchmarkDotNet.Engines.EngineFactory.CreateReadyToRun(EngineParameters engineParameters)                         
   at BenchmarkDotNet.Autogenerated.Runnable_1.Run(BenchmarkCase benchmarkCase, IHost host)                             
   --- End of inner exception stack trace ---                                                                           
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor, Boo
lean wrapExceptions)                                                                                                    
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters
, CultureInfo culture)                                                                                                  
   at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)                                              
   at BenchmarkDotNet.Toolchains.InProcess.Emit.Implementation.RunnableProgram.Run(BenchmarkId benchmarkId, Assembly par
titionAssembly, BenchmarkCase benchmarkCase, IHost host)                                                                
ExitCode != 0 and no results reported                                                                                   
No more Benchmark runs will be launched as NO measurements were obtained from the previous run!                         

Could you send them as a PR instead?

@Aaronontheweb
Copy link
Member

So this is really an Akka.Remote performance issue. I replicated the actor hierarchy in its essence here: https://github.com/Aaronontheweb/RemotingBenchmark

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
  [Host]     : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
  DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT

Method MsgCount Mean Error StdDev
SingleRequestResponseToLocalEntity 10000 3.575 s 0.0681 s 0.0569 s

We can shave off maybe 25-30% of the performance overhead by improving the sharding system's efficiency of message handling, but it still comes down to Akka.Remote.

What's interesting here is that the benchmark you designed is an absolute worst case performance scenario for Akka.Remote:

this use case is pretty interesting too because:

  1. Using temporary actors on all of the Ask<T> operations eliminates our ability to cache deserialized IActorRefs on both sides of the connection;
  2. await-ing each call eliminates our ability to batch in Akka.Remote and in all of the actors' mailboxes; and
  3. await-ing each request / response pair blocks flow for the entire stream, meaning one request can't start before the previous one completes.

Factor 3 is the most expensive to overcome - for the sake of comparison, if I change

[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    for (var i = 0; i < MsgCount; i++)
        await _sys2Remote.Ask<ShardedMessage>(_messageToSys2);
}

To

[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    var tasks = new List<Task>();
    for (var i = 0; i < MsgCount; i++)
        tasks.Add(_sys2Remote.Ask<ShardedMessage>(_messageToSys2));
    await Task.WhenAll(tasks);
}

The performance profile changes to:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
  [Host]     : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
  DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT

Method MsgCount Mean Error StdDev
SingleRequestResponseToLocalEntity 10000 637.2 ms 8.48 ms 7.51 ms

That looks more like it to me - all of the tasks were completed in the same order but they were just all allowed to start at the same time. If I change this to use Tell we get this entire workload done in about 110ms using Sharding, which is about ~90k msg/s (sounds about right.)


Here's what I don't understand about your benchmark @carl-camilleri-uom - you ran this:

wrk -t48 -c400 -d30s http://instance-1:5000/5

And got

Running 30s test @ http://instance-1:5000/5
48 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 254.21ms 286.27ms 1.02s 78.54%
Req/Sec 119.09 127.71 0.94k 85.22%
73325 requests in 30.03s, 14.55MB read
Socket errors: connect 0, read 0, write 216, timeout 0
Requests/sec: 2441.41
Transfer/sec: 495.91KB

Is wrk forcing all of its HTTP requests to that endpoint to run in sequence or in parallel? Because by the looks of your numbers they line up exactly with what would happen if the requests were run only in sequence. In a real world scenario with multiple concurrent requests for the same entity arriving all at once you'd expect results similar to my second set of benchmark numbers. What can account for the difference there?

@carl-camilleri-uom
Copy link
Author

carl-camilleri-uom commented Aug 22, 2021

@Aaronontheweb thanks for this information.

First of all, apologies for the initial indication that boxing/unboxing was causing the issue - it was a red herring as indeed with your benchmark on HashCodeMessageExtractor I'm not able to replicate any latency.

Secondly, thanks also for the details. provided.

With regards to the workload submitted by wrk, I believe this is parallelised by the number of connections according to the specs at https://github.com/wg/wrk.

Thus wrk -t48 -c400 -d30s http://instance-2:5000/5 should be executing 400 concurrent connections allocated over 48 threads on the benchmark machine.

I have also tried tests using wrk2 (https://github.com/giltene/wrk2) with the same results.

I have been testing some further approaches as follows and the repo at https://github.com/carlcamilleri/benchmark-akka-cluster has been updated as follows:

  1. The same application exposes a new endpoint /batch/{entityId} 100k messages to the entity actor and awaits them all.
    A single request to this endpoint and requesting a remote entity completes within an average of 2.56s which would work out at ~40k req/s and is even in-line with your benchmarks when awaiting a set of Tasks:
    image

  2. The endpoint /{entityId} is now taking another approach whereby the request received on the API is sent to a local Actor Router, which then handles the request to the cluster without explicitly await-ing, although of course the web server would wait for the response from the actor to complete the HTTP request. This still re-produces the issue: I don't get more than 3k req/s

image

I guess therefore my question is whether there is perhaps a better way to approach this problem? Basically, the problem at hand could be described simply as the need to implement a REST API that gives back the details of a business entity that is cached within a sharded cluster.

The API consumers are of course independent and the API is synchronous i.e. the API consumer would need to wait for the response from the API. If I understand well, this is what is introducing the problem (at least with my approach) i.e. where the different (parallel, mutually-exclusive) requests on the API are being handled in a serial manner on the cluster.

Thank you

@carl-camilleri-uom
Copy link
Author

carl-camilleri-uom commented Aug 28, 2021

@to11mtm @Aaronontheweb thanks for the analysis in #5230 . Just to confirm do we expect this to improve performance even in the case of the following approach? :

image

akka_behaviours-Page-2

For reference I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs) which is still performing poorly

Thanks

@to11mtm
Copy link
Member

to11mtm commented Aug 28, 2021

@to11mtm @Aaronontheweb thanks for the analysis in #5230 . Just to confirm do we expect this to improve performance even in the case of the following approach? :

image

akka_behaviours-Page-2

For reference I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs) which is still performing poorly

Thanks

#5320 would theoretically improve performance issues around remote asks. I don't think they would help in the case of the scenario in your benchmark.

Looking at said benchmark however, I would suggest:

  • Using a Round-Robin router instead of a Random Router
  • Using a lot more actors in the round-robin pool (I'd say probably 25-50).

@Aaronontheweb Aaronontheweb added this to the 1.4.36 milestone Mar 18, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.36, 1.4.37, 1.4.38 Apr 14, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.38, 1.4.39 May 25, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.39, 1.4.40 Jun 1, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.40, 1.4.41 Jul 27, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.41, 1.4.42 Sep 7, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.42, 1.4.43, 1.4.44 Sep 23, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.44, 1.4.45, 1.4.46 Oct 17, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.46, 1.4.47 Nov 15, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.47, 1.4.48 Dec 9, 2022
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.48, 1.4.49 Jan 5, 2023
@Aaronontheweb Aaronontheweb modified the milestones: 1.4.49, 1.4.50 Jan 26, 2023
@Aaronontheweb Aaronontheweb removed this from the 1.4.50 milestone Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants