[PERF] Cluster Sharding perf issue #5203
Comments
The local actor - is this actor hosted by the sharding system or is just a local actor floating in-memory? |
Need to differentiate where the bottleneck is:
If it's the remoting system, there are things you can do to adjust, but ultimately we have to replace our transport system (it's the lion's share of the v1.5 roadmap). If it's the sharding system, we can patch that too - I opened a similar issue on the entity spawning side last week: #5190. Can you give us some color here to inform these numbers? We'll take a look at the source too. |
Thanks, in reply to:
Correct it's hosted by the sharding system, in both cases the REST endpoint performs this call:
So the "local" test is actually a call to the REST endpoint hosted on the machine where the sharding system has spawned the actor. Happy to provide further details or tests that can help. Thanks |
My working theory on this is, assuming you're getting both numbers from the same sample:
|
For running this test against a remote sharded actor in your scenario, do you just run two processes and target an entity ID not hosted on the HTTP target node? |
Correct, basically the setup is as follows: From benchmark-machine, So hitting the endpoint http://instance-2:5000/5 constitutes a call to the local actor, whereas http://instance-1:5000/5 constitutes a call to the remote actor |
Thanks! I'll take a look at this and at the very least, resolve #3083 |
We've been able to reproduce the 2.4k msg/s figure exactly in the benchmarks we created on #5209 I'm working on using Phobos to do some end-to-end tracing on the shard routing system to see where the most time is piling up. Did a read of the code last night and nothing obvious jumps out at me, but there is a drastic difference in remote vs. local shard performance that is clearly visible and consistent across different hardware profiles. |
I think I know where to look now - it looks like there's a combination of:
I think I have enough data to go off of - I'll get started on improving the figures in our benchmark. |
Having done some checks I believe the perf. bottleneck is coming from the amount of boxing/unboxing that is performed by the The attached ZIP contains: Without unboxing, the benchmark Simply adding boxing/unboxing drops perf to ~4s/op: In comparison, on the same machine, the benchmark Possibly an approach could be a generic version of Thanks |
@carl-camilleri-uom that's great work. Explains why this issue is unique to sharding. I also noticed we're doing some things like converting the Adding a generic We can also add a benchmark for the |
The changes you included in your patch file - I can't get them to run.
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation.
---> System.ArgumentNullException: The message cannot be null. (Parameter 'message')
at Akka.Actor.Envelope..ctor(Object message, IActorRef sender, ActorSystem system) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Message.cs:line 28
at Akka.Actor.ActorCell.SendMessage(IActorRef sender, Object message) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\ActorCell.cs:line 418
at Akka.Actor.Futures.Ask[T](ICanTell self, Func`2 messageFactory, Nullable`1 timeout, CancellationToken cancellationToken) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 143
at Akka.Actor.Futures.Ask[T](ICanTell self, Object message, Nullable`1 timeout, CancellationToken cancellationToken) in D:\Repositories\olympus\akka.net\src\core\Akka\Actor\Futures.cs:line 105
at Akka.Cluster.Benchmarks.Sharding.ShardMessageRoutingBenchmarks.SingleRequestResponseToRemoteEntity() in D:\Repositories\olympus\akka.net\src\benchmark\Akka.Cluster.Benchmarks\Sharding\ShardMessageRoutingBenchmarks.cs:line 112
at BenchmarkDotNet.Autogenerated.Runnable_1.__Workload()
at BenchmarkDotNet.Autogenerated.Runnable_1.WorkloadActionUnroll(Int64 invokeCount)
at BenchmarkDotNet.Engines.Engine.RunIteration(IterationData data)
at BenchmarkDotNet.Engines.EngineFactory.Jit(Engine engine, Int32 jitIndex, Int32 invokeCount, Int32 unrollFactor)
at BenchmarkDotNet.Engines.EngineFactory.CreateReadyToRun(EngineParameters engineParameters)
at BenchmarkDotNet.Autogenerated.Runnable_1.Run(BenchmarkCase benchmarkCase, IHost host)
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor, Boolean wrapExceptions)
at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
at BenchmarkDotNet.Toolchains.InProcess.Emit.Implementation.RunnableProgram.Run(BenchmarkId benchmarkId, Assembly partitionAssembly, BenchmarkCase benchmarkCase, IHost host)
ExitCode != 0 and no results reported
No more Benchmark runs will be launched as NO measurements were obtained from the previous run!
Could you send them as a PR instead? |
So this is really an Akka.Remote performance issue. I replicated the actor hierarchy in its essence here: https://github.com/Aaronontheweb/RemotingBenchmark
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
We can shave off maybe 25-30% of the performance overhead by improving the sharding system's efficiency of message handling, but it still comes down to Akka.Remote. What's interesting here is that the benchmark you designed is an absolute worst case performance scenario for Akka.Remote: this use case is pretty interesting too because:
Factor 3 is the most expensive to overcome - for the sake of comparison, if I change
[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    for (var i = 0; i < MsgCount; i++)
        await _sys2Remote.Ask<ShardedMessage>(_messageToSys2);
}
To
[Benchmark]
public async Task SingleRequestResponseToRemoteEntity()
{
    var tasks = new List<Task>();
    for (var i = 0; i < MsgCount; i++)
        tasks.Add(_sys2Remote.Ask<ShardedMessage>(_messageToSys2));
    await Task.WhenAll(tasks);
}
The performance profile changes to:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19041.1165 (2004/May2020Update/20H1)
AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.302
[Host] : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
DefaultJob : .NET Core 3.1.17 (CoreCLR 4.700.21.31506, CoreFX 4.700.21.31502), X64 RyuJIT
That looks more like it to me - all of the tasks were completed in the same order but they were just all allowed to start at the same time. If I change this to use Here's what I don't understand about your benchmark @carl-camilleri-uom - you ran this:
And got
Is |
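The difference between the two benchmark variants above (awaiting each Ask in turn versus starting them all and awaiting `Task.WhenAll`) can be reproduced without Akka.NET at all. The following is a minimal sketch under stated assumptions: `FakeAsk` is a hypothetical stand-in that simply waits for a fixed delay to imitate a network round trip, and the class and method names are illustrative and do not come from the benchmark project. It is not a measurement of Akka.Remote.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

public static class AskPatterns
{
    // Hypothetical stand-in for a remote Ask: completes after a fixed delay
    // and echoes the id back. No actor system is involved.
    public static async Task<int> FakeAsk(int id, int latencyMs)
    {
        await Task.Delay(latencyMs);
        return id;
    }

    // Each request starts only after the previous reply has arrived,
    // so total time grows roughly with count * latency.
    public static async Task<int> Sequential(int count, int latencyMs)
    {
        var completed = 0;
        for (var i = 0; i < count; i++)
        {
            await FakeAsk(i, latencyMs);
            completed++;
        }
        return completed;
    }

    // All requests are started up front and awaited together,
    // so the round trips overlap instead of queuing behind each other.
    public static async Task<int> Concurrent(int count, int latencyMs)
    {
        var tasks = new List<Task<int>>(count);
        for (var i = 0; i < count; i++)
            tasks.Add(FakeAsk(i, latencyMs));
        var results = await Task.WhenAll(tasks);
        return results.Length;
    }

    public static async Task Main()
    {
        var sw = Stopwatch.StartNew();
        await Sequential(20, 10);
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        await Concurrent(20, 10);
        Console.WriteLine($"concurrent: {sw.ElapsedMilliseconds} ms");
    }
}
```

In the sequential variant every request pays the full round-trip latency before the next one is sent, which is why a per-request request/response loop tends to look like a worst case for any remote transport.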
@Aaronontheweb thanks for this information. First of all, apologies for the initial indication that boxing/unboxing was causing the issue - it was a red herring, as indeed with your benchmark on HashCodeMessageExtractor I'm not able to replicate any latency. Secondly, thanks also for the details provided. With regards to the workload submitted by Thus I have also tried tests using I have been testing some further approaches as follows and the repo at https://github.com/carlcamilleri/benchmark-akka-cluster has been updated as follows:
I guess therefore my question is whether there is perhaps a better way to approach this problem? Basically, the problem at hand could be described simply as the need to implement a REST API that gives back the details of a business entity that is cached within a sharded cluster. The API consumers are of course independent and the API is synchronous i.e. the API consumer would need to wait for the response from the API. If I understand well, this is what is introducing the problem (at least with my approach) i.e. where the different (parallel, mutually-exclusive) requests on the API are being handled in a serial manner on the cluster. Thank you |
@to11mtm @Aaronontheweb thanks for the analysis in #5230. Just to confirm, do we expect this to improve performance even in the case of the following approach? For reference, I have benchmarked this scenario (https://github.com/carlcamilleri/benchmark-akka-cluster/blob/master/Startup.cs), which is still performing poorly. Thanks |
#5320 would theoretically improve performance issues around remote asks. I don't think they would help in the case of the scenario in your benchmark. Looking at said benchmark however, I would suggest:
|
Version Information
Version of Akka.NET? 1.4.23
Which Akka.NET Modules? Akka.Cluster.Sharding
Describe the performance issue
A minimum viable repo that reproduces the issue is at https://github.com/carlcamilleri/benchmark-akka-cluster
Running two n2-standard-8 nodes in GCP (8 CPU @ 2.80 GHz and 32GB RAM) with Windows Server 2019 ("instance-1" and "instance-2"), and a third machine to run the benchmarks from
First check:
curl http://instance-1:5000/5
Response:
akka://ping-pong-cluster-system/system/sharding/PingPongActor/13/5(pid=2372,hostname=instance-2)
Therefore entity id 5 actor is hosted on instance-2 server
wrk -t48 -c400 -d30s http://instance-2:5000/5
wrk -t48 -c400 -d30s http://instance-1:5000/5
For interest, I've also repeated test (1) (i.e. the workload on the endpoint which requests the local actor) but with serialize-messages = on, and the result is:
So Hyperion serialisation drops throughput from >111k to >85k, which is probably expected
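For readers who want to repeat that check, this is a minimal sketch of where the setting normally sits in HOCON, assuming the standard `akka.actor` section. The actual configuration file used in the benchmark repository is not shown in this thread, so the surrounding structure here is an assumption.

```hocon
akka {
  actor {
    # Forces every message, including messages between actors in the same
    # process, through the configured serializer. Meant for testing only.
    serialize-messages = on
  }
}
```

With this switched on, a local Ask pays the serialization cost but still skips the network, which helps separate serializer overhead from transport overhead.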
Data and Specs
ASKing a local actor I get >111k req/s throughput, but ASKing a remote actor drops throughput to 2.4k req/s.
Expected behavior
Cross-machine communication in Cluster Sharding is expected to be faster.
Actual behavior
Cross-Machine communication in Cluster Sharding seems to be extremely slow and unusable for my use case (an OLTP workload)
Environment
.NET 5.0
Windows Server 2019
n2-standard-8 machine in GCP (8 CPU @ 2.80 GHz and 32GB RAM)