
Creating a cluster router before the creating node has joined the cluster results in a broken router #1062

Closed
rogeralsing opened this issue Jun 15, 2015 · 21 comments


@rogeralsing (Contributor)

If you try to create a cluster-aware router before the node creating it has managed to join its seed node, the router is permanently broken, even after the node manages to join the cluster.
This results in racy systems if all of your nodes are started at about the same time.

I'm marking this as a bug because I believe the router is intended to adapt even under those conditions.
It would be very weird if you needed to add initialization code that waits until the node has joined before creating your actors or routers.
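
Roughly the setup in question (a minimal sketch, not the actual failing code; the system name, router name, addresses, and the Worker actor are all illustrative):

```csharp
using Akka.Actor;
using Akka.Configuration;
using Akka.Routing;

var config = ConfigurationFactory.ParseString(@"
akka {
  actor {
    provider = ""Akka.Cluster.ClusterActorRefProvider, Akka.Cluster""
    deployment {
      /workerRouter {
        router = round-robin-pool
        nr-of-instances = 8
        cluster {
          enabled = on
          max-nr-of-instances-per-node = 2
          allow-local-routees = off
        }
      }
    }
  }
  remote.helios.tcp {
    hostname = ""127.0.0.1""
    port = 0
  }
  cluster.seed-nodes = [""akka.tcp://MySystem@127.0.0.1:4053""]
}");

var system = ActorSystem.Create("MySystem", config);

// The router is created immediately at startup - possibly before this node
// has managed to join its seed node, which is the race described above.
var router = system.ActorOf(
    Props.Create<Worker>().WithRouter(FromConfig.Instance),
    "workerRouter");

router.Tell("some work");

public class Worker : ReceiveActor
{
    public Worker() => ReceiveAny(_ => Sender.Tell("ack"));
}
```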

@Aaronontheweb (Member)

This is not a bug. It's an explicit design choice.

> I'm marking this as a bug because I believe the router is intended to adapt even under those conditions.

The router is meant to be a transparent message distribution point for other actors - it's inherently racy as a result of the high-throughput, mailbox-less design of routers themselves. If you create a clustered router and immediately send a message to it before it receives any gossip messages, it won't have any routees.

Clustered routers are fundamentally different from every other type of router because they pick up their routees over time, as a result of changes in the network. Unlike local or Akka.Remote routers, they can't create their routees at start-time because clustered routers have to wait for gossip information before they know who they can route to.

> It would be very weird if you needed to add initialization code that waits until the node has joined before creating your actors or routers.

Why is that weird? If the node doing the routing is up before any of the routees are up, what should the router do? Block on startup (bad)? Add a mailbox and queue messages (and risk running out of memory while still not delivering anything if a routee node never comes online)?

Routers should stay dumb. We shouldn't over-engineer them to avoid some use cases where you have to be aware that you're using clustering, because ultimately the problem you're dealing with is a race condition that is entirely in the end user's hands to control.

I posted a solution on how to deal with this issue in a straightforward way in Gitter chat late last week, and we have a sample inside WebCrawler that we recently added.
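
That exact snippet isn't reproduced here, but the general shape is something like the sketch below: defer creating the router until the cluster reports this node as Up. (A rough sketch only; the names and the use of Cluster.RegisterOnMemberUp are my shorthand here, not the sample's literal code.)

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Routing;

var system = ActorSystem.Create("MySystem", config); // same illustrative config as above

// Don't create the clustered router until this node's own member status is Up;
// by then gossip is flowing and the router can discover its routees.
Cluster.Get(system).RegisterOnMemberUp(() =>
{
    var router = system.ActorOf(
        Props.Create<Worker>().WithRouter(FromConfig.Instance),
        "workerRouter");

    router.Tell("start work");
});
```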

I agree that this can surprise developers - it certainly has caught me by surprise before, but this is intentional behavior.

@rogeralsing (Contributor, Author)

> immediately send a message to it before it receives any gossip messages, it won't have any routees.

That is not the behavior I'm describing here.
The router is completely dead, even after the node has joined the cluster.

So even if you send it messages after it joins the cluster, nothing goes through.

> because they pick up their routees over time

Which they don't, if the router is created too early.

@Aaronontheweb (Member)

Ah crap, I misread this:

> If you try to create a cluster-aware router before the node creating it has managed to join its seed node, the router is permanently broken, even after the node manages to join the cluster.
> This results in racy systems if all of your nodes are started at about the same time.

Ok, if that's the case then that's a bug. Do you have some code that can reproduce that? I've not seen this one in the wild.

@Aaronontheweb (Member)

> Which they don't, if the router is created too early.

Yeah, I've had nodes start up in bunches at random times and never ran into this problem, and I don't see how this can even happen given how clustered routers subscribe to gossip events. Can you reproduce this in one of the multi-node tests? We have had issues with those for routers lately.

@rogeralsing (Contributor, Author)

OK, the problem is apparently not that the router is created before the node joins the cluster.
I've explicitly tried starting the failing node first, waiting, and then starting the others, and it all recovers nicely under those conditions.

But when I start all the applications at the same time, about 1 in 10 times the node creating the router gets an error message and is then unable to use the router after that.
See the images below; purple shows routed messages going through to the worker.
(And as seen in the worker's logs, the time elapsed is well above any gating period.)

We can also see that the node creating the router has joined the cluster and does see all the other nodes, as it prints the cluster events to the console.

Happy path

[screenshot: working]

Sad panda

[screenshot: broken]

@Aaronontheweb (Member)

@rogeralsing can you confirm that the failing node was able to join the cluster in the Sad Panda scenario?

@rogeralsing (Contributor, Author)

Yes, see the "Member Up" events in the console; that window is the node creating the router.

@rogeralsing (Contributor, Author)

Here it is with full logging on.
Everything looks fine IMO, but no router messages are going through to the worker.
I can only replicate this when starting all the nodes at the same time.

[screenshot: broke2]

@Aaronontheweb (Member)

Could you try a GetRoutees message on the router and log the results?
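
Something along these lines (a rough sketch, assuming an async context and the router's IActorRef in hand; the timeout is arbitrary):

```csharp
using System;
using Akka.Actor;
using Akka.Routing;

// Ask the router for its current routee set and log what comes back.
var routees = await router.Ask<Routees>(GetRoutees.Instance, TimeSpan.FromSeconds(3));
foreach (var routee in routees.Members)
{
    Console.WriteLine("Routee: {0}", routee);
}
```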

@rogeralsing (Contributor, Author)

[screenshot: getroutees]

@rogeralsing (Contributor, Author)

There are two routees, c0 and c1, both pointing to the worker node (there is a 2-per-node setting in the config).
So that part looks correct.

@Aaronontheweb (Member)

OK, so in that case the router has routees. Why, then, wouldn't the messages go through? Can you try sending an Identify message to the router and see if any of the routees reply back? Or check whether a dead letter gets logged on the remote-deploy target?
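
For example (again just a sketch, assuming an async context; the correlation id is arbitrary):

```csharp
using System;
using Akka.Actor;

// A live routee answers Identify with an ActorIdentity whose Subject is its IActorRef.
// If nothing behind the router is alive, the Ask simply times out.
var identity = await router.Ask<ActorIdentity>(new Identify("probe-1"), TimeSpan.FromSeconds(3));
Console.WriteLine("Routee replied: {0}", identity.Subject);
```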

@rogeralsing (Contributor, Author)

Hmm, on the happy path, when sending an Identify to each routee, they reply back with a correct ActorIdentity.

On Sad Panda, I get the routees from GetRoutees, but none of them responds to Identify.

@Aaronontheweb (Member)

This could be an issue with the remote deployment in Akka.Cluster then - there's a chance that those remotely deployed actors weren't correctly reaped / restarted during the initial failed connection. Cluster Deathwatch does work a bit differently than Akka.Remote deathwatch. I would look there.

One thing you can do is have a local actor on the worker node send a message, using a wildcard actor selection, to the actors that have been remotely deployed onto it. Give that a shot and see if they're alive.
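
Roughly like this, run on the worker node (a sketch only; remotely deployed actors live under the /remote guardian, and the path mirrors the deploying node's system name, host, port, and router name, so the whole selection string below is a placeholder):

```csharp
using System;
using Akka.Actor;

// Probe whatever the creator node remotely deployed onto this worker.
var identity = await workerSystem
    .ActorSelection("/remote/akka.tcp/CreatorSystem@127.0.0.1:8090/user/workerRouter/*")
    .Ask<ActorIdentity>(new Identify("liveness-check"), TimeSpan.FromSeconds(3));

Console.WriteLine("Found a live deployed actor: {0}", identity.Subject);
```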

@rogeralsing (Contributor, Author)

Yep.

I've just verified that the workers are not started on the worker node in the failing scenario.
I added a Console.WriteLine to the actor's constructor.

In the happy path, that message appears twice, as there are two workers per node.
In sad panda, the message does not show up at all, despite the router having two routees.
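
(The diagnostic is just something like this; the rest of the Worker is a guess, only the constructor log matters:)

```csharp
using System;
using Akka.Actor;

public class Worker : ReceiveActor
{
    public Worker()
    {
        // Diagnostic only: prints once per routee actually constructed on this node.
        Console.WriteLine("Worker constructed at {0}", Self.Path);

        ReceiveAny(_ => Sender.Tell("done"));
    }
}
```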

@Aaronontheweb (Member)

Awesome! Found the bug!

$5 says that it's actually an issue with the cluster DeathWatch implementation and the deploying side not knowing that the deployment target killed off, or never deployed, those actors.

@rogeralsing (Contributor, Author)

To make things even stranger:
I forgot to terminate the systems overnight.
See the log timestamps.

Some 10-ish hours later, the routees did come alive, only to die soon after that.

[screenshot: strange]

@rogeralsing (Contributor, Author)

I've traced the problem to the node creating the router.

If we are in Sad Panda mode and then spin up another worker node, no messages are passed on to that worker either, even though the creating node clearly sees the cluster events for the new worker.

If I then start yet another creating node, that node is able to communicate with both worker nodes.

@rogeralsing (Contributor, Author)

There seems to be some sort of connection problem to and from the creator node here.
In Sad Panda, if the worker node tries to spin up an actor on the creator node, the deployment succeeds and the actor is started; however, no messages go through from the worker to the first node.

This all seems weird, since heartbeats are clearly going through in both directions.

@Aaronontheweb (Member)

Have we confirmed that this is actually caused by #1071?

@rogeralsing (Contributor, Author)

This is unrelated to that PR.

The PR is for the issue that was raised in the Gitter chat a few days ago:
if you call Stop on a remotely deployed actor, it's impossible to redeploy using the same name, because the old actor is not removed from the remote daemon.

But that's unrelated to this issue, since the bug here doesn't even do a real deploy to the remote system.

Horusiath pushed a commit to Horusiath/akka.net that referenced this issue Jul 4, 2015