Creating a cluster router before the creating node has joined the cluster results in a broken router #1062
This is not a bug. It's an explicit design choice.
The router is meant to be a transparent message distribution point for other actors - it's inherently racy as a result of the high-throughput, mailbox-less design of routers themselves. If you create a clustered router and immediately send a message to it before it receives any gossip messages, it won't have any routees. Clustered routers are fundamentally different from every other type of router because they pick up their routees over time, as a result of changes in the network. Unlike local or Akka.Remote routers, they can't create their routees at start-time because clustered routers have to wait for gossip information before they know who they can route to.
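For reference, a cluster-aware pool router is typically declared through deployment configuration along these lines (a minimal sketch; the actor name, instance counts, and the `Worker` type are placeholders, not the setup from this issue):

```csharp
// HOCON deployment section (shown in comments) - the router starts with zero routees
// and only gains them as cluster gossip reports eligible member nodes:
//
//   akka.actor.deployment {
//     /workerRouter {
//       router = round-robin-pool
//       nr-of-instances = 10
//       cluster {
//         enabled = on
//         max-nr-of-instances-per-node = 2
//         allow-local-routees = off
//       }
//     }
//   }
using Akka.Actor;
using Akka.Routing;

// `system` is an ActorSystem configured with Akka.Cluster; `Worker` is a placeholder actor type.
var router = system.ActorOf(
    Props.Create<Worker>().WithRouter(FromConfig.Instance),
    "workerRouter");
```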
Why is that weird? If your node doing the routing is up before any of the routees are up, what should the router do? Block on startup (bad)? Add a mailbox and queue messages (run out of memory, and still never deliver anything if a routee node never comes online)? Routers should stay dumb. We shouldn't over-engineer them to cover use cases where you have to be aware that you're using clustering, because ultimately the problem you're dealing with is a race condition that is entirely in the hands of the end-user to control.

I posted a solution for dealing with this issue in a straightforward way in the Gitter chat late last week, and we recently added a sample inside WebCrawler. I agree that this can surprise developers - it has certainly caught me by surprise before - but this is intentional behavior.
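One straightforward way to sidestep the race (a sketch using the same placeholder names as above, not necessarily the exact solution from the Gitter chat or the WebCrawler sample) is to delay creating the router until the node has actually joined:

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Routing;

var cluster = Cluster.Get(system);   // `system` is this node's ActorSystem

cluster.RegisterOnMemberUp(() =>
{
    // This callback only runs once this member has reached Up status, so the
    // router is created after the node has joined and can receive gossip.
    var router = system.ActorOf(
        Props.Create<Worker>().WithRouter(FromConfig.Instance),
        "workerRouter");
});
```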
That is not the behavior I'm describing here. Even if you send it messages after it joins the cluster, nothing goes through.
Which they don't if the router is created too early.
Ah crap, I misread this
Ok, if that's the case then that's a bug. Do you have some code that can reproduce that? I've not seen this one in the wild.
Yeah, I've had nodes start up in bunches at random times and never ran into this problem - and I don't see how this can even happen, given how clustered routers subscribe to gossip events. Can you reproduce this in one of the multi-node tests? We have had issues with those lately for routers.
Ok, the problem is apparently not that the router is created before the node joins the cluster. But when I start all the applications at the same time, about 1 in 10 times the node creating the router gets an error message and is then unable to use the router afterwards. We can also see that the node creating the router has joined the cluster and does see all the other nodes, since it prints the cluster events to the console. (Screenshots: Happy path / Sad Panda)
@rogeralsing can you confirm that the failing node was able to join the cluster in the Sad Panda scenario?
Yes, see the "Member Up" events in the console, that window is the node creating the router. |
Could you try a GetRoutees on the router and see what comes back?
There are two routees, c0 and c1, both pointing to the worker node (there is a 2-per-node setting in the config).
Ok, so in that case the router has routees. Why else wouldn't the messages go through? Can you try sending an Identify message to the router and see if any of the routees reply back? Or if a dead letter gets logged on the remote-deploy target?
Hmm. On the happy path, when sending an Identify to each routee, they reply back with a correct ActorIdentity. On Sad Panda, I get the routees from GetRoutees, but none of them responds to Identify.
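For reference, roughly what that diagnostic looks like in code (a sketch to be run inside an async method; `router` is the clustered router's IActorRef, and the timeout value is an arbitrary choice):

```csharp
using System;
using System.Linq;
using Akka.Actor;
using Akka.Routing;

var timeout = TimeSpan.FromSeconds(3);

// Ask the router which routees it currently knows about.
var routees = await router.Ask<Routees>(GetRoutees.Instance, timeout);
Console.WriteLine($"Router reports {routees.Members.Count()} routee(s)");

// Send an Identify through the router; a healthy routee answers with an
// ActorIdentity carrying its IActorRef. In the Sad Panda case described above,
// no routee answers and this Ask times out.
var identity = await router.Ask<ActorIdentity>(new Identify("probe"), timeout);
Console.WriteLine($"Identify answered by: {identity.Subject}");
```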
This could be an issue with the remote deployment in Akka.Cluster then - there's a chance that those remotely deployed actors weren't correctly reaped / restarted during the initial failed connection. Cluster deathwatch does work a bit differently than Akka.Remote deathwatch, so I would look there. One thing you can do is have a local actor on the Worker node send a message to the other actors who've been remotely deployed onto it, using a wildcard actor selection. Give that a shot and see if they're alive.
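A rough sketch of that liveness check (the wildcard path is an assumption and depends on the creator system's name and address; adjust it to the actual remote-deployment path on the Worker node):

```csharp
using System;
using Akka.Actor;

// Runs on the Worker node. Remotely deployed actors live under this node's
// /remote guardian, e.g. /remote/akka.tcp/CreatorSystem@host:port/user/workerRouter/c0
public class LivenessProbe : ReceiveActor
{
    public LivenessProbe()
    {
        Context.ActorSelection("/remote/*/*/user/*/*")
               .Tell(new Identify("liveness"), Self);

        // Any ActorIdentity replies mean the deployed workers actually exist here;
        // silence means they were never started on this node.
        Receive<ActorIdentity>(id => Console.WriteLine($"Alive: {id.Subject}"));
    }
}
```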
Yep, I've just verified that the workers are not started on the worker node in the failing scenario. On the happy path, that message appears twice, as there are two workers per node.
Awesome! Found the bug! $5 says it's actually an issue with the cluster deathwatch implementation: the deploying side not knowing that the deployment target killed off / never deployed those actors.
I've traced the problem to the node creating the router. If we are in Sad Panda mode and I then start yet another creator node, that new node is able to communicate with both worker nodes.
There seems to be some sort of connection problem to and from the creator node here. This all seems weird, since clearly heartbeats are going through in both directions.
Have we confirmed that this is actually caused by #1071? |
This is unrelated to that PR. The PR is for the issue that was raised in the Gitter chat a few days ago, but it's unrelated to this, as the bug here doesn't even do a real deploy to the remote system.
If you try to create a cluster-aware router before the node that is creating it has managed to join its seed node, the router will be permanently broken, even after the node manages to join the cluster.
This results in racy systems if all your nodes are started at about the same time.
I'm marking this as a bug, as I believe the router is intended to adapt even under those conditions.
It would be very weird if you had to add initialization code that waits until the node manages to join before creating your actors or routers.
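For reference, a rough sketch of the startup ordering that triggers the race (the system name, config, and the `Worker`/`WorkMessage` types are placeholders, not the actual code behind this report):

```csharp
using Akka.Actor;
using Akka.Routing;

// `config` contains the Akka.Cluster settings, including the seed-nodes list.
var system = ActorSystem.Create("ClusterSystem", config);

// Created immediately - the handshake with the seed node may still be in flight here.
var router = system.ActorOf(
    Props.Create<Worker>().WithRouter(FromConfig.Instance),
    "workerRouter");

// Per this report, if the join had not completed when the router was created,
// messages sent to it are lost even after the node later shows up as a cluster member.
router.Tell(new WorkMessage());
```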