Endpoints marked as not selectable even though Connection is established on AWS ElasticCache clustered mode enabled #2265
Comments
We have discovered a similar problem with ServerEndPoint connections to a Redis Cluster, which may be related to your case. We have a cluster with 3 primary and 3 replica nodes, so 6 ServerEndPoint client instances should be created.
This sequence leaves 5 of the 6 endpoints marked as unselectable. That breaks Redis client functionality, because the client tries to send commands for slots that do not belong to the single Redis node marked as selectable. We then get the "EXECABORT Transaction discarded because of previous errors." error, because the transaction is performed on the wrong cluster node (and redirection is disabled for transactions by default). It is hard to extract working code demonstrating the problem from our sources, but I have tried to explain it in as much detail as possible.
It is possible that the tasks waiting for the ConnectedEstablished state never complete because the server supports subscriptions but the subscription endpoint is not used.
Where Then if subscription is null, meaning it was never created, it will never call And this happens exactly because, as @VladimirKhil says, "ConnectionMultiplexer.ActivateAllServers() having only 1 node in its server snapshot (other 5 nodes are discovered later)"
We suspect that this may be an additional instance of #2251, which we are currently investigating and discussing.
Looking at what is happening here, our hypothesis is that pub/sub simply isn't enabled in this environment; if this is correct, a pragmatic workaround here might be to simply tell the muxer to not try pub/sub - which can be done by adding
Pub/sub is not disabled. Without it Hangfire would stop working. We also use pub/sub from code. My hypothesis is that the tasks do not complete because the bridge for subscription has not been created by the time the interactive bridge is fully established, and that is why the lock monitors are never released.
We use pub/sub too, but not immediately at connection time; we start using it some time later.
I hadn't seen issue #2251 before, but it seems we made exactly the same discovery independently and point to the same place.
i concur - i believe this is the same issue and we indeed came to the same conclusion. we too use pub/sub extensively, so i don't think it points in the missing pub/sub support direction. the issue is specific to a) clustered mode enabled and b) remaining clustered endpoints are discovered and not provided in the connection string. additionally, the issue is a regression in 2.5.* - older driver works just fine in the same environment.
Thanks to @iteplov a fix is on MyGet as 2.6.70 now - if anyone could test this version to make double sure there's nothing extra odd against AWS, I'll get it on NuGet proper. If you're able to test, it'd be hugely appreciated.
We are using ElastiCache on AWS with clustered mode enabled, with two shards and two nodes in each (primary and replica).
When calling
connection = ConnectionMultiplexer.Connect(configurationOptions, Console.Out);
in many cases endpoints are marked as not selectable (DidNotRespond flag) in the second iteration, when endpoints are discovered from the cluster, even though their status is always ConnectedEstablished.
The heartbeat timer will clear that flag, but we need a dirty workaround
to wait for the servers to become selectable. We need the primary endpoint to be connected and selectable as soon as the ConnectionMultiplexer is created; otherwise we get errors like "Command cannot be issued to a replica: "
Waiting less than a second clears that flag and everything starts to work.
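The workaround code itself is not shown here. As a rough illustration only, a bounded wait of this kind could be sketched as a generic polling helper. The names `StartupWait` and `WaitUntil` and the probe delegate are hypothetical and are not part of StackExchange.Redis; this is a sketch under those assumptions, not the workaround actually used.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not a StackExchange.Redis API): polls a caller-supplied
// probe until it reports success or the timeout elapses.
public static class StartupWait
{
    public static bool WaitUntil(Func<bool> probe, TimeSpan timeout, TimeSpan pollInterval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                // The probe decides what "ready" means, e.g. a test command succeeding.
                if (probe()) return true;
            }
            catch (Exception)
            {
                // Assumption: any exception from the probe means "not ready yet".
            }

            // Give up once the time budget is spent.
            if (stopwatch.Elapsed >= timeout) return false;

            Thread.Sleep(pollInterval);
        }
    }
}
```

With StackExchange.Redis, the probe might wrap a small write such as `IDatabase.StringSet` on a throwaway key and return true when it does not throw. In a cluster each key maps to a single slot, so one such probe only exercises the primary that owns that slot.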
Is there any reason to mark endpoints as not selectable when the connection is already established?
and then clear it OnHeartBeat:
Interestingly, increasing connectTimeout does not help at all.
Here are the logs: