Endpoints marked as not selectable even though Connection is established on AWS ElasticCache clustered mode enabled #2265

adomass · 2022-10-06T17:33:46Z

We are using ElasticCache on AWS with clustered mode enabled with two shards and two nodes in each (primary and replica).
When calling
connection = ConnectionMultiplexer.Connect(configurationOptions, Console.Out);
in many cases endpoints are marked as not selectable (DidNotRespond flag) in second iteration when endpoints are discovered from cluster even though their status is always ConnectedEstablished.
The heartbeat timer will clear that flag, but we need to do a dirty workaround


var attemptsLeft = 5;
var delay = TimeSpan.FromSeconds(1);
do
{
   var status = connection.GetStatus();

   if (!status.Contains("not in use"))
     break;

logger.LogWarning("Not all servers  selectable. Waiting for {Delay} for connection. Attempts {AttemptsLeft} left", delay, attemptsLeft);
Thread.Sleep(delay);
} while (attemptsLeft-- > 0);

to wait for servers to be selectable. We need primary endpoint to be connected and selectable as soon as ConnectionMultiplexer is created otherwise we will get errors like "Command cannot be issued to a replica: "

StackExchange.Redis.RedisConnectionException: ProtocolFailure on redishangfire.barnehage.testaws.visma.com:6379/Interactive, Writing/ReadAsync, last: HMSET, origin: WriteMessageToServerInsideWriteLock, outstanding: 1, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 10 of 10 available, in: 0, last-heartbeat: never, last-mbeat: 5s ago, global: 5s ago, v: 2.6.66.47313
StackExchange.Redis.RedisCommandException: Command cannot be issued to a replica: HMSET {hangfire}:recurring-job:IDocumentJobService.CreateDefaultTemplatesAsync
at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1483

Waiting less than a second clears that flag and everything start to work.
Is there any reason behind to mark endpoints as not selectable when connection is already established?

else
{
   server.SetUnselectable(UnselectableFlags.DidNotRespond);
   log?.WriteLine($"  {Format.ToString(server)}: Did not respond");
}

and then clear it OnHeartBeat:

  case (int)State.ConnectedEstablishing:
  case (int)State.ConnectedEstablished:
      var tmp = physical;
      if (tmp != null)
      {
          if (state == (int)State.ConnectedEstablished)
          {
             Interlocked.Exchange(ref connectTimeoutRetryCount, 0);
             tmp.BridgeCouldBeNull?.ServerEndPoint?.ClearUnselectable(UnselectableFlags.DidNotRespond);
          }

Interesting part that increasing connectTimeout does not help at all.

Here are the logs:

16:07:23.6432: Connecting (sync) on .NET 6.0.9 (StackExchange.Redis: v2.6.66.47313)
16:07:23.6599: redishangfire.my.domain.com:6379,password=*****,ssl=True,sslProtocols=Tls,abortConnect=False
16:07:23.6617: redishangfire.my.domain.com:6379/Interactive: Connecting...
16:07:23.6694: redishangfire.my.domain.com:6379: BeginConnectAsync
16:07:23.6712: 1 unique nodes specified (with tiebreaker)
16:07:23.6714: redishangfire.my.domain.com:6379: OnConnectedAsync init (State=Connecting)
16:07:23.6717: Allowing 1 endpoint(s) 00:00:05 to respond...
16:07:23.6723: Awaiting 1 available task completion(s) for 5000ms, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=2,Free=32765,Min=2,Max=32767), POOL: (Threads=4,QueuedItems=0,CompletedItems=332)
16:07:23.6782: Configuring TLS
16:07:24.0147: TLS connection established successfully using protocol: Tls
16:07:24.0546: redishangfire.my.domain.com:6379/Interactive: Connected 
16:07:24.0553: redishangfire.my.domain.com:6379: Server handshake
16:07:24.0556: redishangfire.my.domain.com:6379: Authenticating (password)
16:07:24.0645: redishangfire.my.domain.com:6379: Setting client name: ip-10-48-101-89(SE.Redis-v2.6.66.47313)
16:07:24.0662: redishangfire.my.domain.com:6379: Auto-configuring...
16:07:24.0673: redishangfire.my.domain.com:6379: Requesting tie-break (Key=__Booksleeve_TieBreak)...
16:07:24.0676: redishangfire.my.domain.com:6379/Interactive: Writing: GET __Booksleeve_TieBreak
16:07:24.0679: redishangfire.my.domain.com:6379: Sending critical tracer (handshake): ECHO
16:07:24.0679: redishangfire.my.domain.com:6379/Interactive: Writing: ECHO
16:07:24.0681: redishangfire.my.domain.com:6379: Flushing outbound buffer
16:07:24.0681: redishangfire.my.domain.com:6379: OnEstablishingAsync complete
16:07:24.0682: redishangfire.my.domain.com:6379: Starting read
16:07:24.0710: redishangfire.my.domain.com:6379: Auto-configured (INFO) role: replica
16:07:24.0713: redishangfire.my.domain.com:6379: Auto-configured (INFO) version: 6.2.6
16:07:24.0718: redishangfire.my.domain.com:6379: Auto-configured (INFO) server-type: cluster
16:07:24.0756: Response from redishangfire.my.domain.com:6379/Interactive / GET __Booksleeve_TieBreak: Error: MOVED 6202 rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379
16:07:24.0758: Response from redishangfire.my.domain.com:6379/Interactive / ECHO: BulkString: 16 bytes
16:07:24.0759: redishangfire.my.domain.com:6379: OnConnectedAsync completed (From command: ECHO)
16:07:24.0769: All 1 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=6,Free=32761,Min=2,Max=32767), POOL: (Threads=8,QueuedItems=1,CompletedItems=1016)
16:07:24.0774: Endpoint summary:
16:07:24.0775:   redishangfire.my.domain.com:6379: Endpoint is ConnectedEstablished
16:07:24.0775:   redishangfire.my.domain.com:6379: Returned with success as Cluster replica (Source: From command: ECHO)
16:07:24.2786: rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: OnConnectedAsync init (State=Connecting)
16:07:24.2788: rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: OnConnectedAsync init (State=ConnectedEstablishing)
16:07:24.2789: rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: OnConnectedAsync init (State=Connecting)
16:07:24.2790: rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: OnConnectedAsync init (State=Connecting)
16:07:24.2791: Allowing 4 endpoint(s) 00:00:04.3930000 to respond...
16:07:24.2791: Awaiting 4 available task completion(s) for 4393ms, IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=3,Free=32764,Min=2,Max=32767), POOL: (Threads=8,QueuedItems=3,CompletedItems=1026)
16:07:28.6746: Not all available tasks completed cleanly (from ReconfigureAsync#1284, timeout 4393ms), IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=2,Free=32765,Min=2,Max=32767), POOL: (Threads=8,QueuedItems=0,CompletedItems=1079)
16:07:28.6754:   Server[0] (rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379) Status: WaitingForActivation (inst: 11, qs: 0, in: 0, qu: 11, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
16:07:28.6755:   Server[1] (rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379) Status: WaitingForActivation (inst: 11, qs: 0, in: 0, qu: 11, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
16:07:28.6755:   Server[2] (rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379) Status: WaitingForActivation (inst: 12, qs: 0, in: 0, qu: 12, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
16:07:28.6755:   Server[3] (rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379) Status: WaitingForActivation (inst: 12, qs: 0, in: 0, qu: 12, aw: False, in-pipe: -1, out-pipe: -1, bw: Inactive, rs: ReadAsync. ws: Idle)
16:07:28.6755: Endpoint summary:
16:07:28.6755:   rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Endpoint is ConnectedEstablished
16:07:28.6756:   rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Endpoint is ConnectedEstablished
16:07:28.6756:   rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Endpoint is ConnectedEstablished
16:07:28.6756:   rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Endpoint is ConnectedEstablished
16:07:28.6756:   rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Did not respond
16:07:28.6756:   rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Did not respond
16:07:28.6756:   rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Did not respond
16:07:28.6756:   rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Did not respond
16:07:28.6756: Cluster: 16384 of 16384 slots covered
16:07:28.6757: Endpoint Summary:
16:07:28.6757:   redishangfire.my.domain.com:6379: Cluster v6.2.6, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
16:07:28.6760:   redishangfire.my.domain.com:6379: int ops=13, qu=0, qs=0, qc=0, wr=0, socks=1, non-pref=1; sub ops=4, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
16:07:28.6786:   redishangfire.my.domain.com:6379: Circular op-count snapshot; int: 0+13=13 (1.30 ops/s; spans 10s); sub: 0+4=4 (0.40 ops/s; spans 10s)
16:07:28.6786:   rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Cluster v6.2.6, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
16:07:28.6787:   rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
16:07:28.6787:   rec1234abcdefgh-0002-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub:
16:07:28.6787:   rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Cluster v6.2.6, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
16:07:28.6788:   rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: int ops=11, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
16:07:28.6788:   rec1234abcdefgh-0001-001.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Circular op-count snapshot; int: 0+11=11 (1.10 ops/s; spans 10s); sub:
16:07:28.6788:   rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Cluster v6.2.6, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
16:07:28.6788:   rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
16:07:28.6788:   rec1234abcdefgh-0002-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub:
16:07:28.6788:   rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Cluster v6.2.6, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: n/a; not in use: DidNotRespond
16:07:28.6788:   rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: int ops=12, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=0, qu=0, qs=0, qc=0, wr=0
16:07:28.6789:   rec1234abcdefgh-0001-002.rec1234abcdefgh.620ymr.euw1.cache.amazonaws.com:6379: Circular op-count snapshot; int: 0+12=12 (1.20 ops/s; spans 10s); sub:
16:07:28.6789: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
16:07:28.6789: Starting heartbeat...
16:07:28.6792: Total connect time: 5,036 ms

The text was updated successfully, but these errors were encountered:

VladimirKhil · 2022-10-10T12:26:28Z

We have discovered similar problem with ServerEndPoint connection to Redis Cluster. Maybe it is related to your case.

We have a cluster with 3 primary and 3 replica nodes. So 6 ServerEndPoint client instances should be created.

When the client connects to the cluster (inside ConnectionMultiplexer.Connect() call), it calls ConnectionMultiplexer.ActivateAllServers() having only 1 node in its server snapshot (other 5 nodes are discovered later). So only a single ServerEndPoint with not-null interactive and not-null subscription fields is created.
After that other 5 nodes are discovered and 5 ServerEndPoint are created. This time only interactive fieled is initialized inside them. The subscription fields remain being zero (see ConnectionMultiplexer.GetServerEndPoint() called from ReconfigureAsync()).
After interactive being successfully connected to its corresponding node, ServerEndPoint.OnFullyEstablished() is called to activate the ConnectionMultiplexer completion handler (see ServerEndPoint.OnConnectedAsync()).
The key of the problem is the condition if (IsConnected && (IsSubscriberConnected || !SupportsSubscriptions)) inside ServerEndPoint.OnFullyEstablished(). IsSubscriberConnected is false because subscriber field is null but SupportsSubscriptions is true. So the endpoint is not marked as fully connected and callback is not called.
Because of that, else if (task.IsCompleted) condition inside ConnectionMultiplexer.ReconfigureAsync() is false. So server.SetUnselectable(UnselectableFlags.DidNotRespond); is called. And it is only cleared after the successful heartbeat run (which happens some time later).

This sequence leads to situation when 5 of 6 endpoints are marked as unselectable. This breaks Redis client functionality because it tries to send commands operating on slots which are not belonged to the single Redis node marked as selectable.

After that we got "EXECABORT Transaction discarded because of previous errors." error because the transaction is performed on wrong cluster node (and the redirection is disabled for transactions by default).

It is hard to extract the working code demonstrating the problem from our sources but I tried to explain it as much detailed as it is possible.

adomass · 2022-10-11T12:06:22Z

It is posssible that those tasks that wait for ConnectedEstablished state are never completed because server support subscription but subsription endpoint is not used.

 if (IsConnected && (IsSubscriberConnected || !SupportsSubscriptions))
 {
    // Only connect on the second leg - we can accomplish this by checking both
    // Or the first leg, if we're only making 1 connection because subscriptions aren't supported
    CompletePendingConnectionMonitors(source);
 }

Where public bool IsSubscriberConnected => subscription?.IsConnected == true;

Then if subscription is null meaning it was never created it will never call CompletePendingConnectionMonitors(source);

And this happens exactly because like @VladimirKhil says "ConnectionMultiplexer.ActivateAllServers() having only 1 node in its server snapshot (other 5 nodes are discovered later)"

mgravell · 2022-10-11T14:59:59Z

We suspect that this may be an additional instance of #2251, which we are investigating and discussing currently

mgravell · 2022-10-11T15:20:39Z

Looking at what is happening here, our hypothesis is that pub/sub simply isn't enabled in this environment; if this is correct, a pragmatic workaround here might be to simply tell the muxer to not try pub/sub - which can be done by adding ,$SUBSCRIBE= to the end of the connection-string. Is this possible to test? Thanks.

adomass · 2022-10-12T06:13:39Z

Pub/sub is not disabled. Without it hangfire would stop working. We also use pub/sub from code. My hypothesis is that tasks do not complete because bridge for subscription is not created when interactive bridge is fully established and that's why lock monitors are never released.
This could be an issue only in clustered mode enabled.
Since subscription bridge is created for configuration endpoint before cluster endpoints are even discovered.

VladimirKhil · 2022-10-12T07:40:56Z

Looking at what is happening here, our hypothesis is that pub/sub simply isn't enabled in this environment; if this is correct, a pragmatic workaround here might be to simply tell the muxer to not try pub/sub - which can be done by adding ,$SUBSCRIBE= to the end of the connection-string. Is this possible to test? Thanks.

We use Pub/Sub too but not immediately during connection time - some time later.

adomass · 2022-10-12T14:44:23Z

I didn't see this issue #2251 before but seems like we had made exactly same discovery independently and point to same place.
Now I am convinced that this is caused because subscription bridge is not created for cluster endpoints even if server does support them and this causes for tasks that await for connection never finish because of this if:

 if (IsConnected && (IsSubscriberConnected || !SupportsSubscriptions))
 {
    // Only connect on the second leg - we can accomplish this by checking both
    // Or the first leg, if we're only making 1 connection because subscriptions aren't supported
    CompletePendingConnectionMonitors(source);
 }

iteplov · 2022-10-12T18:38:05Z

i concur - i believe this is the same issue and we indeed came up with the same conclusion. we too use pub/sub extensively, so i don't think it points in the missing pub/sub support direction.

the issue is specific to a) clustered mode enabled and b) remaining clustered endpoints are discovered and not provided in the connection string.

additionally, the issue is a regression in 2.5.* - older driver works just fine in the same environment.

… discovery (#2268) Fix for #2251 and #2265 ensuring subscription connections are proactively created in all cases. Co-authored-by: Nick Craver <nrcraver@gmail.com>

NickCraver · 2022-10-16T00:02:45Z

Thanks to @iteplov a fix is on MyGet as 2.6.70 now - if anyone could test this version to make double sure there's nothing extra odd against AWS, I'll get it on NuGet proper. If you're able to test, it'd be hugely appreciated.

This was referenced Oct 12, 2022

Cluster Connect works partially, with a full timeout. #2251

Closed

2251: Fixes missing activation of the sub leg during initial endpoint discovery #2268

Merged

NickCraver linked a pull request Oct 15, 2022 that will close this issue

2251: Fixes missing activation of the sub leg during initial endpoint discovery #2268

Merged

NickCraver closed this as completed in #2268 Oct 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Endpoints marked as not selectable even though Connection is established on AWS ElasticCache clustered mode enabled #2265

Endpoints marked as not selectable even though Connection is established on AWS ElasticCache clustered mode enabled #2265

adomass commented Oct 6, 2022 •

edited

Loading

VladimirKhil commented Oct 10, 2022

adomass commented Oct 11, 2022 •

edited

Loading

mgravell commented Oct 11, 2022

mgravell commented Oct 11, 2022

adomass commented Oct 12, 2022 •

edited

Loading

VladimirKhil commented Oct 12, 2022

adomass commented Oct 12, 2022

iteplov commented Oct 12, 2022 •

edited

Loading

NickCraver commented Oct 16, 2022

Endpoints marked as not selectable even though Connection is established on AWS ElasticCache clustered mode enabled #2265

Endpoints marked as not selectable even though Connection is established on AWS ElasticCache clustered mode enabled #2265

Comments

adomass commented Oct 6, 2022 • edited Loading

VladimirKhil commented Oct 10, 2022

adomass commented Oct 11, 2022 • edited Loading

mgravell commented Oct 11, 2022

mgravell commented Oct 11, 2022

adomass commented Oct 12, 2022 • edited Loading

VladimirKhil commented Oct 12, 2022

adomass commented Oct 12, 2022

iteplov commented Oct 12, 2022 • edited Loading

NickCraver commented Oct 16, 2022

adomass commented Oct 6, 2022 •

edited

Loading

adomass commented Oct 11, 2022 •

edited

Loading

adomass commented Oct 12, 2022 •

edited

Loading

iteplov commented Oct 12, 2022 •

edited

Loading