-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster Connect works partially, with a full timeout. #2251
Comments
For reference, same setup but with 2.2.88, same connection sequence, everything works:
|
For another reference, same setup with 2.6.66 and spelled out endpoints. everything works.
EDIT: initially grabbed it from 2.2.88 by mistake, now put the correct one. |
Thanks for the detailed info here! It may be 1-2 days before I get to poke hard at this (kiddo birthday tomorrow) but lots to go on, thanks a ton for that <3 |
Apologies on time here - a day after posting this said 8yo also brought COVID round 2 home for fun - still catching up on work and OSS here. |
no worries, hope you all are well. this is not extremely urgent, considering the circumstances. thank you for the heads up. |
@iteplov - can you let me know how did you enable the detailed logging ? |
something along these lines. not working code, i just typed it here but hopefully you get the gist var logger = new MyLogger();
var multiplexer = ConnectionMultiplexer.Connect(redisOptions, logger); public class MyLogger: TextWriter
{
public override Encoding Encoding => Encoding.Default;
public override void WriteLine(string value) => Debug.WriteLine(value);
public override void Write(string value) => Debug.Write(value);
} |
Thank you @iteplov |
Looking at what is happening here, our hypothesis is that pub/sub simply isn't enabled in this environment; if this is correct, a pragmatic workaround here might be to simply tell the muxer to not try pub/sub - which can be done by adding |
@mgravell i was not smart enough to figure out how to tweak the connection string but i made this change here, which i think should result in the same outcome: /// <summary>
/// The default commands specified by redis.
/// </summary>
public static CommandMap Default { get; } = CreateImpl(null, exclusions: new HashSet<RedisCommand> { RedisCommand.SUBSCRIBE, RedisCommand.PSUBSCRIBE }); it does indeed fix the issue - connections can be established. unfortunately, it breaks the app, since we do need pub/sub and been using it for some time. please see this comment as well #2265 (comment). i think these two issues can be combined as a dup.
|
i took a stab at it and this "works on my machine": #2268 |
Thanks @iteplov! Fix is on MyGet as 2.6.70 now - if you could test this version to make double sure there's nothing extra odd against AWS, I'll get it on NuGet proper right after <3 |
2.6.70 is confirmed to pass our internal tests and seems to be working fine in test environments. Thanks a bunch for your attention to this issue and the prompt release. |
Thanks a ton for all the help here, very much appreciated! |
AWS Redis 6.2.5 cluster 2 shards, each with replica, with a config endpoint. FWIW, the config endpoint is a DNS alias of a random cluster node, I believe.
Originally, a .NET Core app uses SE.Redis 2.2.88, configured with that single config endpoint. Everything works.
After upgrading to SE.Redis 2.6.66, random Redis commands started to return non-sensical errors that were reproducible somewhat in 50% of the cases.
E.g. a simple transaction with a single operation against a single random key would give EXECABORT in 50% of the runs. (was not the root issue but the specific reason for that was that in 50% tx would be sent to the server as an empty "multi"/"exec" pair, i.e. missing the actual op, according to
redis-cli monitor
).Additionally, the time of the initial
ConnectionMultiplexer.Connect
would be always aroundConnectTimeout
, regardless of the value. Very much similar to what is described in this issue here.All errors/delays would disappear if all 4 endpoints would be spelled out in configuration, i.e. the issue would take place only if the muxer has to discover remaining endpoints from the provided one.
The described issues were reproduced in tests, so what the app does and where it runs seems irrelevant. The configuration is mostly vanilla (dumped below) and didn't change during the upgrade. The app uses custom SSL callback that allows "*.amazonaws.com" certs, not sure if that makes a difference here.
Logs
SE.Redis 2.6.66 connection sequence (ConnectTimeout = 30s), from
ConnectionMultplexer.Connect
Theory
After debugging for some time, I believe the issue was introduced in #1947. Specifically this:
Even more specifically, this line, that was later slightly updated in #2001, but not in a way to help here (all my cluster nodes use subscriptions).
Super-specifically, during activation of the discovered endpoints, the connection sequence is awaiting on these tasks. These are TCS tasks that are set completed when this happens.
And this now (well, since 2.5.*) requires a Subscription
PhysicalBridge
present on that endpoint. For the life of me I don't see it being created during the Reconfigure flow anywhere but here, which clearly happens only for the initially provided endpoints.$1M Question
How come it's been out for that long and nobody noticed? Things go rather bad in this partially connected state. So I mostly hope that my theory is missing something obvious and all of that is fixable with a simple option flip.
Thanks for looking into this and all your work!
The text was updated successfully, but these errors were encountered: