Sporadic issue with RedLock.GetHost #112

Open
olivierboucher opened this issue Mar 7, 2024 · 4 comments · May be fixed by #113
@olivierboucher

Hi,

We're running into this issue every now and then on our production environment:

System.AggregateException: One or more errors occurred. (The specified endpoint is not defined (Parameter 'endpoint'))
 ---> System.ArgumentException: The specified endpoint is not defined (Parameter 'endpoint')
   at RedLockNet.SERedis.RedLock.GetHost(IConnectionMultiplexer cache)
   at RedLockNet.SERedis.RedLock.LockInstance(RedisConnection cache)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass32_0`2.<ForEachWorker>b__0(Int32 i)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`2.<ForWorker>b__1(RangeWorker& currentWorker, Int64 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Parallel.<>c__DisplayClass19_0`2.<ForWorker>b__1(RangeWorker& currentWorker, Int64 timeout, Boolean& replicationDelegateYieldedBeforeCompletion)
   at System.Threading.Tasks.TaskReplicator.Replica.Execute()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.TaskReplicator.Run[TState](ReplicatableUserAction`1 action, ParallelOptions options, Boolean stopOnFirstFailure)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal,TInt](TInt fromInclusive, TInt toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
--- End of stack trace from previous location ---
   at System.Threading.Tasks.Parallel.ThrowSingleCancellationExceptionOrOtherException(ICollection exceptions, CancellationToken cancelToken, Exception otherException)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal,TInt](TInt fromInclusive, TInt toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
   at RedLockNet.SERedis.RedLock.Acquire()
   at RedLockNet.SERedis.RedLock.Start()
   at EverflowIdentity.Services.DistributedNewKeyLock`1.LockAsync(Int32 millisecondsTimeout) in /app/EverflowIdentity/Services/DistributedNewKeyLock.cs:line 26

It seems to happen randomly and sometimes months go by without encountering the issue.

Here is how we initialize the factory:

    _multiplexer = ConnectionMultiplexer.Connect(Configuration["RedisHost"]);

    // A single multiplexer, wrapped for RedLock
    var endPoints = new List<RedLockMultiplexer>
    {
        _multiplexer,
    };
    _redLockFactory = RedLockFactory.Create(endPoints);

    services.AddSingleton(_redLockFactory);

The code creating the lock:

    // CreateLock blocks until the lock is acquired or the wait time elapses
    // (expiry: 1 min, wait: 1 min, retry interval: 200 ms)
    _lock = _redLockFactory.CreateLock("identity_key_lock",
        TimeSpan.FromMinutes(1),
        TimeSpan.FromMinutes(1),
        TimeSpan.FromMilliseconds(200));

    if (!_lock.IsAcquired)
    {
        return Task.FromResult(false);
    }

    return Task.FromResult(true);
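
For reference, the RedLock.net README acquires and releases locks with a using block, since disposing the IRedLock releases it. A minimal sketch of that pattern, reusing this issue's resource name and timings:

    // Sketch of the documented RedLock.net usage pattern: the lock is
    // released on dispose, so work happens inside the using block.
    using (var redLock = _redLockFactory.CreateLock("identity_key_lock",
        TimeSpan.FromMinutes(1),            // expiry
        TimeSpan.FromMinutes(1),            // wait
        TimeSpan.FromMilliseconds(200)))    // retry interval
    {
        if (redLock.IsAcquired)
        {
            // do work while holding the distributed lock
        }
    }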

Any idea what the root cause of this sporadic issue could be, or how to mitigate it?

@Tasteful

Tasteful commented Jun 24, 2024

We are getting the same issue in one of our production environments, and I have been able to track down the reason.

First, we are using Redis Sentinel with a master-replica setup in HA, running as a stateful set in Kubernetes installed via the bitnami/redis Helm chart. We configured useHostnames: false when deploying the chart, to get IPs instead of hostnames. I have since read that it may be better to use hostnames when running services inside Kubernetes, but I haven't had time to try that out yet.

Every now and then, the Redis pod inside Kubernetes is rescheduled onto a new worker, or the Redis deployment is updated so the pod is recreated. Sometimes the pod gets the same IP back and everything keeps working, but often it gets a new one. Sentinel keeps track of which IPs are active and should be used, but it also keeps a list of old IPs that are no longer active (bitnami/charts#5418). IConnectionMultiplexer.GetEndPoints() returns all known endpoints, even ones we can no longer connect to, and IConnectionMultiplexer.GetServer(endPoint) throws an ArgumentException when called for an endpoint that isn't connected. This blows up and makes the lock throw an exception instead of returning a real result.
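
To illustrate the failure mode (a minimal sketch, not the library's exact code; multiplexer stands in for an existing IConnectionMultiplexer):

    // GetEndPoints() can still list stale endpoints after a sentinel
    // failover; GetServer() throws ArgumentException for any endpoint
    // the multiplexer no longer tracks, which is what kills GetHost.
    foreach (var endPoint in multiplexer.GetEndPoints())
    {
        try
        {
            var server = multiplexer.GetServer(endPoint);
            Console.WriteLine($"{endPoint} ({(server.IsReplica ? "replica" : "master")}" +
                $"{(server.IsConnected ? "" : ", disconnected")})");
        }
        catch (ArgumentException)
        {
            // Leftover endpoint from a previous topology; report it instead of failing
            Console.WriteLine($"{endPoint} (can't connect to host)");
        }
    }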

I copied the GetHost method into a controller that writes out the resulting information. After adding the try/catch from the linked PR, the output looks like this:

10.42.12.167:6379 (Cant connect to host), 10.42.12.208:6379 (Cant connect to host), 10.42.34.52:6379 (Cant connect to host), 10.42.34.67:6379 (Cant connect to host), 10.42.12.208:6379 (Cant connect to host), 10.42.12.67:6379 (master, disconnected), 10.42.34.52:6379 (Cant connect to host), 10.42.12.167:6379 (Cant connect to host), 10.42.34.128:6379 (master, disconnected), 10.42.34.52:6379 (Cant connect to host), 10.42.34.67:6379 (Cant connect to host), 10.42.12.67:6379 (master, disconnected), 10.42.34.165:6379 (master), 10.42.34.78:6379 (slave), 10.42.34.128:6379 (master, disconnected), 10.42.12.60:6379 (slave), 10.42.12.67:6379 (master, disconnected), 10.42.34.165:6379 (master), 10.42.34.78:6379 (slave), 10.42.34.128:6379 (master, disconnected), 10.42.12.60:6379 (slave), 10.42.12.67:6379 (master, disconnected)

In general I think this logging could be improved (or cached) to give a better result; as it stands the connection information is not clear, and the enumeration consumes resources on every execution that uses the GetHost method.
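
One possible shape for that caching (a hypothetical sketch, not part of the linked PR): memoize the description per multiplexer so the endpoint enumeration runs once instead of on every lock attempt, accepting that the cached text can go stale after a failover.

    // Hypothetical cache keyed by multiplexer; suitable for logging only,
    // since the cached description is not refreshed after topology changes.
    private static readonly ConcurrentDictionary<IConnectionMultiplexer, string> HostCache = new();

    private static string GetHostCached(IConnectionMultiplexer cache)
    {
        return HostCache.GetOrAdd(cache, GetHost);
    }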

Tasteful linked a pull request Jun 24, 2024 that will close this issue
@Tasteful

@samcook Do you have time to look into this and the linked PR? Maybe a patch release with this?

Our workaround of using hostnames instead of IPs did not work, because we ran into another issue where Redis Sentinel went into TILT mode, so we had to revert that change.

@samcook (Owner)

samcook commented Jun 26, 2024

Hi @Tasteful,

I've had a look and managed to reproduce the issue. It seems like there's an issue with StackExchange.Redis when it loses its current sentinel connection (and your Redis instances are on ephemeral IPs, like in Kubernetes): after reconnecting, it seems to retain some endpoints that it has no matching 'server' entries for.

Anyway, the proposed PR looks like it's probably a reasonable solution. I'll take a look at getting that merged in and pushing out a new release tomorrow.

As an aside though, looking at the StackExchange.Redis behaviour, if you can solve the TILT mode problem it's probably best to use hostname mode, as over time StackExchange.Redis seems to end up with more and more of these phantom connections in its list of endpoints.

@Tasteful

Thanks!

Yes, the TILT mode issue actually exists in another report, bitnami/charts#9689, and the solution there is to use IP addresses instead of hostnames :)

Earlier today I created my own version of the RedLock DLL and injected it into the deployment pipeline, and since then there have been no logs about distributed lock exceptions.
