Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SqlDistributedLock timeout issue? #5

Closed
clement911 opened this issue Mar 7, 2017 · 10 comments
Closed

SqlDistributedLock timeout issue? #5

clement911 opened this issue Mar 7, 2017 · 10 comments

Comments

@clement911
Copy link

clement911 commented Mar 7, 2017

Hi there.

Thank you for creating a cool library!
Good code too...

I use it in my web app with sql azure to make sure that certain long running operations can only be run once at a time.
It's been working fine but I just had a weird case and I thought maybe you may know something about it.

I have a block like this:

using (var @lock = new SqlDistributedLock("hello", connectionString).TryAcquire())
 {
         if (@lock == null)
               return;

         await doLongRunningOperationAsync()
}

doLongRunningOperationAsync can take quite a long time and I had a case in production where 2 requests managed to both acquire the exclusive lock!
The second one acquired the lock about 45 minutes after the first one had acquired the lock (it can take over an hour for this operation to complete...)

I use version 1.1.0.0 and as I understand it, when providing a connection string, an new connection and transaction will be created and the transaction will be the owner of the lock.

So it got me thinking. Maybe transactions have a timeout after which they are automatically closed? Maybe even the connection? In particular, sql azure can suffer transient issues and it is always a best practice to retry operations since they may fail because of temporary internal azure stuff.

So long story short, is it ok to use the SqlDistributedLock for long running operations???

@madelson
Copy link
Owner

madelson commented Mar 7, 2017

@clement911 Thanks for your interest in the library!

I have used this technique successfully for relatively long-running operations. I've even seen cases where a bug caused a lock to be held for days (to my dismay!). Obviously, if the database were shut down then you would lose the lock.

In your case, the example code looks fine (although note that you can use await ... TryAcquireAsync() instead): SQLAzure being the problem sounds reasonable.

This article lists out some of the Azure limitations that could be related:

Be mindful of SQL Azure's connection governor. SQL Azure has hard-wired limitations on usage and will throttle or disconnect database connections under heavy loads. Some common scenarios for connections being killed include transactions that use more than a gigabyte of logging, consume more than one million locks or which run for more than 24 hours; or sessions that use more than 16 MB for more than 20 seconds or which idle for more than 30 minutes. SQL Azure's throttling and limiting behaviors have been discussed in other venues as well. However, the better-tuned your application, the less likely it is to be throttled.

In this case, it seems like you might be running into the "idle" for 30 minutes limitations, although I'm not sure what the exact definition of "idle" would be. The 24-hour limitation also seems potentially problematic because of .NET connection pooling (although maybe the Azure pooling works around this for you under the hood somehow).

One approach would be to try out some of the different connection management options (https://github.com/madelson/DistributedLock#connection-management, introduced in 1.2) and see if the problem is specific to the transaction approach that was the default in 1.1.

You might be able to avoid the idle issue by using a lock scoped to an explicit SqlConnection object and launching an async task which pings on the connection periodically to prevent it from going idle. The thing to be careful of is to make sure the polling task finishes BEFORE we attempt to dispose the lock, since SqlConnections are not thread-safe. Something like:

using (var connection = new SqlConnection(connectionString))
{
    var handle = /* acquire lock on connection */;
    if (handle == null) { return; }

    var cts = new CancellationTokenSource();
    var pollingTask = Task.Run(() => PollConnectionAsync(connection, cts.Token));

    try
    {
         // do long running operation
    }
    finally
    {
        cts.Cancel();
        await pollingTask;
        handle.Dispose(); // safe since pollingTask is done
    }
}

private static async Task PollConnectionAsync(SqlConnection connection, CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
           // run a random query (e. g. SELECT 1) on connection

           try { await Task.Delay(TimeSpan.FromMinutes(10), token); }
           catch (OperationCancelledException) { } // don't throw; let the loop just return
    }
}

Please let me know if you try any of these and what you learn. For example, if the polling solution fixes the problem then we could easily add a new Azure-specific connection strategy as a built-in library feature.

@clement911
Copy link
Author

You put me on the right track and I found this:
https://github.com/HangfireIO/Hangfire/blob/ea3bc26a357dbedff52e93658013f3d52a05fe73/src/Hangfire.SqlServer/SqlServerDistributedLock.cs

This pretty much confirms your theory of the sql azure 30 minute idle limit, and they fixed it in Hangfire pretty much the same way that you suggested.

I guess that DistributedLock could use the same approach.
It could be an opt-in azure specific strategy but I'm thinking it should just be the default behaviour because this may happen in many other hosted environments, other clouds, etc...

The trippy thing is this comment:

private void ExecuteKeepAliveQuery(object obj)
        {
            lock (_lockObject)
            {
                try
                {
                    _connection?.Execute("SELECT 1;");
                }
                catch
                {
                    // Connection is broken. This means that distributed lock
                    // was released, and we can't guarantee the safety property
                    // for the code that is wrapped with this block. So it was
                    // a bad idea to have a separate connection for just
                    // distributed lock.
                    // TODO: Think about distributed locks and connections.
                }
            }
        }

If the connection is broken, then the lock is released and there is not much that can be done, but I think there should at least be an event that should be issued, so that, as a user of the app, I can trace this error.
I think a similar error event should be issued if when the connection is released (when disposing the lock), the state of the connection is not open. That means the lock was lost at some point, and that is of interesting to the user. I know I would to want to trace that in my app as this would help me understand weird behaviour where multiple requests have the same lock.

If we wanted to be smarter we could try to re-acquire the lock automatically when it is lost, but I think that's probably a can of worms to open later.
I'd suggest to start with this basic keep alive...

What do you think?

@madelson
Copy link
Owner

madelson commented Mar 8, 2017

@clement911 it's great to have that Hangfire example to confirm the theory. Were you able to try out the workaround code in your codebase to see if that fixed the issue for you (assuming it is reproducible)?

I definitely think it makes sense to support this at least via connection strategy. I'm more hesitant to support this by default since running extra queries and background threads adds overhead that, until your use-case, I hadn't seen the need for.

I think it probably makes sense to start by making this an option with the possibility of making it default over time (I'm taking a similar approach with the new ConnectionMultiplexing strategy which is probably a performance win most of the time). At minimum, locks created with explicit connections or transactions won't be able to do keepalive without violating thread-safety (since presumably the caller is continuing to use their connection as well). FWIW, this is the approach MSFT took for EF's Azure connection resiliancy.

An alternative idea would be to turn this behavior on based on detecting Azure connection strings. As someone who hasn't worked with Azure, I'm not sure if this is possible or not. Is there any identifying characteristic of azure connection strings?

As to your comment about wanting to know if the lock is dropped out from under you, I agree that this is a concern and it is something I have thought about in the past. As you say, re-acquisition is pretty risky. We may already have lost the lock, and even if we do successfully re-aquire there's no guarantee that someone else wasn't holding it in the meantime. Even with pure notification, there are challenges:

  1. Implementation-wise, we want to know if we lost the lock relatively quickly. Knowing 10 minutes after the fact may be too late! However, it's not clear to me how tight we can really get this. One potential approach would be to launch an asynchronous query executing WAITFOR, assuming that the connection loss will kill that query (I haven't tested). Of course, we need to stop this when it's time to release the lock. This could be done via query cancellation. All this adds overhead.

  2. API-wise we have to figure out how to expose this, preferably without burdening users who don't care about it with the cost. One potential approach is to expose a CancellationToken which represents the hold on the lock being canceled. Right now, we just return IDisposable as our lock handle so it's not clear where this would go. In the next major version, I'm planning to return a custom disposable type which would be a natural place to add something like this. We still have the issue that the approach outlined in Support within-connection/transaction locks #1 wouldn't work for user-provided connections and transactions due to thread-safety.

  3. Usage-wise, we still suffer from the fact that recovering from this situation will be very, very hard even when you know that it is happening. How frequently do you check the cancellation token? In what sorts of situations is there meaningful recovery action we can take, other than perhaps logging?

@clement911
Copy link
Author

No I haven't tried this work around yet, but I'd stay it is likely to work.

You may be right, let's have it as an opt-in strategy for now.

Regarding detecting azure connection string, I'm not too sure.
In my case the host name ends up in '.database.windows.net' but i'm not sure if that is reliable to test for that. I'm pretty sure that you can query the sql azure to ask it for its version though. That should be reliable but will incur an extra call after the lock is granted.

  1. How about listening to the ConnectionChanged event?

  2. I guess it could be an event in the custom disposable, or it could be an optional callback argument passed in the TryAcquire method. I think the latter maybe cleaner...

  3. I agree that you can't recover, but I'd still want to log it.

@madelson
Copy link
Owner

Unfortunately, the StateChanged event doesn't do what you'd want it to: it just fires when you call Open() or Close() on the connection object itself (see http://stackoverflow.com/questions/37442983/when-is-dbconnection-statechange-called). Pretty frustrating. I'll try to get a new version out with an option for keepalive. As mentioned, supporting connection monitoring would require a breaking API change and thus has to wait for the next major version (hopefully not too far in the future!).

In the meantime, it would be possible to try implementing a monitoring layer outside the library by passing in a custom SQL connection (similar to the keepalive workaround I showed above).

@clement911
Copy link
Author

Shame about the ConnectionChanged event...
Thank you for your help with this!

@madelson
Copy link
Owner

@clement911 I've released a new 1.3 version containing a keepalive implementation. This can be used by passing in the Azure connection strategy when constructing the lock. Let me know how it works!

For the connection monitoring API, I've created a separate issue to track (#6).

@clement911
Copy link
Author

You're a legend! The code looks really good too.
I will push that to production and monitor the results...
Thanks again.

@clement911
Copy link
Author

I haven't hit that issue since I deployed to prod so I guess it works well.
Thanks once again.

@madelson
Copy link
Owner

madelson commented Apr 1, 2017

Glad it's working and thanks for letting me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants