
known pattern flaw #5

Open
royaltm opened this issue Feb 25, 2013 · 30 comments

Comments

royaltm commented Feb 25, 2013

The pattern from the official SETNX doc used in @kenn's implementation is flawed.
The implementation itself, however, is really nice and I like it.
Nevertheless, my habit of testing other people's ideas (especially before implementing them and using them in production) pushed me to play a little with the redis-mutex gem.

Unfortunately, the setnx/getset pattern may lead to what I call "phantom starvation" of two (or more) concurrent processes (or threads) when they try to enter the critical section at the wrong time.

Here's a simple stress test:

require 'redis-mutex'

REDIS_OPTS = {}

lck = '__STRESSTEST__.lck'
Redis::Classy.db = Redis.new(REDIS_OPTS)
rmutex = Redis::Mutex.new(lck, expire: 5, block: 10000, sleep: 0.01)
10000.times do |i|
  rmutex.with_lock do
    print "\r %6d" % i
    sleep rand / 100
  end
end

Now spawn two terminal windows (or create a split window in tmux) and run the above script in both.
When you start the first one, you'll see it counting.
But after a short while (or even right from the start), once you run the second one, the counting stops periodically.
The scripts will eventually reach 9999 (if you are patient enough), but they spend most of their time sleeping, waiting for the lock to expire.

I suppose @kenn is already aware of this, but since there is no BUGS/LIMITATIONS section in the README, I'm writing this issue for the record.

I also think it is our (the coders') responsibility to warn others about known issues and limitations, especially when you are the keeper of a quite popular rubygem name. Noblesse oblige (and popularity should, too) in the open source crowd. :)

Here are the reconstructed steps behind the vulnerability:

For clarity there are three actors in the example (A, B, C) though C could also be A.

now = 1
A has locked the resource (with expire_at=2)
A tries to unlock, at the same time B tries to lock (with expire_at=3)
A: get == 2 ? YES
B: setnx(3) ? NO
A: del
B: get > now ? NO (get is 0)
C now tries to lock with (expire_at=4)
C: setnx(4) ? YES -> C is the owner now
B: getset(3) <= now ? NO
in the end B is not the owner, but it has managed to alter the lock value

therefore C won't be able to unlock:
C: get == 4 ? NO (get is 3)
so the lock remains stale and all the hungry processes must wait until it expires
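The interleaving above can be replayed deterministically with a tiny in-memory stand-in for the four Redis commands involved (a sketch; `FakeRedis` is a hypothetical helper written for this illustration, not part of any gem):

```ruby
# Minimal in-memory imitation of the Redis commands used by the
# setnx/getset pattern, so the A/B/C interleaving can be replayed
# without a running server.
class FakeRedis
  def initialize
    @store = {}
  end

  def setnx(key, val)
    return false if @store.key?(key)
    @store[key] = val
    true
  end

  def getset(key, val)
    old = @store[key]
    @store[key] = val
    old
  end

  def get(key)
    @store[key]
  end

  def del(key)
    @store.delete(key)
  end
end

redis = FakeRedis.new
key = 'lock'

redis.setnx(key, 2)         # A holds the lock (expire_at = 2), now = 1
redis.setnx(key, 3)         # B: setnx(3) ? NO
redis.del(key)              # A: del (A's unlock)
redis.get(key)              # B: get -> nil, so not > now
redis.setnx(key, 4)         # C: setnx(4) ? YES -> C is the owner
redis.getset(key, 3)        # B: getset(3) -> 4; 4 > now, so B backs off,
                            # but B has already overwritten C's value
redis.get(key)              # => 3, so C's unlock check (get == 4) fails
```

The final `get` shows the poisoned state: C owns the lock, but the stored value is B's timestamp, so C can never pass its own unlock check.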

kenn (Owner) commented Mar 6, 2013

Excellent catch, @royaltm!

So it seems that if we change this logic:

B: get > now ? NO (get is 0)

to this:

B: get (is nil or) > now ? YES (get is nil)

the problem would be solved?

nviennot (Contributor) commented Mar 7, 2013

This should solve the issue: #6

royaltm (Author) commented Mar 7, 2013

Hello!
Your patch solves the issue in this particular example. However, it's only a small trick that works around this particular manifestation. The real problem with the redis-mutex pattern is that its read - check - write operations are not atomic; they are always prone to race conditions.

I've just come up with another example that leads to a much more harmful situation:

now = 2
A has locked the resource, but its key value has already expired (expire_at=1)
B tries to lock
B: setnx(3) ? NO
B: get > now ? NO (get is 1)
A tries to unlock
A: get == 1 ? YES
B: getset(3) <= now ? YES
A: del

in the end B appears to be the owner of the lock, but the key has actually been deleted by A. B has acquired the lock based on a premature assumption. This is much worse, because B is now fiddling in an unprotected critical section.

Of course you might still want to go down that road, adjusting this algorithm forever, hoping that no other race condition pattern emerges.

However, if you want redis-mutex to be truly robust, IMHO you should either:

  • make sure no race condition is possible, using Redis optimistic transactions (WATCH/MULTI), or
  • move the read - check - write logic into server-side scripting (no other command can run on Redis while a script is executing)

I would start with changing unlock in the following manner:

WATCH key
GET key == me ?
YES: MULTI
     DEL key
NO:  UNWATCH

Please look here.
This one has a well-designed locking pattern.
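In redis-rb terms, that unlock sketch might look like the following (an illustrative sketch: `safe_unlock` and `FakeClient` are hypothetical names; the fake only mimics the handful of redis-rb calls used, so the logic can run without a server, and with the real gem you would pass a `Redis.new` connection instead):

```ruby
# WATCH-based unlock: delete the key only if it still holds the value we
# wrote on lock, and let MULTI/EXEC abort if anyone touched it in between.
def safe_unlock(client, key, my_expires_at)
  client.watch(key) do
    if client.get(key) == my_expires_at.to_s
      # EXEC returns nil when the watched key changed, so we never delete
      # a lock that another process has re-acquired
      !client.multi { |tx| tx.del(key) }.nil?
    else
      client.unwatch   # not ours (expired and taken over): leave it alone
      false
    end
  end
end

# Minimal in-memory imitation of the redis-rb calls used above.
class FakeClient
  def initialize
    @store = {}
    @version = 0
  end

  def set(key, val)
    @store[key] = val
    @version += 1
  end

  def get(key)
    @store[key]
  end

  def del(key)
    @store.delete(key)
    @version += 1
  end

  def watch(_key)
    @watched = @version
    yield
  end

  def unwatch
    @watched = nil
  end

  def multi
    return nil if @version != @watched   # key changed since WATCH: abort
    yield self
    []
  end
end

client = FakeClient.new
client.set('lock', '42')
safe_unlock(client, 'lock', 42)   # => true, and the key is gone
safe_unlock(client, 'lock', 42)   # => false, nothing left to delete
```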

nviennot (Contributor) commented Mar 7, 2013

The unlock test is indeed racy; that's another issue. The WATCH solves it, yes.
How about a Lua script to avoid roundtrips?
https://github.com/crowdtap/promiscuous/blob/master/lib/promiscuous/redis.rb#L132-L151

nviennot (Contributor) commented Mar 7, 2013

Scrap that, it's racy too, with a race on the getset (@expires_at won't be the same).
I guess I'll go with a Lua script for both sides. (I need to take multiple locks at the same time.)

nviennot (Contributor) commented Mar 7, 2013

https://github.com/crowdtap/promiscuous/blob/master/lib/promiscuous/redis.rb#L116-L154
lock/unlock is now a Lua script. Much easier, and probably faster (only one network roundtrip per operation).
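The single-roundtrip idea can be sketched like this: the whole check-and-act sequence runs atomically inside the server via EVAL (Redis 2.6+), so no interleaving is possible. These scripts are illustrative only, following the expires_at scheme discussed in this thread; they are not the promiscuous gem's actual scripts.

```ruby
# require 'redis'  # redis-rb would be needed for the usage shown at the bottom

# Acquire the lock if the key is absent or holds an expired timestamp.
# KEYS[1] = lock key, ARGV[1] = new expires_at, ARGV[2] = current time.
LOCK_SCRIPT = <<~LUA
  local current = redis.call('get', KEYS[1])
  if current == false or tonumber(current) < tonumber(ARGV[2]) then
    redis.call('set', KEYS[1], ARGV[1])
    return 1
  end
  return 0
LUA

# Release the lock only if we still own it (value matches our expires_at).
UNLOCK_SCRIPT = <<~LUA
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
  end
  return 0
LUA

# Usage against a live server (hypothetical key name):
# redis  = Redis.new
# now    = Time.now.to_f
# locked = redis.eval(LOCK_SCRIPT, keys: ['my.lock'],
#                     argv: [(now + 5).to_s, now.to_s]) == 1
```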

royaltm (Author) commented Mar 7, 2013

For multiple locks you might take a look at my gem (EM only): https://github.com/royaltm/redis-em-mutex

nviennot (Contributor) commented Mar 7, 2013

I don't like EM, I like Celluloid, and I need something thread-safe.
Quite honestly, this code is unreadable: https://github.com/royaltm/redis-em-mutex/blob/master/lib/redis/em-mutex/script_handler.rb#L162-L213. Way too much complexity for me to verify that it works as intended.
But thanks :)

royaltm (Author) commented Mar 8, 2013

probably depends on who's reading it :), but no problem

kenn (Owner) commented Mar 8, 2013

A number of projects are still on Redis 2.4, which doesn't support Lua scripting. I would go with optimistic locking with WATCH for now. The Lua-based logic would be a nice addition for Redis 2.6, though.

I'm going to change this code in try_lock

return false  if get.to_f > now

to this

return false  if (old_value = get).nil? or old_value.to_f > now

and fix the first case.

Then for unlock, I'll try implementing the WATCH command. How does that sound?

nviennot (Contributor) commented Mar 8, 2013

That sounds good, but you are still screwed, because you can getset() a lock that you don't own.
And so the unlock check doesn't work on a lock you've recovered while someone was racing with you.

kenn (Owner) commented Mar 8, 2013

The point of this fix is that you never get to getset, as it returns false immediately. In other words, the get checks two things: whether the lock is still effective (get.to_f > now) and whether someone released the lock in the middle (get.nil?). Am I still missing something?

nviennot (Contributor) commented Mar 8, 2013

After that, you still need getset to acquire an expired lock. If you raced with someone while doing the getset, unlock is broken, because the lock won't contain @expires_at.

royaltm (Author) commented Mar 8, 2013

Proposed locking pattern:

until setnx # loop until it succeeds; this is the only place where we can be sure the locked key belongs to us
  watch do
    if (old_value = get).nil? # try again, it's probably free now
      unwatch
      next
    end
    if old_value.to_f > now
      unwatch
      return false # someone else holds a valid lock, leave now
    else # get <= now - possibly expired
      # try to delete the expired lock
      return false unless multi {|m| m.del } # someone else has messed with this key, leave now
      # otherwise try again
    end
  end
end
return true

Together with the unlock pattern above, this will solve the problem.
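Translated into redis-rb-style calls, the pattern above might look like the following sketch (`try_lock` and `FakeClient` are hypothetical names invented for this illustration; the fake only imitates the commands used, so the logic can run stand-alone, while real WATCH semantics are of course enforced server-side):

```ruby
# Proposed locking pattern: ownership is certain only after a successful
# SETNX; an expired key is reclaimed by deleting it under WATCH/MULTI and
# then retrying SETNX.
def try_lock(client, key, expires_at, now)
  until client.setnx(key, expires_at)
    gave_up = false
    client.watch(key) do
      old_value = client.get(key)
      if old_value.nil?
        client.unwatch              # freed in the meantime: retry setnx
      elsif old_value.to_f > now
        client.unwatch
        gave_up = true              # someone holds a valid lock: leave
      else
        # possibly expired: delete it atomically, then retry setnx;
        # nil from multi means someone touched the key, so leave
        gave_up = true if client.multi { |tx| tx.del(key) }.nil?
      end
    end
    return false if gave_up
  end
  true
end

# Minimal in-memory imitation of the redis-rb calls used above.
class FakeClient
  def initialize
    @store = {}
    @version = 0
  end

  def setnx(key, val)
    return false if @store.key?(key)
    @store[key] = val
    @version += 1
    true
  end

  def get(key)
    v = @store[key]
    v.nil? ? nil : v.to_s
  end

  def del(key)
    @store.delete(key)
    @version += 1
  end

  def watch(_key)
    @watched = @version
    yield
  end

  def unwatch
    @watched = nil
  end

  def multi
    return nil if @version != @watched   # watched key changed: abort
    yield self
    []
  end
end

client = FakeClient.new
try_lock(client, 'lock', 10.0, 1.0)   # => true (fresh key, setnx wins)
try_lock(client, 'lock', 11.0, 1.0)   # => false (a valid lock is held)
try_lock(client, 'lock', 30.0, 20.0)  # => true (expired: deleted, retaken)
```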

nviennot (Contributor) commented Mar 8, 2013

@royaltm 👍

nviennot (Contributor) commented Mar 8, 2013

One caveat, however: how does WATCH work when multiple threads are using the same Redis connection?

royaltm (Author) commented Mar 8, 2013

RTFS https://github.com/redis/redis-rb/blob/master/lib/redis.rb#L1995 - synchronize to the rescue

nviennot (Contributor) commented Mar 8, 2013

I did read the fucking sources.

Thread1: watch
Thread2: get XX
Thread1: get YY

Thread1 will have Thread2's "get" inside its watch.
The point is, watch doesn't work well with threads and a single Redis connection.

royaltm (Author) commented Mar 8, 2013

The F stands for Fine ;)
"doesn't work well" - why assume? watch with a block is guarded by a monitor, so only one thread at a time can watch and do its work inside.

nviennot (Contributor) commented Mar 8, 2013

The synchronization that you get is this one:

Thread1: lock.synchronize { watch }
Thread2: lock.synchronize { get XX }
Thread1: lock.synchronize { get YY }

Thread1 did not expect to have XX watched. That's a problem.

royaltm (Author) commented Mar 8, 2013

Nope, I see it differently.
The synchronization is actually this, since the yield happens inside the synchronize block in watch:
Thread1: lock.synchronize { watch do blablabla end }
Thread2: lock.synchronize { watch do blablabla end }
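This point can be demonstrated with a toy model: since the yield happens inside synchronize, another thread's commands can never land between WATCH and EXEC. `ToyClient` below is a hypothetical sketch of the behavior, not redis-rb's actual code.

```ruby
require 'monitor'

# Toy model of redis-rb's watch-with-block: WATCH, the block's commands,
# and EXEC all run while holding the client's monitor, so other threads'
# commands are fenced out of the watch section.
class ToyClient
  include MonitorMixin

  attr_reader :log

  def initialize
    super()
    @log = []
  end

  def watch(key)
    synchronize do
      @log << [:watch, key]
      yield
      @log << [:exec, key]
    end
  end

  def get(key)
    # monitors are reentrant, so the thread that owns the watch section
    # can still issue its own commands inside it
    synchronize { @log << [:get, key] }
  end
end

client = ToyClient.new
t1 = Thread.new { client.watch('a') { client.get('a') } }
t2 = Thread.new { client.get('b') }
[t1, t2].each(&:join)

# In client.log, [:get, "b"] appears either before [:watch, "a"] or after
# [:exec, "a"], never in between.
```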

nviennot (Contributor) commented Mar 8, 2013

OHhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh ok.
I misread the sources :D

royaltm (Author) commented Mar 8, 2013

don't misread your sources then :)

nviennot (Contributor) commented Mar 8, 2013

so 👍 for your solution, I don't see any issues with it.

royaltm (Author) commented Mar 8, 2013

The thing is: whether and how are watch/multi supported in Redis::Classy? @kenn?

nviennot (Contributor) commented Mar 8, 2013

It's using method_missing, so it should be just fine.
I don't like Classy though :)

royaltm (Author) commented Mar 8, 2013

Every man to his taste.

nviennot (Contributor) commented Mar 8, 2013

Yes

mkristian commented

Is this still an open issue, or has it been solved?

kenn (Owner) commented Dec 18, 2016

I've created a PR for this at #27 - if anyone can review, I'd appreciate it.
