mv no semaphore


Status

  • merged to master - July 12, 2016
  • code complete - July 12, 2016
  • development started - July 8, 2016
  • concept RFC circulated - July 7, 2016

History / Context

Basho's "hot threads" thread pool started within eleveldb, Basho's Erlang-to-leveldb interface. Customers and software developers discovered that Erlang's schedulers would sometimes "collapse", taking themselves off-line, if the scheduler threads directly executed leveldb operations. The solution was to have each Get, Write, or Iterate operation pass from the Erlang scheduler thread to an independent thread owned by eleveldb.

The initial implementation of independent threads within eleveldb was a traditional work queue and thread pool design. A mutex and condition variable guarded the work queue. Each Erlang operation request took ownership of the work queue's mutex, added the work item, then signaled the thread pool via the condition variable. One or more threads from the pool would wake, take ownership of the work queue's mutex, remove the work item, and execute the necessary task. This implementation worked. Unfortunately, Get and Write operations were now 20 to 25% slower, and Iterator operations were 45% slower. This was not a shippable implementation.
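
As a point of reference, the shape of that traditional design is roughly as follows. This is a minimal illustrative sketch using C++ standard library primitives and made-up names (WorkItem, WorkQueue); it is not the eleveldb code, which used its own primitives.

        // Minimal sketch of a mutex/condition-variable work queue and the
        //  matching worker loop.  Names are illustrative only.
        #include <condition_variable>
        #include <deque>
        #include <functional>
        #include <mutex>

        typedef std::function<void()> WorkItem;

        class WorkQueue
        {
        public:
            // called on the Erlang scheduler thread:
            //  lock the queue, add the item, signal the pool
            void Submit(WorkItem item)
            {
                {
                    std::lock_guard<std::mutex> guard(m_Mutex);
                    m_Queue.push_back(std::move(item));
                }
                m_Condition.notify_one();
            }   // Submit

            // body of each pool thread:
            //  lock the queue, sleep until work arrives, remove it, execute it
            void WorkerLoop()
            {
                for (;;)
                {
                    WorkItem item;
                    {
                        std::unique_lock<std::mutex> guard(m_Mutex);
                        m_Condition.wait(guard, [this] { return !m_Queue.empty(); });
                        item = std::move(m_Queue.front());
                        m_Queue.pop_front();
                    }
                    item();   // run the task outside the lock
                }
            }   // WorkerLoop

        private:
            std::mutex m_Mutex;
            std::condition_variable m_Condition;
            std::deque<WorkItem> m_Queue;
        };  // WorkQueue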

The "hot threads" implementation followed several months of experimentation. The hot threads implementation was only 5% slower on Get and Write operations. When combined with Iterator prefetch, the Iterator operations were slightly faster than the original Erlang scheduler thread operations. By the time Riak 1.4 shipped, there was a net performance gain across all operations due to other improvements outside of threading.

"hot threads" was initially used only with eleveldb. In the fourth quarter of 2013, the code was cleaned up and copied into leveldb (see https://github.com/basho/leveldb/wiki/mv-hot-threads). The code review related to the leveldb usage spotted a race condition. It was possible for a work task to get posted to the work queue and no thread start executing the task. The fix for the race condition was to create an extra thread based upon a semaphore that was independently notified of every task placed upon the work queue. Only leveldb's version of hot threads received this fix. eleveldb's version was not immediately fixed. Work by @agilesoftware in mid 2015 independently identified the same race condition in eleveldb (see https://github.com/basho/eleveldb/issues/141). eleveldb's race condition was fixed when both modules, eleveldb and leveldb, began using a single, common code base within leveldb.

The semaphore-based correction to the race condition was an expedient solution, but it was also considered tacky. The tackiness began to show itself once Apple's compiler started issuing deprecation notices for semaphores. This branch eliminates the extra semaphore-based thread in favor of a cleaner fix that covers the race condition.

The race condition

The following code is from leveldb's util/hot_threads.cc (release tag 2.0.22), routine HotThreadPool::Submit():

        /** NOTE: this code executes on the Erlang scheduler threads **/
        else if (OkToQueue) 
        {
            // Execution here means no worker thread is available
            //  and work task should be queued for first available
            { 
                /** NOTE: proposed fix goes here **/

                item->m_QueueStart=Env::Default()->NowMicros();

                // no waiting threads, put on backlog queue
                {
                    SpinLock lock(&m_QueueLock);
                    inc_and_fetch(&m_WorkQueueAtomic);
                    m_WorkQueue.push_back(item);
                }
            }

            // to address race condition, thread might be waiting now
            FindWaitingThread(NULL, true);

            /** NOTE:  code from here down eliminated by proposed fix **/

            // to address second race condition, send in QueueThread
            //   (thread not likely good on OSX)
            if (m_QueueThread.m_ThreadGood)
            {
                 ....
            }   // if

And its race partner in HotThread::ThreadRoutine() within the same source file:

        /** NOTE:  this code executes on the independent worker threads **/
        // no work found, attempt to go into wait state
        //  (but retest queue before sleep due to race condition)
        else
        {
            MutexLock lock(&m_Mutex); // each worker has an independent m_Mutex

            m_DirectWork=NULL; // safety

            // only wait if we are really sure no work pending
            if (0==m_Pool.m_WorkQueueAtomic)
            {
                // yes, thread going to wait. set available now.
                m_Available=1;
                m_Condition.Wait();
            }    // if

It is conceivable, and possible to simulate by adding sleep() statements (a hypothetical placement is sketched after this list), that:

  • an Erlang thread executing the first block of code enters the "else if (OkToQueue)" section because no worker threads are available
  • before the Erlang thread reaches the line "inc_and_fetch(&m_WorkQueueAtomic);", all worker threads complete their current tasks and reach "m_Condition.Wait()" in the second block of code
  • the worker threads then have no stimulus to pick up the queued work task. The work task is only picked up after a subsequent task enters the hot threads pool and completes; that worker thread's completion actions will then notice something on the work queue and execute it.
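
One hypothetical way to widen that window enough to observe the failure (a sketch only, never shipped code) is a sleep() at the top of the queuing branch of Submit():

        /** hypothetical modification used only to widen the race window **/
        else if (OkToQueue)
        {
            // while this thread sleeps, every worker thread can finish its
            //  current task, see m_WorkQueueAtomic still at zero, and block
            //  in m_Condition.Wait() before the item is queued below
            sleep(5);

            item->m_QueueStart=Env::Default()->NowMicros();

            ....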

The previous fix had the independent, semaphore-based QueueThread also be notified of every item added to the work queue.
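
That earlier fix had roughly the following shape; this is a sketch of the pattern with illustrative names, not the removed code itself. The unnamed POSIX semaphore routines it relied upon (sem_init() and friends) are what Apple's compiler flags as deprecated.

        #include <semaphore.h>

        sem_t m_QueueSem;                    // initialized once with sem_init()

        // called by Submit() for every item placed on the backlog queue
        void NotifyQueueThread()
        {
            sem_post(&m_QueueSem);
        }   // NotifyQueueThread

        // body of the extra QueueThread:  it is guaranteed to hear about
        //  every queued item, so a posted task cannot be stranded
        void QueueThreadRoutine()
        {
            for (;;)
            {
                sem_wait(&m_QueueSem);       // one wake-up per posted item

                // pull one task from the backlog queue (if a regular worker
                //  has not already taken it) and execute it
            }
        }   // QueueThreadRoutine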

Proposed new fix

  • eliminate all code related to the QueueThread
  • in the Submit() routine (first code block), lock the thread-specific mutex of only the first worker thread

The first worker thread now replaces the QueueThread. The race condition still exists between the Erlang thread and all but the first worker thread. The Erlang thread and the first worker thread, however, can only access the work queue in a synchronized fashion. This guarantees that at least one thread, the first worker thread, will eventually see the work item when it finishes its current task. Yes, another worker thread might be able to process the task sooner. But synchronizing only one thread is far more efficient than synchronizing all threads to cover this possible, but unlikely, race condition.

Branch Description

util/hot_threads.h & .cc

All declarations and code related to the QueueThread class are deleted. There was a conditional block within HotThreadPool::Submit() that used the QueueThread class member. That conditional block (which started with "if (m_QueueThread.m_ThreadGood)") is now gone.

Instead of the QueueThread block in HotThreadPool::Submit(), the code now synchronizes with only the first thread in the pool by holding that thread's mutex while queuing the work item: "MutexLock lock(&m_Threads[0]->m_Mutex);". Holding this lock forces the two threads into one of two orderings upon its release: either the first worker thread has already marked itself available and entered its wait, in which case the FindWaitingThread() call that follows finds an available thread to wake, or the work item reaches the queue before the first worker thread performs its "is the queue empty" test, in which case that thread never waits. Either way the unmanaged race condition is eliminated.
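
Pieced together from the Submit() excerpt above and this description, the queuing branch now reads roughly as follows (a sketch; consult util/hot_threads.cc on this branch for the actual code):

        /** NOTE: this code executes on the Erlang scheduler threads **/
        else if (OkToQueue)
        {
            // Execution here means no worker thread is available
            //  and work task should be queued for first available
            {
                // the fix: hold the first worker thread's mutex while queuing
                //  so that thread either sees the item before it waits, or is
                //  already waiting and will be found below
                MutexLock lock(&m_Threads[0]->m_Mutex);

                item->m_QueueStart=Env::Default()->NowMicros();

                // no waiting threads, put on backlog queue
                {
                    SpinLock lock(&m_QueueLock);
                    inc_and_fetch(&m_WorkQueueAtomic);
                    m_WorkQueue.push_back(item);
                }
            }   // release of first worker thread's mutex

            // to address race condition, thread might be waiting now
            FindWaitingThread(NULL, true);
        }   // else if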
