
mv hot threads

Status

  • merged to master November 15, 2013
  • code complete November 6, 2013
  • development started October 4, 2013

History / Context

The initial branch, mv-hot-threads1, contains the second of two fixes first released in the 1.4.2-turner branch. A second branch, mv-hot-threads2, contains the full implementation discussed here. Both branches have been merged to master.

Google's original leveldb contained one thread for background compactions and performed writes to the recovery log from the user's thread. Google's leveldb stalls the user's thread during a write operation whenever the background compactions cannot keep up with the ingest rate. Riak runs multiple leveldb databases (vnodes). This compounded the Google stalls such that a stall could hold multiple user write threads for minutes. Riak has several error conditions based upon 60 second timeouts, and the stalls were long enough to trigger those failure cases. A few of the failure cases were cascading in nature. A series of incremental hacks to the original leveldb to eliminate stalls began in April of 2012.

The success of each hack was judged by two criteria: did a stall occur, and what was the total time to ingest 500 million keys with 1024 bytes of data each. The second criterion extended over time to 750 million, then 1 billion, and currently 2 billion keys for the Riak 2.0 release. Beginning with the Riak 1.2 release, each hack was known not to stall on the testing platforms and to produce an incremental improvement to the second criterion (increased total throughput).

The Basho leveldb hacks included:

  • add a second background thread for writing recovery data, thereby not writing to disk on user thread
  • add a third thread specialized in writing memory to level-0 files to shortcut higher level compaction blockage that might create stalls
  • add a fourth thread specialized in merging level-0 files to level-1 to again shortcut higher level compaction blockage that might create stalls
  • create five leveldb Env objects, each with four background threads, since compaction is largely CPU bound between CRC calculation and Snappy compression, i.e. now 20 background compaction/write threads (5 * 4 = 20)
  • create tiered locks across the five Env objects to give disk I/O priority to level-0 file creation and level-1 merging respectively
  • predict disk write rate and amount of compaction backlog so as to throttle each Write operation proportionally to prevent a stall scenario
  • prioritize the backlog queues for level-0 and general compactions to do the most critical compactions first

Each release starting with 1.2 successfully addressed both criteria for the scenarios known at the time. Yes, new stall scenarios occurred after 1.2, and later releases addressed those stalls while providing incremental throughput improvements. But the resulting code in util/env_posix.cc and db/db_impl.cc looked pathetic: it worked, but was not a source of pride. Additionally, it became obvious that any contested mutex (regular or read/write) caused degraded performance, likely because the thread swap gave Erlang a chance to spin wait (this theory is not proven).

The development cycle for Riak 2.0 offered time to replace the incremental hacks with a fully engineered solution. The solution is hot threads. Hot threads require the user thread to use atomic operations to find and start a waiting thread. This avoids mutex contention that could steal a thread's time slice. When all threads are busy, the background task is only added to a backlog queue if that task is not "grooming" related.

Hot threads is an uncommon design pattern that works quite well under the Erlang VM (uncommon: we have never seen anyone use this design pattern before, but doubt it is unique / original). Basho's leveldb uses the hot threads pattern to allow:

  • simultaneous compaction of multiple levels within a single database (vnode)
  • simple characterization of required versus grooming compactions
  • dedicated, specialized thread pools that are shared by all databases (vnodes)
  • more lenient application of write throttle resulting in faster ingest rates
  • removal of the PrioritizeWork() function in db_impl.cc (a Basho specific routine)

Notes

Grooming versus Non Grooming Compactions

Google's compaction decisions are largely "this must compact now" decisions. This is not optimal in a production environment. One of the worst consequences of this decision logic occurs when levels reach their maximum size. Say a file compacts into level 3, which is now full. This causes a compaction into level 4, which is now full. This causes a compaction into level 5, which is now full. This causes a compaction into level 6. This cascading compaction condition is permanent. Almost every future compaction will cause this cascade.

Basho's leveldb categorizes compactions as grooming or non-grooming, and adds a suggested, lower size threshold for each level in addition to Google's hard limit. A grooming compaction occurs whenever the lower threshold is crossed and there is available compaction capacity. The compaction cascade still happens, but it is typically delayed while compaction activity is already high. This provides a cushion in compaction capacity and disk load.
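The two-threshold idea can be sketched as follows. This is a minimal illustration only; the names and types are assumptions, not the actual basho/leveldb identifiers or limits.

```cpp
#include <stdint.h>

// Hypothetical classification of a level, assuming a Basho-style "suggested"
// threshold below Google's hard limit (names are illustrative only).
enum CompactionType { kNoCompaction, kGroomingCompaction, kRequiredCompaction };

CompactionType ClassifyLevel(uint64_t level_bytes,
                             uint64_t suggested_limit,  // lower, grooming threshold
                             uint64_t hard_limit) {     // Google's original limit
  if (level_bytes >= hard_limit)
    return kRequiredCompaction;   // must compact now; may be queued if threads are busy
  if (level_bytes >= suggested_limit)
    return kGroomingCompaction;   // runs only if a worker thread is idle
  return kNoCompaction;
}
```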

Branch description

Much of the hot-threads code originated in the basho/eleveldb code base in the January-February 2013 time frame. Source files and individual classes copied from basho/eleveldb were modified to better suit the leveldb usage. Later, it is likely that eleveldb will shift to using the classes within basho/leveldb to promote code reuse and ease maintenance.

include/leveldb/atomics.h

The contents of this file come from basho/eleveldb's c_src/detail.hpp. It provides platform independent wrappers for various atomic functions used in the hot-threads implementation. Initially, only Solaris and gcc (Linux/OSX) environments needed support. The need to support other environments could arise in the future.

The gcc compiler's atomics implementation lends itself well to C++ templates covering many data types. Solaris's implementation does not, which is why there are many type-specific implementations.
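A rough illustration of the wrapper style is below. It is a sketch only: the actual contents and names in include/leveldb/atomics.h differ, and only an add-and-fetch operation is shown.

```cpp
#include <stdint.h>

#if defined(__sun)
#include <atomic.h>
// Solaris: type-specific wrappers around the libc atomic_*_nv routines.
inline uint32_t add_and_fetch(volatile uint32_t* ptr, uint32_t val) {
  return atomic_add_32_nv(ptr, val);
}
inline uint64_t add_and_fetch(volatile uint64_t* ptr, uint64_t val) {
  return atomic_add_64_nv(ptr, val);
}
#else
// gcc / clang (Linux, OSX): a single template covers the integral types.
template <typename T>
inline T add_and_fetch(volatile T* ptr, T val) {
  return __sync_add_and_fetch(ptr, val);
}
#endif
```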

There was discussion about using the C++ extension classes for atomic operations. This was rejected due to lack of knowledge concerning the level of implementation / support within our customer installations. It is likely a decision worth revisiting in the future.

util/env_posix.cc

This source file was the home for most of the Basho threading/compaction hacks. It therefore has seen the largest number of changes. Most of the changes remove code that is no longer necessary now that hot-threads is implemented. Some liberty was taken in removing Google classes that Basho has not used for a long time, such as PosixMmapReadableFile.

The PosixMmapFile class was cleaned up to correct a race condition in the close operation. The initial hacks to this class for background writes were written early, when only one Env object existed. Later, multiple Env objects existed and it was possible for one file's write and/or close operations to be spread across multiple Env objects. This led to files closing before all data was written. A previous patch removed that potential from compaction threads, but left the potential for writes to the recovery and info logs.

The PosixMmapFile class now contains the ref_count_ member variable. The variable starts with 1 upon creation of the object. The creation of a background write object increments ref_count_. The completion of a background write object decrements ref_count_. A file close operation decrements ref_count_ to remove the starting 1 count from object initialization. Whichever operation, background write or foreground close, causes the ref_count_ to reach zero will actually close the file handle. The closing operation is wrapped in a new function ReleaseRef().
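The scheme can be sketched as below, assuming an atomic increment/decrement of ref_count_ (the gcc builtins stand in for whatever the real code uses); member and method names other than ref_count_ and ReleaseRef() are illustrative, not the actual util/env_posix.cc code.

```cpp
#include <stdint.h>

class MmapFileSketch {
 public:
  MmapFileSketch() : ref_count_(1) {}             // object creation holds one reference

  void OnBackgroundWriteScheduled() { __sync_add_and_fetch(&ref_count_, 1); }
  void OnBackgroundWriteComplete()  { ReleaseRef(); }
  void Close()                      { ReleaseRef(); }  // drops the initial reference

 private:
  void ReleaseRef() {
    // whichever path (background write or foreground close) drops the count
    // to zero actually closes the file handle
    if (0 == __sync_sub_and_fetch(&ref_count_, 1))
      CloseFileHandle();
  }

  void CloseFileHandle() { /* munmap / close(fd) work would happen here */ }

  volatile uint32_t ref_count_;
};
```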

Previous Basho releases had used four slight variations of essentially the same background function: BGFileCloser, BGFileCloser2, BGFileUnmapper, and BGFileUnmapper2. The use of ReleaseRef() to clean up close operations allowed deletion of the first 3 variations.

The Env::Schedule() routine had become the home for many of the Basho compaction scheduling hacks. All of those are now gone. Env::Schedule() is now a simple wrapper that creates a thread-task object and submits it to a hot-thread pool.

db/db_impl.h

Two member variables of the DBImpl class are now "volatile": DBImpl::imm_ and DBImpl::manual_compaction_. The volatile qualifier in C/C++ directs the compiler not to optimize accesses to the variable: the value could be changed by another thread, so every access must reload from memory rather than reuse a value cached in a register. The volatile change is necessary to support multiple, simultaneous compactions within the same DBImpl database (vnode).
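As a minimal illustration only (the exact declarations and qualifier placement in db/db_impl.h may differ), making the pointers themselves volatile is what forces the reload:

```cpp
class MemTable;          // forward declarations for illustration
struct ManualCompaction;

class DBImplMembersSketch {
  // the pointers themselves are volatile, so each read reloads from memory
  // instead of reusing a value the optimizer cached in a register
  MemTable* volatile imm_;                        // may be changed by another thread
  ManualCompaction* volatile manual_compaction_;  // may be changed by another thread
};
```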

db/db_impl.cc

Google's compaction logic calls DBImpl::DeleteObsoleteFiles() toward the end of every compaction cycle. The routine compares the active set of files to the files that exist on disk. Any "inactive" files get deleted. With simultaneous compactions, some of the "inactive" files might be the work-in-progress of another thread's compaction. Now this routine only executes if its thread is the only thread running a compaction for this DBImpl object.

Basho's version of DBImpl::MaybeScheduleCompaction() had become a convoluted piece of code trying to evaluate the current most important compaction compared to one that might already be scheduled, but not executing. The new thread pools eliminated all need for the convoluted logic. Now there are only two cases: the normal running case and manual compactions submitted via unit tests.

The DBImpl class contains the methods called by background threads to perform and manage compactions. The calling parameters change slightly with hot-threads. New methods DBImpl::BackgroundImmCompactCall() and DBImpl::BackgroundCall2() replace the previous compaction entry point, DBImpl::BackgroundCall(). The new entry points allow for slightly different calling parameters and the management of simultaneous compactions within the same DBImpl object. The DBImpl::running_compactions_ variable counts the number of active compactions so that DBImpl::~DBImpl() knows when it is safe to close the database and destroy the database object.
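A hedged sketch of this bookkeeping, combined with the DeleteObsoleteFiles() guard described above, might look like the following; the real db_impl.cc coordinates through the DBImpl mutex and its condition variable, and the helper names here are assumptions.

```cpp
#include <stdint.h>

class DBImplSketch {
 public:
  DBImplSketch() : running_compactions_(0) {}

  // rough shape of a background compaction entry point
  void BackgroundCompactionEntry() {
    __sync_add_and_fetch(&running_compactions_, 1);

    DoOneCompaction();

    // sweep for obsolete files only when no other compaction is in flight,
    // so another thread's work-in-progress files are never deleted
    if (1 == running_compactions_)
      DeleteObsoleteFiles();

    __sync_sub_and_fetch(&running_compactions_, 1);
  }

  ~DBImplSketch() {
    // destruction waits until every active compaction has finished
    while (0 != running_compactions_)
      ;  // the real code waits/sleeps rather than spinning
  }

 private:
  void DoOneCompaction()     { /* compaction work */ }
  void DeleteObsoleteFiles() { /* compare active file set to files on disk */ }

  volatile uint32_t running_compactions_;
};
```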

DBImpl::DoCompactionWork() previously called a Basho specific function, DBImpl::PrioritizeWork(). PrioritizeWork() is no longer necessary. The function contained Google's code to potentially interrupt a compaction to flush a write buffer into a new Level-0 .sst file. It also contained Basho specific code to support tiered read/write locks that prioritized disk access (and thread time) for write buffer flushes and Level-0 to Level-1 compactions. Both of those prioritization duties are no longer necessary with hot-threads. The PrioritizeWork() function no longer exists, and its gThreadLock0 and gThreadLock1 global read/write locks are also gone.

db/version_set.cc

Version::UpdateStats() is an original Google routine that could initiate a compaction based solely upon how many times a given .sst file has been accessed. There are no comments as to the theoretical underpinnings of this compaction decision. Basho has seen these read based compactions cause disk overhead at inopportune times. This routine is therefore disabled until the theoretical basis is known and properly tuned based upon Basho / Riak usage patterns.

Basho's version of VersionSet::Finalize() had become the focal point of both compaction prioritization and write throttle backlog computation. The new thread pools (and other code adjustments) no longer limit a database to one compaction at a time. Therefore compaction prioritization becomes irrelevant: all needed compactions are simply scheduled. Finalize() is now called multiple times, allowing each pass to offer another compaction in a higher level. The only limitation is that a sorted level can only offer a compaction if neither the level above nor the level below has a pending compaction. Finalize()'s previous write throttle computation is now contained in an independent routine, VersionSet::UpdatePenalty().

VersionSet::Finalize() categorizes all compactions as either "grooming" or "not grooming". "Not grooming" compactions mirror Google's original decision points for a compaction: the number of files on unsorted levels or the total bytes on sorted levels. "Grooming" compactions represent Basho's suggested compactions to help overall throughput. "Grooming" compactions never get queued: they either execute immediately or are ignored. (Soon, the new mv-aggressive-delete branch will also add grooming compactions for .sst table files that contain a large number of deleted objects.)

VersionSet::PickCompaction() is the primary user of VersionSet::Finalize(). It is upgraded to call Finalize() in a loop for as long as Finalize() has another compaction to offer. PickCompaction() submits each offered compaction to the appropriate hot-threads pool. As mentioned previously, the pool must have an available thread for a "grooming" compaction; otherwise the grooming compaction is dropped, not queued.
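A hypothetical illustration of that control flow is below. The stand-in interfaces and function names are not the real PickCompaction() signature or pool API; the sketch only shows the loop and the grooming drop described above.

```cpp
struct Compaction { bool is_grooming; };

struct CompactionSource {               // stands in for VersionSet::Finalize()
  virtual Compaction* NextOffered() = 0;
  virtual ~CompactionSource() {}
};

struct CompactionPool {                 // stands in for a hot-threads pool
  virtual bool SubmitIfIdleWorker(Compaction* c) = 0;  // grooming path: run now or fail
  virtual void Submit(Compaction* c) = 0;              // required path: may queue
  virtual ~CompactionPool() {}
};

void ScheduleOfferedCompactions(CompactionSource& source, CompactionPool& pool) {
  // keep asking as long as another level has a compaction to offer
  while (Compaction* c = source.NextOffered()) {
    if (c->is_grooming) {
      // grooming compactions are never queued: run immediately or drop
      if (!pool.SubmitIfIdleWorker(c))
        delete c;
    } else {
      pool.Submit(c);                   // required compactions may wait in the backlog
    }
  }
}
```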

The VersionSet::LogAndApply() manifest management sequence had to change to support simultaneous compactions. Google's original code was perfectly conservative. The original code did not update the internal manifest until the disk based manifest successfully completed. And since disk operations can block, the global management mutex was released during the disk manifest update. This conservative code would not work when two compactions might be completing simultaneously. It was possible for the second compaction to update the internal manifest first, then the first compaction to later update the manifest such that the second was forgotten. Now the internal manifest updates first in the order of compaction completion. Writes to the external disk manifest follow, and are not subject to ordering issues.

port/port_posix.cc

Basho adds the Spin class as an alternative to the Mutex class. The Spin class is based upon the pthread_spinlock_t type. This pthread object is not available on all platforms. Where it is not available, the Spin class is simply an alias for the existing Mutex class. Basho uses the Spin class where multiple threads are likely to simultaneously change multivariable objects, but only need exclusive access for a short period. The performance cost of multiple spin-waiting threads is less than the cost of the contending threads losing their timeslice to other Erlang threads.
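A sketch of the idea, assuming the POSIX spinlock API is present; the real port/port_posix.cc wrapper and its fallback path differ in detail.

```cpp
#include <pthread.h>
#include <unistd.h>   // _POSIX_SPIN_LOCKS feature macro

#if defined(_POSIX_SPIN_LOCKS) && _POSIX_SPIN_LOCKS > 0
class SpinSketch {
 public:
  SpinSketch()  { pthread_spin_init(&spin_, PTHREAD_PROCESS_PRIVATE); }
  ~SpinSketch() { pthread_spin_destroy(&spin_); }

  void Lock()   { pthread_spin_lock(&spin_); }    // busy-waits instead of sleeping
  void Unlock() { pthread_spin_unlock(&spin_); }

 private:
  pthread_spinlock_t spin_;
};
#else
// platforms without pthread_spinlock_t (e.g. OSX) would simply alias Mutex:
// typedef Mutex SpinSketch;
#endif
```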

util/mutexlock.h

The SpinLock class here parallels the Spin class in port_posix.cc. SpinLock has the same methods as MutexLock and can therefore be a drop-in substitute in existing code.

util/hot_threads.cc

The HotThreadPool class originated in the basho/eleveldb project as the eleveldb_thread_pool class. The core class methods are unchanged.

The core idea is simple: contested mutexes are highly inefficient for leveldb within the Erlang VM environment. The traditional design of a single work queue that is watched by many threads was extremely slow in distributing work. Our guess is that threads waiting on a mutex lost not only their current timeslice, but were also delayed in getting a new timeslice once the mutex was available. The hot-threads design uses atomic operations to have each requesting thread attempt to find an available worker thread directly. If a worker is available, the requesting thread passes the task to the worker thread directly … "a hot exchange". The traditional backlog queue is only used when all worker threads are proven to be busy.
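A simplified sketch of that hot exchange follows. Every name here is illustrative; the real util/hot_threads.cc classes carry more state, use a wait/wake object per worker, and guard the backlog queue with a lock of their own.

```cpp
#include <deque>
#include <vector>

struct WorkItem {                    // stand-in for the thread-task functor
  virtual void operator()() = 0;
  virtual ~WorkItem() {}
};

struct Worker {
  volatile int available;            // 1 when parked and waiting for work
  WorkItem* volatile task;           // handed over by the requesting thread
  // the per-worker wait/wake object is omitted from this sketch
};

class PoolSketch {
 public:
  // called from the requesting (user) thread
  void Submit(WorkItem* item, bool is_grooming) {
    for (size_t i = 0; i < workers_.size(); ++i) {
      Worker& w = *workers_[i];
      // atomically claim an idle worker: only one submitter can flip 1 -> 0
      if (w.available && __sync_bool_compare_and_swap(&w.available, 1, 0)) {
        w.task = item;               // the "hot exchange": hand the task over directly
        WakeWorker(w);
        return;
      }
    }
    // every worker is proven busy: required work queues, grooming work is dropped
    if (!is_grooming)
      backlog_.push_back(item);      // the real code protects this queue with a lock
    else
      delete item;
  }

 private:
  void WakeWorker(Worker&) { /* signal the worker's semaphore / condvar */ }

  std::vector<Worker*> workers_;
  std::deque<WorkItem*> backlog_;
};
```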

HotThread::ThreadRoutine() contains a race condition fix not present in the basho/eleveldb project. A Basho developer noted a potential race condition where work objects placed on the backlog queue might never be picked up by a thread. The solution was to add an additional thread based upon a semaphore synchronization object. The semaphore-based thread loops across the backlog queue as many times as objects were placed in the queue, ensuring none is forgotten. The backlog objects may be processed by either the hot threads or the semaphore thread.

Not all implementations support anonymous semaphores, e.g. OSX. Therefore the code uses anonymous semaphores by default and switches to named semaphores when creation of an anonymous semaphore fails.
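A hedged sketch of that fallback, assuming the standard POSIX semaphore calls; the naming scheme and error handling in util/hot_threads.cc are not shown here.

```cpp
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

// returns a usable semaphore, preferring an anonymous one
sem_t* CreateBacklogSemaphore(sem_t* storage) {
  // try an anonymous (unnamed) semaphore first
  if (0 == sem_init(storage, 0 /* not shared across processes */, 0))
    return storage;

  // OSX rejects sem_init; fall back to a named semaphore
  char name[64];
  snprintf(name, sizeof(name), "/hot_threads_%d", (int)getpid());  // hypothetical name
  sem_t* named = sem_open(name, O_CREAT | O_EXCL, 0644, 0);
  if (SEM_FAILED == named)
    return NULL;
  sem_unlink(name);   // drop the name so the semaphore vanishes once closed
  return named;
}
```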

util/thread_tasks.h

ThreadTask is the work object managed by the HotThreadPool class. It is a C++ functor. Derived classes of ThreadTask route execution control to background entry points such as DBImpl::BackgroundCall2 and legacy routines formerly called via Google's Env::Schedule().
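The shape of the hierarchy can be illustrated as below; the stand-in types and the compaction task shown here are assumptions rather than the actual contents of util/thread_tasks.h.

```cpp
struct DatabaseLike {                      // stand-in for DBImpl
  virtual void BackgroundCall2() = 0;      // background compaction entry point
  virtual ~DatabaseLike() {}
};

class ThreadTaskSketch {                   // the work object handed to a hot thread
 public:
  virtual ~ThreadTaskSketch() {}
  virtual void operator()() = 0;           // functor: invoked by the worker thread
};

// hypothetical derived task routing control back into the database object
class CompactionTaskSketch : public ThreadTaskSketch {
 public:
  explicit CompactionTaskSketch(DatabaseLike* db) : db_(db) {}
  virtual void operator()() { db_->BackgroundCall2(); }
 private:
  DatabaseLike* db_;
};
```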
