
1.4.2patch


Status

  • merged to master
  • code complete October 9, 2013
  • development started October 4, 2013

History / Context

A Riak CS customer experienced performance problems with the leveldb backend. The data written to leveldb varied in size from 2Mbytes to 30Mbytes, and it was updated (rewritten) often. The leveldb LOG file showed frequent "waiting 2" messages indicating leveldb had stalled write operations waiting upon the Imm write buffer to be flushed.

The pain the user experienced was that timeouts were occurring, and there were fadvise and fdatasync error messages in the syslog.

This branch targets two independent problems:

  • PosixMmapFile, when used asynchronously by the recovery log, has a race condition related to close operations. This race caused the syslog error messages. It also implies the recovery log likely has minor corruption at the end of the file (which is only critical if the file is actually used in a database recovery).

  • The large key/value pairs, overwritten often, resulted in a high number of small level-0 files. leveldb handles this situation poorly, which caused the leveldb stalls / Riak timeouts in the production environment.

NOTE: This branch has not been "cleaned up". It intentionally contains dead code left behind by the changes, along with debug LOG messages and heavy debug syslog messages. The branch has not yet been tested in the customer environment, and any of the "messy" items might be needed to clarify further work. Cleaning will happen before the changes are applied to Riak 2.0 (and maybe a future 1.4 release).

Branch description

PosixMmapFile

The PosixMmapFile async problem was partially addressed in a previous patch. That patch corrected PosixMmapFile::UnmapCurrentRegion() to only use async unmapping if the file was a recovery log. The problem that remained (undetected) was that Env::Default()->Schedule() did not place sequential file operations on the same thread. It was therefore possible for an unmap-with-close operation to be processed before simple unmap operations that preceded it.

This branch adds logic to maintain a reference count of active objects related to a file. The PosixMmapFile object is one type of "active object"; a background BGCloseInfo object is the other. The reference count is malloc'd in memory and works similarly to a std::shared_ptr, except that std::shared_ptr is not guaranteed to be thread safe and this reference counter is. The object that decrements the reference counter to zero closes the file. This guarantees that all pending writes have completed (in any order).

The malloc'd ref_count_ is actually a 2-element uint64_t array. The first element is the reference count, which is incremented and decremented via atomic operations. The second element is the final file size, set upon the call to PosixMmapFile::Close(). The file is extended in chunks during write operations to facilitate the memory mapped operations, so the object that closes the file also adjusts the file size to match the actual count of bytes used.
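
Conceptually the scheme looks like the following sketch. This is an illustration only: the helper names AllocRefCount() and AddRef() are invented for this page, and the GCC __sync builtins stand in for the atomic wrappers in include/leveldb/atomics.h.

    #include <stdint.h>
    #include <stdlib.h>

    // Element 0: count of active objects (the PosixMmapFile plus any queued
    // background close/unmap work).  Element 1: final file size, filled in
    // by Close().
    static uint64_t* AllocRefCount() {
      uint64_t* ref = (uint64_t*)malloc(2 * sizeof(uint64_t));
      ref[0] = 1;   // the PosixMmapFile object holds the first reference
      ref[1] = 0;   // final size is unknown until Close()
      return ref;
    }

    static void AddRef(uint64_t* ref) {
      // taken before scheduling any background unmap/close work
      __sync_add_and_fetch(&ref[0], 1);
    }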

Files edited for this fix:

  • leveldb/include/leveldb/atomics.h: this file originated in eleveldb for the 1.3 release. It is also part of pending changes to the leveldb 2.0 release. While it is "new" to 1.4.2, it is an already reviewed, production file.

  • leveldb/util/env_posix.cc: logic changes were made to the PosixMmapFile class relating to the use of ref_count_. There was previously a mess of four special-purpose background task functions: BGFileCloser(), BGFileCloser2(), BGFileUnmapper(), and BGFileUnmapper2(). The mess is now reduced to one general-purpose BGFileUnmapper2() function. The asynchronous close is contained within a new function, ReleaseRef(), which manages ref_count_ atomically and performs the file resize / close when the count reaches zero (sketched below).
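
The decrement side can be pictured as below. The exact signature of ReleaseRef() in the branch may differ; this sketch only shows the intended behavior: whichever object drops the count to zero truncates the file to the size recorded by Close() and then closes the descriptor, so the close can never run ahead of a still-pending unmap.

    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void ReleaseRef(uint64_t* ref, int fd) {
      if (0 == __sync_sub_and_fetch(&ref[0], 1)) {
        // last reference: trim the chunk-extended file back to the byte
        // count recorded by Close(), then close and free
        if (0 != ref[1])
          ftruncate(fd, ref[1]);
        close(fd);
        free(ref);
      }
    }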

Imm stalls

This is a complex problem, and much of it involves poor threading in leveldb. A redesign of the threading for Riak 2.0 was already awaiting coding time, so a portion of the intended Riak 2.0 implementation is part of the solution here. There are also a couple of non-threading changes that became obvious from the customer's LOG data.

Changes:

  1. Completely disable "seek" based compactions. This was partially accomplished in Riak 1.2. The portion that was left in place caused a large number of non-beneficial compactions, especially at Level-0. All compactions compete for disk time with recovery log writes and level-0 file creation writes. (db/version_set.cc Version::UpdateStats())

  2. Copy the eleveldb_threads source files into leveldb, renamed hot_thread.h/.cc and thread_tasks.h/.cc. This is a thread pool design that was found to be extremely effective / efficient within the Erlang VM environment. The pool's threads are equally available across all vnodes, unlike the current 1.4.2 design, which places all vnodes into one of five "blocks" where single-purpose threads are shared by all vnodes of the block. The customer environment has 51 vnodes, so 1.4.2 has an average of 10 databases (vnodes) competing for one level-0 file creation thread. Now 7 threads (tunable) are equally available to all vnodes. (A minimal sketch of this style of pool appears after this list.)

  3. Adjust the file pruning logic to operate properly with simultaneous compactions. The original Google logic calls DBImpl::DeleteObsoleteFiles() after every compaction. This was fine when a given database (vnode) never had more than one compaction completing at a time; now multiple compactions can complete simultaneously. DBImpl::running_compactions_ now tracks how many compactions are running at a given time, and DBImpl::mutex_ is always held whenever the variable is accessed or updated. DeleteObsoleteFiles() only executes when there is exactly one active compaction. (A sketch of this gate appears after this list.)

  4. Adjust the manifest maintenance logic to operate properly with simultaneous compactions. The original Google logic locked DBImpl::mutex_, created a final Version object with global file state, unlocked mutex_, wrote the update to the manifest, locked mutex_ again, and finally installed the new Version object. With simultaneous compactions, the unlock of mutex_ to write the manifest update allows another compaction to create its final Version object without the file state of the compaction writing to the manifest. The result is the loss of the first compaction's state and files. VersionSet::LogAndApply() now changes the sequence of operations so that the first compaction's Version object is installed before mutex_ is released for the file write; its state is therefore part of the next compaction's global state information. (A sketch of the reordering appears after this list.)

  5. Remove the logic that lets Imm writes (level-0 file creation) interrupt other compactions. At some point Google hacked the compaction code to allow a full write buffer (the DBImpl::imm_ object) to interrupt a running compaction and write its data to a new level-0 file. Imm writes are now asynchronous to executing compactions: a database (vnode) can start an Imm write at any time without concern for already executing compactions.
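
For change 2, the pool can be pictured as a single shared task queue serviced by N worker threads, so work from any vnode lands on whichever worker is free. The sketch below shows only that shape; the class and member names are invented for this page, and the actual hot_thread / thread_tasks code is more elaborate.

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class PoolSketch {
     public:
      explicit PoolSketch(std::size_t threads) {
        for (std::size_t i = 0; i < threads; ++i)
          workers_.emplace_back([this] { Run(); });
      }

      ~PoolSketch() {
        { std::lock_guard<std::mutex> guard(mu_); stop_ = true; }
        cv_.notify_all();
        for (std::thread& t : workers_) t.join();
      }

      // Any vnode may submit work; there is no per-vnode "block" of threads.
      void Submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> guard(mu_); queue_.push(std::move(task)); }
        cv_.notify_one();
      }

     private:
      void Run() {
        for (;;) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
            if (stop_ && queue_.empty()) return;
            task = std::move(queue_.front());
            queue_.pop();
          }
          task();  // e.g. a level-0 file creation or compaction for any vnode
        }
      }

      std::vector<std::thread> workers_;
      std::queue<std::function<void()> > queue_;
      std::mutex mu_;
      std::condition_variable cv_;
      bool stop_ = false;
    };

For example, PoolSketch pool(7); gives every vnode equal access to seven workers, which is the shape of the tunable 7-thread pool described above.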
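
For change 3, the pruning gate boils down to a counter protected by the database mutex. This is a simplified sketch with invented names (DBState, PruneObsoleteFiles, FinishCompactionSketch); in the branch the counter is DBImpl::running_compactions_ and the prune call is DBImpl::DeleteObsoleteFiles().

    #include <mutex>

    struct DBState {
      std::mutex mutex_;             // stands in for DBImpl::mutex_
      int running_compactions_ = 0;  // stands in for DBImpl::running_compactions_
    };

    // stand-in for DBImpl::DeleteObsoleteFiles(): remove files no live
    // Version references
    static void PruneObsoleteFiles(DBState*) { /* disk scan elided */ }

    static void FinishCompactionSketch(DBState* db) {
      std::lock_guard<std::mutex> guard(db->mutex_);
      // Prune only when this is the sole compaction still registered;
      // otherwise another running compaction could lose files it is reading.
      if (1 == db->running_compactions_)
        PruneObsoleteFiles(db);
      --db->running_compactions_;
    }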
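
For change 4, the essential reordering is: install the new Version while mutex_ is still held, then drop the lock for the slow manifest write. The sketch below uses invented types (VersionSketch, VersionSetSketch, WriteManifest) to show only that ordering; it is not the VersionSet::LogAndApply() code as committed.

    #include <mutex>
    #include <string>
    #include <vector>

    struct VersionSketch {
      std::vector<std::string> files;  // global file state for this version
    };

    struct VersionSetSketch {
      std::mutex mutex_;       // stands in for DBImpl::mutex_
      VersionSketch current_;  // the installed, globally visible Version

      void LogAndApply(const VersionSketch& v) {
        std::unique_lock<std::mutex> lock(mutex_);
        current_ = v;      // install first, while still holding the lock, so a
                           // concurrently finishing compaction builds its own
                           // Version on top of this state
        lock.unlock();     // only now release the lock...
        WriteManifest(v);  // ...for the slow manifest file write
      }

      static void WriteManifest(const VersionSketch&) { /* disk I/O elided */ }
    };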
