pp skiplist sequential insert

Status

merged to master - September 7, 2015
code complete - August 30, 2015
development started - August 17, 2015

History / Context

Riak has some known special-case workloads (e.g., time series and handoff) where records are inserted in sequential order, that is, where the (N+1)st record that is inserted into leveldb has a key value larger than the Nth record.

When a new record is added to leveldb, it is first added to an in-memory sorted list; a skip list data structure is used to store the sorted records in memory, and the skip list is implemented in the SkipList template class. A skip list is simply a sorted linked list with some additional bookkeeping that allows O(log N) lookups, similar in performance to a balanced binary search tree but without the complexity of tree rebalancing. The leveldb SkipList class is implemented so that it uses non-blocking techniques to perform lookups, although insertions must still be synchronized.

The current implementation of the SkipList class is general purpose in that it does not provide any optimization for sequentially-inserted records. All lookups in a SkipList start at the head of the list and traverse the list until the desired record (or insertion point) is found. Finding the insertion point for a record in a skip list does not require examining every record prior to the insertion point, but the process still starts at the head of the list, comparing the key to be inserted to log(N) records (on average), where N is the number of records in the skip list.

This branch implements additional bookkeeping members in the SkipList class that optimize the sequential insertion use case without penalizing the general use case. This branch also adds several new unit tests for the SkipList class, along with a new method in the SkipList class that can be called from unit tests to thoroughly validate the structure of the SkipList object.

One other benefit of the work in this branch is to add a slight optimization for the general case of inserting into a SkipList in random order. The Insert() method in the current implementation of the SkipList class calls the FindGreaterOrEqual() method to find the insertion point for the new record; the original author added the comment "TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()". This branch creates the barrier-free variant of FindGreaterOrEqual() and calls it from the Insert() method, and all insertions into a SkipList object benefit from this change, not just sequential insertions.

Branch Description

.gitignore

Added the sst_rewrite executable to the .gitignore file so that the "git status" command does not show sst_rewrite as an untracked file. This change is not related to the SkipList changes, but rather it is some git bookkeeping clean up that was inadvertently omitted in previous work.

db/skiplist.h

Added data members to track the tail node. This includes a pointer to the last record in the SkipList object (tail_) along with the collection of nodes that point to the last record (tailPrev_ array and tailHeight_ integer). The tailPrev_ pointers are required to quickly update the skip list bookkeeping when a new record is inserted after the current tail node; otherwise we would have to walk the list to determine which nodes point at the new node, defeating the intent of this optimization.

Also added the flag sequentialInsertMode_, which indicates whether or not we are in sequential insert mode. We start a new SkipList object in sequential insert mode, and we track the tail node so long as we are seeing only sequential inserts; once we have a non-sequential insert, we stop tracking the tail node and revert to the previous behavior of walking the list from head of the list.

Added the NoBarrier_FindGreaterOrEqual() method, a barrier-free variant of FindGreaterOrEqual(). This new method is called only from the Insert() method, whose access is externally synchronized (which is why we can use barrier free calls when traversing the list structure). This new method starts by checking the sequentialInsertMode_ flag to see if we should compare the new key being inserted to the tail node in the list. If not, then the logic is exactly as before (except for calling the Node::NoBarrier_Next() method in the lookup loop). If we are in sequential insert mode, then we compare the key-to-insert to the tail node; if the new key is larger than the tail node's key, then the node to be inserted will become the new tail node, and we return the tailPrev_ pointers to the Insert() method so that it can update the tail-bookkeeping data members.

Updated the Insert() method to be aware that we may be in sequential-insert mode and need to update the tail-bookkeeping data members after inserting the new node. The only tricky part is properly updating the tailPrev_ array, ensuring we determine the correct set of nodes that point to the new tail node. The reason this requires care is due to the fact that the new tail node may not be the same height as the previous tail node.

Also added the protected methods Valid() and DisableSequentialMode(). These are protected since they are intended for use in unit tests, where they can be exposed via a subclass of the SkipList class. The Valid() method first ensures that the list is properly sorted. Next it checks that each level in the skip list is sorted, and the number of nodes decreases (or at least does not increase) as we move up each level. If we're in sequential insert mode, it then checks the bookkeeping information for the tail pointer.

db/skiplist_test.cc

Added a subclass of the SkipList template class that exposes the Valid() and DisableSequentialInsertMode() methods. Changed all uses of the SkipList class to use the subclass. Added several unit tests that drive sequential inserts in various ways, checking validity as well as performing timings to compare sequential inserts with and without the tail pointer optimization. Also added test that does a bunch of sequential inserts and then switches to non-sequential inserts, just to ensure all works as expected.

Performance measurement note (by @matthewvon)

Used the following command line write 100 million sequential keys with and without this code change:

wbs=134217728; r=100000000; t=1; vs=800; cs=1048576; of=500000; ./db_bench --benchmarks=fillseq --num=$r --threads=$t --value_size=$vs --cache_size=$cs --bloom_bits=10 --open_files=$of --db=/var/db2/leveldb --compression_ratio=0.5 --write_buffer_size=$wbs --use_existing_db=0

Timing was consistently a 5% improvement with the new code. Old code was 7 min 31 seconds to 7min 33 seconds across 3 independent executions. New code was 7 min 2 seconds to 7 min 11 seconds across 3 independent executions.

Before each execution:

rm -rf /var/db2/leveldb/*
(as root) sudo echo 3 >/proc/sys/vm/drop_caches

Specs for machine used:

2 Processors: 12 physical cores / 24 logical cores total, 2.4Ghz
32 GBytes RAM
FusionIO ioDrive2 mounted at /var/db2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly