Repair Notes

Background - Riak 1.2.0 and before

Google's original repair logic works as follows:

Close the database (a Riak vnode)
Execute the leveldb repair API
- Process recovery log, if it exists, creating a new .sst table file
- Delete the existing MANIFEST file
- Read through every .sst table file to check for errors (move files with errors to the lost/ subdirectory)
- Create new MANIFEST listing all valid .sst table files in level 0 (zero)
Open the database
- The normal compaction logic detects more than four level 0 files and starts a compaction. The compaction typically includes EVERY .sst table file. All .sst table files are scanned as part of a multi-file merge. The result is a new set of .sst table files in level 1 and beyond.
- NOTE: leveldb blocks all new API Write operations to the database if there are more than 12 level 0 .sst table files. The block ends once the first, mass compaction of level 0 files completes.
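
The steps above can be sketched as a small simulation. This is an illustration only, not leveldb's code: `repair_manifest`, `level0_status`, and the `is_readable` callback are hypothetical names, and the two thresholds are simply the numbers quoted in the list above.

```python
# Thresholds as quoted in the notes above (assumed, not read from leveldb source).
COMPACTION_TRIGGER = 4   # more than four level 0 files starts a compaction
WRITE_STOP_TRIGGER = 12  # more than 12 level 0 files blocks new writes


def repair_manifest(table_files, is_readable):
    """Rebuild a manifest the way the original repair does.

    Every readable table file is listed at level 0; unreadable files are
    set aside (the real repair moves them to the lost/ subdirectory).
    """
    manifest, lost = [], []
    for name in sorted(table_files):
        if is_readable(name):
            manifest.append((0, name))  # the level is always 0 after repair
        else:
            lost.append(name)
    return manifest, lost


def level0_status(manifest):
    """Report what the database does on open, given the rebuilt manifest."""
    count = sum(1 for level, _ in manifest if level == 0)
    return {
        "level0_files": count,
        "compaction_needed": count > COMPACTION_TRIGGER,
        "writes_blocked": count > WRITE_STOP_TRIGGER,
    }


# Hypothetical example: 14 table files, one of them unreadable.
files = [f"{i:06d}.sst" for i in range(1, 15)]
manifest, lost = repair_manifest(files, lambda name: name != "000003.sst")
status = level0_status(manifest)
```

With 13 surviving files all placed at level 0, both the compaction trigger and the write block apply as soon as the database opens.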

Basho's attention was drawn to the repair process when a customer with 78,000 .sst table files (~490 gigabytes of data) needed to execute the repair. One of our developers examined the process and realized the database was going to take 6 weeks to repair. Our customer could not wait 6 weeks.

Repair in Riak 1.2.1

In the customer example with 78,000 .sst table files, the original Google multi-merge compaction does the following:

Scans the current key (next key) of each of the 78,000 files to find the solitary next key in sequence.
Writes the next key to a new .sst table file.
Repeats.

Each scan of the 78,000 files requires not only 78,000 key comparisons for every key written, but also opening and closing many (most) of the .sst table files, because there is not enough memory to keep all of the files open simultaneously. The process was going to take 6 weeks because of the extreme number of comparisons and the file system thrashing.

Riak 1.2.1 adjusted the multi-merge compaction logic. The adjustment limited the number of files in a multi-merge compaction. The max_open_files limit (minus 10) became the "chunking size" for all multi-merge compactions. In the customer example, there were many multi-merge compactions of 200 simultaneous files instead of one multi-merge compaction of 78,000 files. All compactions completed within 11 hours.

Riak-1.2.0 and before	Riak-1.2.1
6 weeks	11 hours
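
A minimal sketch of the chunking idea, assuming each table is a sorted list of (key, sequence, value) tuples where a higher sequence number means a newer write. The names `merge_tables` and `chunked_merge` are hypothetical; this is not the Riak 1.2.1 implementation, only an illustration of merging in bounded groups.

```python
import heapq
import math


def merge_tables(tables):
    """Multi-way merge that keeps only the newest entry for each key."""
    # Order by key, then newest sequence first, so the first entry seen
    # for a key is the one that survives.
    merged = heapq.merge(*tables, key=lambda entry: (entry[0], -entry[1]))
    out = []
    for key, seq, value in merged:
        if not out or out[-1][0] != key:
            out.append((key, seq, value))
    return out


def chunked_merge(tables, chunk_size):
    """Merge at most chunk_size tables at a time instead of all at once."""
    return [
        merge_tables(tables[i:i + chunk_size])
        for i in range(0, len(tables), chunk_size)
    ]


tables = [
    [("a", 1, "old"), ("c", 1, "x")],
    [("a", 2, "new"), ("b", 2, "y")],
    [("b", 3, "z")],
]
single = merge_tables(tables)        # one merge across every table
chunks = chunked_merge(tables, 2)    # two smaller merges

# Customer example from the text: 78,000 files in chunks of 200.
customer_chunks = math.ceil(78_000 / 200)  # 390 separate compactions
```

Note that the chunked result still contains two versions of key "b", one per chunk. Each merge only has to track a bounded number of open files, at the cost of leaving some duplicates for a later compaction.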

Edge case: Google's original process guaranteed that only the most current key/value pair survived the multi-merge compaction. Riak 1.2.1 creates the opportunity for an older version of a key/value pair to hide the current (newest) key/value pair. A future compaction would correct the problem, but that correction may happen hours or days after the repair. This edge case is not likely, since the .sst table files tend to be processed from oldest to newest, but it does exist.
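
One way to picture the hiding, as a hedged sketch: assume a lookup searches levels from lowest to highest and stops at the first level that holds the key, and assume the chunk containing the older pair happened to be written to a lower-numbered level than the chunk containing the newer pair. The `lookup` and `compact` helpers below are hypothetical and model each level as a plain dictionary.

```python
def lookup(levels, key):
    """Return the first (sequence, value) found, searching level 0 upward."""
    for table in levels:
        if key in table:
            return table[key]
    return None


def compact(upper, lower):
    """Merge two levels, keeping the highest sequence number per key."""
    merged = dict(lower)
    for key, (seq, value) in upper.items():
        if key not in merged or seq > merged[key][0]:
            merged[key] = (seq, value)
    return merged


# Level 1 holds the OLD pair, level 2 holds the NEW pair (the bad ordering).
levels = [{}, {"k": (1, "old")}, {"k": (2, "new")}]
stale = lookup(levels, "k")          # the old pair hides the new one

# A later compaction of level 1 into level 2 removes the stale version.
levels = [{}, {}, compact(levels[1], levels[2])]
fixed = lookup(levels, "k")
```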

Repair in Riak 1.3

Google's original repair process assumes the MANIFEST file, which tracks the "level" of each .sst table file, is corrupt. The repair process throws away the MANIFEST file and must default to the idea that all .sst table files are now at level 0. Riak 1.3 makes a subtle change that preserves the "level" to .sst table file relationship beyond the destruction of the MANIFEST file.

Riak 1.3 changed the directory structure of a leveldb database (a Riak vnode). Google's original design placed all database files, including all .sst table files, in one directory. Riak 1.3 creates subdirectories for the .sst table files: sst_0, sst_1, sst_2, sst_3, sst_4, sst_5, and sst_6. The .sst table files now exist in the subdirectory of their "level". Therefore the repair process can create a new MANIFEST file that maintains the "level" to .sst table file relationship.
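
The directory layout makes the level of each file recoverable from the file system alone. A minimal sketch, assuming table files end in .sst and live directly inside the sst_N subdirectories; `levels_from_directories` is a hypothetical helper, not part of leveldb or Riak.

```python
import re
from pathlib import Path

LEVEL_DIR = re.compile(r"sst_(\d+)")


def levels_from_directories(db_path):
    """Map each level number to the table files found in its subdirectory."""
    levels = {}
    for entry in Path(db_path).iterdir():
        match = LEVEL_DIR.fullmatch(entry.name)
        if entry.is_dir() and match:
            level = int(match.group(1))
            # The directory name alone tells us the level; no MANIFEST needed.
            levels[level] = sorted(p.name for p in entry.glob("*.sst"))
    return dict(sorted(levels.items()))
```

A repair that starts from this mapping can write each file back into the new MANIFEST at its original level rather than at level 0.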

The entire repair and multi-merge compaction process now completes in minutes. Typically there are no multi-merge compactions. Also, leveldb typically does not have to block new Write operations, since there are no longer large numbers of level 0 files to compact.

Riak-1.2.0 and before	Riak-1.2.1	Riak-1.3
6 weeks	11 hours	minutes

The Edge Case discussed for the Riak 1.2.1 release no longer exists. The older key/values remain in the higher levels and the newer key/values remain in the lower levels. There is no hiding created by chunking of the multi-merge compaction since there is no multi-merge compaction of key/values that used to be on different levels.

Edge case: The Riak 1.3 repair process has a similar key/value hiding potential, but for a completely different reason. Transient .sst table files could exist within the 1.3 leveldb subdirectories (sst_0, sst_1, etc.) at the time of a Riak / leveldb crash. The Riak 1.3 repair process blindly assumes they must be valid members of the level and places them in the reconstructed MANIFEST. These transient files could create key/value hiding.

Side effect 1: Using physical subdirectories for leveldb's logical levels creates the opportunity for a hot backup. A backup technology needs to image the base directory first, then the sst_* subdirectories in numerical order. "tar" and "rsync" work this way. The restore operation will need to include an execution of repair to clean up issues with transient files (see the Riak 1.4 notes below).
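
The copy order described above can be sketched as follows. This is a hypothetical helper (`hot_backup`), not a Basho tool. It assumes the destination does not exist yet and deliberately fails if it does, so that files from an earlier copy are never mixed with the new ones.

```python
import re
import shutil
from pathlib import Path


def hot_backup(db_path, dest_path):
    """Copy the base directory files first, then sst_N in numerical order."""
    src, dest = Path(db_path), Path(dest_path)
    dest.mkdir(parents=True)  # raises FileExistsError on a reused destination
    copied = []

    # 1. Files in the base directory (MANIFEST, logs, and so on).
    for item in sorted(p for p in src.iterdir() if p.is_file()):
        shutil.copy2(item, dest / item.name)
        copied.append(item.name)

    # 2. Level subdirectories, lowest level number first.
    level_dirs = sorted(
        (p for p in src.iterdir()
         if p.is_dir() and re.fullmatch(r"sst_\d+", p.name)),
        key=lambda p: int(p.name[len("sst_"):]),
    )
    for level_dir in level_dirs:
        shutil.copytree(level_dir, dest / level_dir.name)
        copied.append(level_dir.name)
    return copied
```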

Side effect 2: It is possible for an administrator to manually create links from the sst_* subdirectories to different storage devices, potentially of different access speeds.

Repair in Riak 1.4

The Riak 1.4 repair process extends the Riak 1.3 repair process:

adds a test/fix for transient files, and
holds the Riak "vnode" offline during any clean-up compactions.

The former addresses the edge case for the Riak 1.3 repair via level-specific multi-merge compactions of transient files (files with overlapping keys). The latter prevents Riak from sending new writes to the database (vnode) before the repair-initiated compactions complete, i.e. it prevents Riak from encountering any leveldb Write blockage.
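
A hedged sketch of how overlapping files within one level might be detected. It assumes each file is described by its smallest and largest key and that files in the level being checked are expected to have disjoint key ranges. `overlapping_groups` is a hypothetical helper and does not reflect the actual Riak 1.4 code.

```python
def overlapping_groups(files):
    """Group files in one level whose [smallest, largest] key ranges overlap.

    files is a list of (name, smallest_key, largest_key) tuples. Each
    returned group would need a merge compaction within that level.
    """
    ordered = sorted(files, key=lambda f: f[1])
    groups, current, current_max = [], [], None
    for name, smallest, largest in ordered:
        if current and smallest <= current_max:
            # This file starts inside the range covered so far.
            current.append(name)
            current_max = max(current_max, largest)
        else:
            if len(current) > 1:
                groups.append(current)
            current, current_max = [name], largest
    if len(current) > 1:
        groups.append(current)
    return groups


level_files = [
    ("000010.sst", "a", "f"),
    ("000011.sst", "g", "m"),
    ("000042.sst", "e", "h"),  # a transient file straddling two neighbours
    ("000012.sst", "n", "z"),
]
to_compact = overlapping_groups(level_files)
```

Here the transient file pulls its two neighbours into a single group, while the last file is left alone.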

Riak-1.2.0 and before	Riak-1.2.1	Riak-1.3 and Riak-1.4
6 weeks	11 hours	minutes

What will still fail

There is a use case where Riak's repair will fail while Google's would succeed (after running for a few weeks). Suppose a user has one leveldb database that they copy to a remote server via rsync or tar. A few weeks later the user repeats the same copy to the remote server. But this second time, the user fails to first delete all the existing .sst table files from the remote server. There now exists the potential for both new and old .sst table files on the remote server. Some of the old table files might have older versions of keys that hide newer versions of the same key. The only fix with Riak's repair is to start over by deleting everything on the remote server and copying again.

Repair from within Riak

Notes on how to start a leveldb repair from within Riak are found here:

https://gist.github.com/gburd/b88aee6da7fee81dc036
