mv repair by level

Matthew Von-Maszewski edited this page May 17, 2013 · 12 revisions

Status

  • moved to "master" May 17, 2013
  • developer tested code checked-in May 9, 2013, level 6 repair postponed to reduce initial complexity
  • development started May 8, 2013

History / Context

Google's code

Google's original repair logic performed the following actions:

  • delete existing manifest
  • create list of all .sst files in the database directory
  • open and validate each .sst file
  • move bad .sst files to the lost/ directory
  • create new manifest with remaining good .sst files, all .sst files at level 0
  • begin merge compaction of all .sst files upon next open of the database

The above process is technically sound and has none of the edge cases discussed with the hacks below. However, the simultaneous merging of all .sst files is prohibitively slow. One customer had 78,000 .sst files (~140 GB); the merge compaction was timed while running and projected to take six weeks to complete.
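The original flow can be sketched as follows. This is a hedged illustration in Python, not leveldb's actual C++ repair code; the function name and the `validate` callback are invented for the sketch:

```python
def original_repair(sst_files, validate):
    """Rebuild a manifest from scratch: every readable .sst goes to level 0."""
    good, lost = [], []
    for f in sst_files:
        # Files that fail validation are moved aside (to the lost/ directory).
        (good if validate(f) else lost).append(f)
    # New manifest places all surviving files at level 0, so the next open of
    # the database triggers one merge compaction across every surviving file.
    manifest = {0: sorted(good)}
    return manifest, lost
```

Placing everything at level 0 is what makes the approach correct but slow: the follow-on compaction must merge the entire database at once.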

Hack #1

The first performance hack was released in Riak 1.2.1. This hack limited the number of .sst files that could participate in a merge compaction. The file limit was based upon the Options::max_open_files parameter. The compaction merge was therefore chunked instead of processed all at once.

The chunking did create edge cases where newer versions of a key could be hidden by an older version. The hiding could be temporary, i.e. until all pre-repair .sst files were compacted to level 1 or higher. The hiding could be permanent (or last a very long time) if the newer version happened to be compacted twice before the older version was involved in a compaction containing both values.

In the 78,000-file example, the customer's max_open_files was 200. Chunking allowed the complete compaction to finish within eleven hours, which the customer barely tolerated but which was much better than six weeks. Riak validates data across three instances of a key (each instance in a different database). Therefore, if the edge case occurred, the other two instances (from outside the repaired database) would overrule the older value returned from the repaired database, i.e. no data loss.
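The chunking in Hack #1 amounts to splitting the level-0 backlog into bounded batches. A hedged sketch, with the batch size standing in for Options::max_open_files (illustrative, not the actual Riak leveldb code):

```python
def chunked_compactions(level0_files, max_open_files):
    """Split the post-repair level-0 backlog into merge-compaction batches,
    each containing at most max_open_files .sst files."""
    return [level0_files[i:i + max_open_files]
            for i in range(0, len(level0_files), max_open_files)]
```

With 78,000 files and max_open_files of 200, this yields 390 smaller merge compactions instead of one enormous one; the trade-off is the key-shadowing edge case described above, because batches are merged independently.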

Hack #2

Google's original algorithm destroys the manifest. Repair assumes the worst case: that the manifest does not accurately represent the .sst files. (This worst case was not uncommon, because older Google versions had a bug that made the manifest easy to corrupt.) Without the manifest there is no information about which level a surviving .sst file belonged to. This is why the original repair logic was forced to perform a merge compaction across all surviving .sst files.

Riak 1.3 began maintaining the .sst files in per-level directories. This created a hierarchy of .sst files and their leveldb "levels" that survives the destruction of the manifest file. A scan of the directory structure is copied into the replacement manifest with the .sst level hierarchy intact. This hack eliminated all merge compactions during repair.

Riak 1.3 was able to migrate 1.2 files into the directory structure quickly: the 78,000-file database reorganized in eight seconds the first time Riak 1.3 opened the 1.2 database, with no ongoing cost. The manifest itself is blind to the directory structure. This allows a simple rollback from 1.3 to 1.2 by moving files out of the level-based directories back into the database's root directory.
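Rebuilding the manifest from per-level directories can be sketched as below. The "sst_0" ... "sst_6" directory names are an assumption of this sketch, not necessarily the exact naming Riak uses:

```python
def manifest_from_level_dirs(listing):
    """listing maps a level directory name (assumed form "sst_<level>")
    to the .sst files found inside it."""
    manifest = {}
    for dirname, files in listing.items():
        level = int(dirname.split("_")[1])  # "sst_3" -> level 3
        manifest[level] = sorted(files)     # level hierarchy preserved intact
    return manifest
```

Because each file's level is recovered from its directory, no file needs to be demoted to level 0, and no merge compaction is required after repair.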

Hack #1's edge cases go away. However, a new edge case appears, and it can arise in two ways: the leveldb process crashes and leaves two .sst files with overlapping key ranges in the same level directory.

Example 1: a compaction is combining newer values from level 1 with older values in level 2, creating replacement .sst files in level 2. A leveldb process crash can leave valid .sst files in level 2 that include both the older .sst files and their intended replacements. Hack #2's simple inventory, which records every .sst file found in directory 2 as a level 2 file, then produces an invalid manifest.

Example 2 is basically the same, but involves a long-running iteration that prevents deletion of files that were recently inputs to a compaction and are scheduled for deletion upon the iteration's completion. A process crash leaves them in place.

Hack #2's edge case is again covered by Riak's redundancy, and additionally by Riak 1.3's new Active Anti-Entropy feature. The edge case was only identified theoretically after the 1.3 release; there are no known customer encounters of it. But it must be addressed, and it is the motivation for this branch, mv-repair-by-level.

Branch Description

The mv-repair-by-level branch adds a step to the existing repair process. The step follows all existing logic that inventories the .sst files and generates a new manifest. The step then:

  • opens the database
  • scans each level, from lowest to highest, omitting levels that allow overlapping .sst files
  • compares the .sst files within the level in a pairwise N * (N-1) loop to see if any key ranges overlap
    • overlapping .sst files are submitted by key range to existing "manual compaction" routines
    • a global flag in the Options structure informs the NewIterator code to treat the normally non-overlapping level as overlapping
  • level 6 (the highest) is a special case: overlapping files are moved to level 5 and the entire repair call is repeated.
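The pairwise overlap scan at the heart of the new step can be sketched as follows. This is a hedged illustration; the tuple layout and function name are invented, and the real code works from each table's smallest/largest keys as recorded in the manifest:

```python
def find_overlaps(files):
    """files: list of (name, smallest_key, largest_key) for one level that is
    normally non-overlapping. Returns pairs whose key ranges intersect; each
    pair's combined range would be handed to a manual compaction."""
    overlaps = []
    for i, (n1, lo1, hi1) in enumerate(files):
        for n2, lo2, hi2 in files[i + 1:]:
            if lo1 <= hi2 and lo2 <= hi1:   # standard interval-intersection test
                overlaps.append((n1, n2))
    return overlaps
```

If this scan returns no pairs, the level already satisfies the sorted, non-overlapping invariant and repair can move on to the next level.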