
mv expiry fallback

Matthew Von-Maszewski edited this page Feb 2, 2016 · 7 revisions

Status

  • merged to master -
  • code complete - February 1, 2016
  • development started - January 25, 2016

History / Context

Basho intends to extend Google's leveldb key encoding to enable multi-featured data expiry. The new key encoding will not be recognizable by Google's original code. A newly encoded key will appear as data corruption to Google's original code and not return its value. This creates an implied data loss scenario should the customer be required to rollback from an expiry enabled release to a prior non-expiry enabled release.

Basho will provide a reformat tool that is capable of processing all leveldb .sst table files to remove the new expiry key encoding. But said tool will require a customer to take their systems offline and potentially process terabytes of data. Most production environments are very time sensitive and will likely already be under pressure if something has occurred that requires a rollback.

This branch creates a safety net. It does not create new expiry keys. But it can process the keys and properly return their data. In some instances, it will also reformat the expiry keys back to normal keys. The reformatting is a side effect. There is no active action to reformat keys.

Once this branch is established in a stable customer release, then Basho will be able to safely deploy the new expiry features knowing the customer has a safety net.

Branch Description

Three areas of Google's leveldb need rollback protection: key operations/comparisons, MANIFEST file, and recovery log files. The code changes below address those areas.

db/db_impl.cc

The InternalKey constructor within db/dbformat.h was intentionally changed from three parameters to four. The new fourth parameter is "expiry". The change was NOT necessary for this branch. However, making the change forced attention to all places that use the InternalKey object. So the change was a proxy for verifying the implied impact of expiry's future change throughout the code base.

The ParsedInternalKey constructor has the same change, for the same reason. It does not appear in db/db_impl.cc.

The changes to InternalKey constructor calls in db/db_impl.cc add zero, "0", as their second parameter and have no other impact.

db/db_iter.cc

FindNextUserEntry() treats the two new key encodings, kTypeValueWriteTime and kTypeValueExplicitExpiry, as equivalent to kTypeValue for processing. This equivalence will not hold once the true expiry features arrive, but it is appropriate in a rollback scenario.

As explained under db/db_impl.cc, the ParsedInternalKey constructor has a new fourth member for expiry. It is not functional at this time. The change merely highlights code that might be impacted by the expiry change and/or rollback.

db/db_test.cc

Only changes to InternalKey constructor as previously discussed. No new tests.

db/dbformat.cc and db/dbformat.h

These two files contain the core of the expiry encoding changes.

db/dbformat.h contains the additions to the enum ValueType. The two new values are descriptive, but lack descriptive comments. This is because the actual implementation and usage will not be open source initially. Therefore design details are not part of this open source file.

The kValueTypeForSeek constant has implementation comments. The comments are valid for this release due to the changes in db/dbformat.cc's InternalKeyComparator::Compare() routine. The comments may or may not be valid once full expiry functionality arrives.

ExtractValueType() now uses the DecodeLeastFixed64() routine from util/coding.h. The new decode routine "knows" the value is stored in little-endian order and extracts only the byte needed for the ValueType. Previously the code decoded the entire 64-bit value, then performed a bitwise AND to isolate the low-order byte.

KeySuffixSize() is a new routine. Google's code is hard coded in many places to "know" that the stored key has an 8 byte suffix containing non-key metadata. The expiry feature will extend that size to 16 bytes for an expiry key. The new routine defines the appropriate suffix size per key type.
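The two routines above can be illustrated with a minimal, self-contained sketch. It is not the branch's code: the numeric enum values, the helper names ending in "Sketch", the MakeInternalKey() helper, and the assumption that an expiry key stores its extra 8 bytes immediately before the usual 8 byte sequence/type trailer are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Enumerator names come from the page; the numeric values are ASSUMED.
enum ValueType : uint8_t {
  kTypeDeletion = 0x0,
  kTypeValue = 0x1,
  kTypeValueWriteTime = 0x2,
  kTypeValueExplicitExpiry = 0x3
};

// Append a 64-bit value in little-endian byte order.
void AppendFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// The least significant byte of a little-endian fixed64 is its first byte
// in memory, so no full 64-bit decode is needed to read it.
inline uint8_t LeastByteOfFixed64(const char* ptr) {
  return static_cast<uint8_t>(*ptr);
}

// The trailing 8 bytes hold (sequence << 8) | type.
inline ValueType ExtractValueTypeSketch(const std::string& internal_key) {
  assert(internal_key.size() >= 8);
  const char* trailer = internal_key.data() + internal_key.size() - 8;
  return static_cast<ValueType>(LeastByteOfFixed64(trailer));
}

// Suffix size per key type: 16 bytes for expiry keys, 8 bytes otherwise.
inline size_t KeySuffixSizeSketch(ValueType type) {
  switch (type) {
    case kTypeValueWriteTime:
    case kTypeValueExplicitExpiry:
      return 16;
    default:
      return 8;
  }
}

// Hypothetical helper: user key + optional expiry (ASSUMED layout) + trailer.
std::string MakeInternalKey(const std::string& user_key, uint64_t expiry,
                            uint64_t sequence, ValueType type) {
  std::string out = user_key;
  if (KeySuffixSizeSketch(type) == 16) AppendFixed64(&out, expiry);
  AppendFixed64(&out, (sequence << 8) | type);
  return out;
}

// Strip whichever suffix the key type implies to recover the user key.
std::string ExtractUserKeySketch(const std::string& internal_key) {
  size_t suffix = KeySuffixSizeSketch(ExtractValueTypeSketch(internal_key));
  assert(internal_key.size() >= suffix);
  return internal_key.substr(0, internal_key.size() - suffix);
}
```

The point of the sketch is that code which previously subtracted a hard coded 8 from the key length can instead ask for the suffix size of the key's type, so both legacy and expiry keys yield the correct user key.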

db/dbformat.cc

Updates to constructors for new expiry parameter. No new code.

db/memtable.cc

Simple updates for the changes in db/dbformat.h and db/dbformat.cc. No new code.

db/version_edit.cc

db/version_edit.cc reads and writes leveldb's MANIFEST file. This file is essential to leveldb's operation. It must be compatible between revisions.

EncodeTo() is updated to support a new, expiry-enabled .sst file description. The code is not necessary in this branch for rollback. It is present to enable proper unit tests of DecodeFrom(), which is essential to rollback. EncodeTo() produces legacy MANIFEST file entries if its new second parameter is false (the default), and expiry-enhanced file entries if that parameter is true.

DecodeFrom() supports a new MANIFEST "record type" of kNewFile2 (i.e. the second version of kNewFile). This code is a copy/paste of the kNewFile case, followed by reads of three additional fields anticipated for the expiry feature. Again, the intended usage of expiry1, expiry2, and expiry3 is left to the non-open source code.
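The relationship between the two record types can be sketched as follows. This is a simplified stand-in, not the code in db/version_edit.cc: the tag values, the FileMetaDataSketch struct, the EncodeNewFile()/DecodeNewFile() names, the use of plain strings for the smallest and largest keys, and the choice of varint encoding for the three expiry fields are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Tag values are ASSUMED for illustration only.
enum Tag : uint32_t { kNewFile = 7, kNewFile2 = 11 };

struct FileMetaDataSketch {
  uint64_t number = 0, file_size = 0;
  std::string smallest, largest;                   // plain strings in this sketch
  uint64_t expiry1 = 0, expiry2 = 0, expiry3 = 0;  // meaning left unspecified
};

void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>(v | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

bool GetVarint64(const std::string& src, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) { *out = result; return true; }
  }
  return false;
}

void PutLengthPrefixed(std::string* dst, const std::string& s) {
  PutVarint64(dst, s.size());
  dst->append(s);
}

bool GetLengthPrefixed(const std::string& src, size_t* pos, std::string* out) {
  uint64_t len = 0;
  if (!GetVarint64(src, pos, &len) || len > src.size() - *pos) return false;
  out->assign(src, *pos, static_cast<size_t>(len));
  *pos += static_cast<size_t>(len);
  return true;
}

// format2 == false writes the legacy record; true adds the expiry fields.
void EncodeNewFile(std::string* dst, int level, const FileMetaDataSketch& f,
                   bool format2 = false) {
  PutVarint64(dst, format2 ? kNewFile2 : kNewFile);
  PutVarint64(dst, static_cast<uint64_t>(level));
  PutVarint64(dst, f.number);
  PutVarint64(dst, f.file_size);
  PutLengthPrefixed(dst, f.smallest);
  PutLengthPrefixed(dst, f.largest);
  if (format2) {
    PutVarint64(dst, f.expiry1);
    PutVarint64(dst, f.expiry2);
    PutVarint64(dst, f.expiry3);
  }
}

// Accepts both record types; kNewFile2 reads the same fields plus three more.
bool DecodeNewFile(const std::string& src, int* level, FileMetaDataSketch* f) {
  size_t pos = 0;
  uint64_t tag = 0, lvl = 0;
  if (!GetVarint64(src, &pos, &tag)) return false;
  if (tag != kNewFile && tag != kNewFile2) return false;
  if (!GetVarint64(src, &pos, &lvl) ||
      !GetVarint64(src, &pos, &f->number) ||
      !GetVarint64(src, &pos, &f->file_size) ||
      !GetLengthPrefixed(src, &pos, &f->smallest) ||
      !GetLengthPrefixed(src, &pos, &f->largest)) {
    return false;
  }
  *level = static_cast<int>(lvl);
  if (tag == kNewFile2 &&
      (!GetVarint64(src, &pos, &f->expiry1) ||
       !GetVarint64(src, &pos, &f->expiry2) ||
       !GetVarint64(src, &pos, &f->expiry3))) {
    return false;
  }
  return pos == src.size();
}
```

A decoder of this shape is what makes rollback safe for the MANIFEST: a release that never writes the second record type can still read one left behind by a newer release.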

db/version_edit.h

The three expiry fields in FileMetaData are not needed for this release. They are added now to enable easy unit test cases.

db/version_edit_test.cc

The new EncodeDecodeExpiry test is a duplicate of the existing EncodeDecode test, but uses kTypeValueExplicitExpiry instead of kTypeValue. The helper function TestEncodeDecode now supports both tests via the new "bool format2" switch.

db/version_set.cc

Contains changes for the InternalKey constructor's change in parameters. Also simplifies the SaveValue() routine to assume all key types except kTypeDeletion are valid. This is true for the rollback scenario, but will not be true once the expiry feature exists.

db/version_set_test.cc

Updates for constructor parameter counts, no new code.

db/write_batch.cc and include/leveldb/write_batch.h

The WriteBatch format is used within the leveldb recovery log. The recovery log is used both as a crash protection and as a simplified method of rapid shutdown. It is normal for the recovery log to exist and have needed key/values upon database open.

The concern is that a user who downgrades might have valid, expiry-enhanced data in the recovery log. Therefore the write batch is adjusted in this branch to enable successful data recovery within older leveldb.

WriteBatch::Iterate() identifies kTypeValueWriteTime and kTypeValueExplicitExpiry keys, then translates them into normal kTypeValue Put() calls.
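The translation step can be sketched as below. This is an illustration under stated assumptions rather than the code in db/write_batch.cc: the batch is modeled as a vector of already parsed records instead of the real byte encoding, and RecordSketch, IterateSketch(), MapHandler, and the numeric enum values are hypothetical names and values.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Enumerator names come from the page; the numeric values are ASSUMED.
enum ValueType : uint8_t {
  kTypeDeletion = 0x0,
  kTypeValue = 0x1,
  kTypeValueWriteTime = 0x2,
  kTypeValueExplicitExpiry = 0x3
};

// Hypothetical parsed form of one batch record.
struct RecordSketch {
  ValueType type;
  std::string key;
  std::string value;
  uint64_t expiry;  // only meaningful for the two expiry types
};

// Same shape as a put/delete handler interface.
class Handler {
 public:
  virtual ~Handler() {}
  virtual void Put(const std::string& key, const std::string& value) = 0;
  virtual void Delete(const std::string& key) = 0;
};

// Replays a batch. Expiry records are downgraded to plain Put() calls,
// which drops the expiry field and keeps the key and value.
bool IterateSketch(const std::vector<RecordSketch>& batch, Handler* handler) {
  for (const RecordSketch& r : batch) {
    switch (r.type) {
      case kTypeValue:
      case kTypeValueWriteTime:
      case kTypeValueExplicitExpiry:
        handler->Put(r.key, r.value);
        break;
      case kTypeDeletion:
        handler->Delete(r.key);
        break;
      default:
        return false;  // unknown tag: treat as corruption
    }
  }
  return true;
}

// Simple in-memory handler used to observe the replayed result.
class MapHandler : public Handler {
 public:
  std::map<std::string, std::string> data;
  void Put(const std::string& key, const std::string& value) override {
    data[key] = value;
  }
  void Delete(const std::string& key) override { data.erase(key); }
};
```

Replaying a recovery log through a loop of this shape means the data written by a newer release survives the downgrade, while the expiry metadata is discarded as a side effect.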

The new PutWriteTime() and PutExplicitExpiry() calls are not required for this fallback release. They exist for unit testing.

db/write_batch_test.cc

New test case MultipleExpiry copies the previous test, but uses PutExplicitExpiry() instead of Put(). The test is designed to verify batches will automatically adjust back to simple Put() calls when reprocessed.

table/table_test.cc

Updates the ParsedInternalKey constructor calls for the added parameter.

util/coding.h

Adds a simple function, DecodeLeastFixed64(), to grab the ValueType byte out of the encoded sequence number.
