key format

Matthew Von-Maszewski edited this page Jul 18, 2016 · 6 revisions

Terminology

  • user key: the key passed from user code to leveldb as part of a write, delete, or get operation
  • internal key: leveldb's internal representation of the user key, which includes additional leveldb metadata

Google's original internal key

Google's original leveldb internal keys comprise four components:

  • total internal key size
  • user's binary key
  • 7 byte sequence number
  • 1 byte type code

The internal key size includes the user's binary key, sequence number, and type code.

The leveldb code combines the sequence number and type code into an 8 byte unsigned integer for storage. The sequence number is left shifted 8 bits, then the type code is OR'd into the least significant byte.

The type code has one of two values: kTypeDeletion (0) or kTypeValue (1). kTypeDeletion indicates the internal key is deleted; such a record is also known as a tombstone. kTypeValue indicates the internal key is an active record and the associated value is available to the user.
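The packing described above can be sketched in a few lines of Python. This is an illustration only, not leveldb's C++ code: the function names and the `MAX_SEQUENCE` constant are hypothetical, chosen for this example.

```python
from typing import Tuple

# Type codes as listed on this page.
K_TYPE_DELETION = 0
K_TYPE_VALUE = 1

# Largest value that fits in a 7 byte (56 bit) sequence number.
# Hypothetical constant name, used only in this sketch.
MAX_SEQUENCE = (1 << 56) - 1


def pack_sequence_and_type(sequence: int, type_code: int) -> int:
    """Combine a 7 byte sequence number and a 1 byte type code into one 64 bit integer."""
    if not 0 <= sequence <= MAX_SEQUENCE:
        raise ValueError("sequence number does not fit in 7 bytes")
    if not 0 <= type_code <= 0xFF:
        raise ValueError("type code does not fit in 1 byte")
    # Shift the sequence number left 8 bits, then OR the type code
    # into the least significant byte.
    return (sequence << 8) | type_code


def unpack_sequence_and_type(packed: int) -> Tuple[int, int]:
    """Split the 64 bit integer back into (sequence, type_code)."""
    return packed >> 8, packed & 0xFF
```

For example, sequence number 5 with kTypeValue packs to `0x501`, and unpacking `0x501` yields `(5, 1)` again.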

The sequence number is a zero-based counter that increases by one for each record (key/value pair) written to the database. A delete operation is logically the same as a regular write, so each delete also increases the sequence number. There is no per-key sequence number; the sequence number is per database (per Riak vnode). It is therefore theoretically possible to corrupt a database by writing more than 2^56 records (the capacity of the 7 byte sequence number), causing the sequence number to overflow.

The user never updates a given key/value pair in place. Instead, the user writes a new copy of the same key with an updated value. leveldb later distinguishes between different revisions of the same user key by selecting the record whose internal key has the highest sequence number.
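The selection rule can be illustrated with a small Python sketch. It models records as plain tuples in a list, which is an assumption made for clarity; leveldb itself relies on sorted storage and iterators rather than a linear scan, and the `lookup` function and `Record` alias are hypothetical names.

```python
from typing import List, Optional, Tuple

K_TYPE_DELETION = 0
K_TYPE_VALUE = 1

# Hypothetical record shape for this sketch:
# (user_key, sequence, type_code, value)
Record = Tuple[bytes, int, int, bytes]


def lookup(records: List[Record], user_key: bytes) -> Optional[bytes]:
    """Return the value of the newest revision of user_key, or None."""
    candidates = [r for r in records if r[0] == user_key]
    if not candidates:
        return None  # key was never written
    # The revision with the highest sequence number wins.
    newest = max(candidates, key=lambda r: r[1])
    if newest[2] == K_TYPE_DELETION:
        return None  # newest revision is a tombstone
    return newest[3]


records = [
    (b"color", 0, K_TYPE_VALUE, b"red"),
    (b"size", 1, K_TYPE_VALUE, b"large"),
    (b"color", 2, K_TYPE_VALUE, b"blue"),   # newer revision of "color"
    (b"size", 3, K_TYPE_DELETION, b""),     # delete of "size"
]
```

With these records, `lookup(records, b"color")` selects the revision with sequence number 2, and `lookup(records, b"size")` finds a tombstone as the newest revision and reports the key as absent.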

Basho's extended internal key

In 2016, Basho extended the leveldb internal key to support object expiry. The extension adjusts Google's original key design in two ways:

  • the type code supports two additional values: kTypeValueWriteTime (2) and kTypeValueExplicitExpiry (3)
  • internal keys using one of these two new type codes have an additional 8 byte value related to expiry

Specifically, keys using one of the two new type codes have the following components:

  • total internal key size
  • user's binary key
  • 8 byte expiry timestamp (milliseconds since epoch, UTC based)
  • 7 byte sequence number
  • 1 byte type code
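As a rough illustration of this layout, the Python sketch below appends the expiry timestamp and the packed sequence/type value to a user key, then splits them apart again. Several details are assumptions not stated on this page: the little-endian byte order, the omission of the leading size field, and the function names `encode_expiry_key` and `decode_expiry_key`, which are hypothetical.

```python
import struct
from typing import Tuple

K_TYPE_VALUE_WRITE_TIME = 2
K_TYPE_VALUE_EXPLICIT_EXPIRY = 3

EXPIRY_TYPES = (K_TYPE_VALUE_WRITE_TIME, K_TYPE_VALUE_EXPLICIT_EXPIRY)
TRAILER_SIZE = 16  # 8 byte expiry + 7 byte sequence + 1 byte type code


def encode_expiry_key(user_key: bytes, expiry_ms: int,
                      sequence: int, type_code: int) -> bytes:
    """Build user key + expiry + sequence/type (size prefix omitted in this sketch)."""
    if type_code not in EXPIRY_TYPES:
        raise ValueError("only the two expiry type codes carry a timestamp")
    if not 0 <= sequence < (1 << 56):
        raise ValueError("sequence number does not fit in 7 bytes")
    packed = (sequence << 8) | type_code
    # "<QQ" = two unsigned 64 bit integers; little-endian is an assumption.
    return user_key + struct.pack("<QQ", expiry_ms, packed)


def decode_expiry_key(internal_key: bytes) -> Tuple[bytes, int, int, int]:
    """Return (user_key, expiry_ms, sequence, type_code)."""
    if len(internal_key) < TRAILER_SIZE:
        raise ValueError("internal key too short for an expiry trailer")
    user_key = internal_key[:-TRAILER_SIZE]
    expiry_ms, packed = struct.unpack("<QQ", internal_key[-TRAILER_SIZE:])
    return user_key, expiry_ms, packed >> 8, packed & 0xFF
```

A key encoded this way is always 16 bytes longer than the user key, compared with 8 bytes for the original format, so a reader needs the type code to know which layout it is looking at.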

Keys that contain the new type codes and their expiry values receive special handling during Get, Iterate, and compaction processes. Usage details are here:
