
mv bucket expiry2

Matthew Von-Maszewski edited this page Mar 16, 2017 · 27 revisions

Status

  • merged to master - March 16, 2017
  • code complete - February 1, 2017
  • development started - December 20, 2016

History / Context

This branch implements expiry by bucket types (Riak KV) and by table (Riak TS). The feature is part of Basho's enterprise edition products, not open source. Enterprise edition products require paid support. Therefore a substantial portion of this feature is isolated within the private leveldb_ee repository. Portions of the feature are also within Basho's eleveldb open source repository. A successful build requires the mv-bucket-expiry2 branch from all three repositories: basho/eleveldb, basho/leveldb, and basho/leveldb_ee. The term "this branch" refers to the collective set of code changes within all three repositories. The term "bucket expiry" refers to this feature as applied to Riak KV's bucket types, simple buckets, and Riak TS's tables.

This feature depends upon callback / service capabilities within Basho's Riak product. The callback support code is within the riak_core module, branch dr-th/service-poc.

Bucket expiry is an extension of Basho's global expiry. Global expiry has three control properties available within Riak's master configuration file riak.conf:

| Property name | Default | Usage |
|---|---|---|
| leveldb.expiration | off | "on" to enable the expiry subsystem, "off" to disable it |
| leveldb.expiration.retention_time | 0 (zero) | 0 to disable expiry based upon time since last write, or a duration from write time to expiration, or "unlimited" to mark records with expiry information but no time limit |
| leveldb.expiration.mode | whole_file | "whole_file" to let leveldb remove entire files of expired records without compaction, "normal" to require compaction processing to remove expired records |

Note: The leveldb.expiration property within riak.conf is a "master switch". It must be "on" to enable any bucket specific properties. leveldb ignores all bucket specific expiry properties if leveldb.expiration is set to "off" within riak.conf.

This branch allows each bucket to also have one or more of the above properties, customized for that bucket. Any properties omitted at the bucket level assume the property value within riak.conf, or the default if not set directly within riak.conf.
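The fallback rule above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the types ExpirySettings, Override, and BucketOverrides and the functions RiakConfDefaults() and Resolve() are hypothetical names, not the data structures used in leveldb_ee.

```cpp
#include <cstdint>

// Hypothetical container for the three expiry properties.
struct ExpirySettings {
    bool enabled;                // leveldb.expiration
    uint64_t retention_minutes;  // leveldb.expiration.retention_time
    bool whole_file;             // leveldb.expiration.mode
};

// Defaults taken from the property table: off, 0, whole_file.
ExpirySettings RiakConfDefaults() {
    ExpirySettings d = {false, 0, true};
    return d;
}

// A bucket level value that may or may not have been set.
template <typename T>
struct Override {
    bool is_set;
    T value;
    Override() : is_set(false), value() {}
    Override(const T& v) : is_set(true), value(v) {}
};

// Hypothetical set of bucket level properties; unset means "inherit".
struct BucketOverrides {
    Override<bool> enabled;
    Override<uint64_t> retention_minutes;
    Override<bool> whole_file;
};

// Combine the riak.conf settings with one bucket's overrides.
ExpirySettings Resolve(const ExpirySettings& global, const BucketOverrides& bucket) {
    ExpirySettings out = global;
    if (!global.enabled) {
        return out;  // master switch is off: bucket properties are ignored
    }
    if (bucket.enabled.is_set) out.enabled = bucket.enabled.value;
    if (bucket.retention_minutes.is_set) out.retention_minutes = bucket.retention_minutes.value;
    if (bucket.whole_file.is_set) out.whole_file = bucket.whole_file.value;
    return out;
}
```

A bucket that sets only a retention time inherits the other two values from riak.conf, and every bucket override is ignored while the master switch is "off".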

leveldb requests the bucket specific properties from Riak. leveldb holds the bucket specific properties within a cache for 5 minutes. After 5 minutes, leveldb erases its current copy of a bucket's properties and requests a fresh copy from Riak. This means that any user changes to bucket specific expiry properties may take up to 5 minutes to activate within leveldb.
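A minimal sketch of the five minute staleness test, assuming a cache entry that remembers when it was fetched. CachedProps, kPropCacheLifetime, and IsStale() are hypothetical names for illustration; the branch itself stores a time limit inside the cached expiry object.

```cpp
#include <chrono>

typedef std::chrono::steady_clock Clock;

// Hypothetical cache entry: only the fetch time matters for this sketch.
struct CachedProps {
    Clock::time_point fetched_at;
};

// Lifetime of a cached copy of a bucket's properties.
const std::chrono::minutes kPropCacheLifetime(5);

// True once the entry has been held for the full lifetime and
// a fresh copy should be requested from Riak.
bool IsStale(const CachedProps& entry, Clock::time_point now) {
    return now - entry.fetched_at >= kPropCacheLifetime;
}
```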

Three properties, Three names each

The table above gives the names of the properties as used within the riak.conf file. The property names are slightly different within leveldb/eleveldb and again slightly different when used with bucket type properties. Here is a cross reference:

| riak.conf | leveldb::ExpiryModuleOS | Riak Buckets |
|---|---|---|
| leveldb.expiration | expiry_enabled | expiration |
| leveldb.expiration.retention_time | expiry_minutes | default_time_to_live |
| leveldb.expiration.mode | whole_file_expiry | expiration_mode |

The first table above gave the general meaning of each property. The general meaning does not change in the other two locations. But some of the property values change slightly:

| leveldb::ExpiryModuleOS | Values |
|---|---|
| expiry_enabled | true / false |
| expiry_minutes | number of minutes |
| whole_file_expiry | true / false |

| Riak Buckets | Values |
|---|---|
| expiration | enabled / disabled |
| default_time_to_live | "unlimited" or a duration string |
| expiration_mode | "use_global_config" / "per_item" / "whole_file" |

A duration string consists of a series of one or more number/suffix combinations. Example: "2d7h32m" is 2 days, 7 hours, and 32 minutes. The code converts that example string to 3,332 minutes (2 x 1,440 + 7 x 60 + 32). The number must be a whole number, no decimal fractions. The valid suffixes are "f" (fortnight), "w" (week), "d" (day), "h" (hour), and "m" (minute).
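The conversion can be illustrated with a short parser. This is a sketch written for this page, not the CuttlefishDurationMinutes() function from leveldb_ee: DurationToMinutes() is a hypothetical name, it rejects malformed input outright (the real routine may behave differently), and it does not guard against overflow.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Convert a duration string such as "2d7h32m" to minutes.
// Returns false if the text is empty, a number lacks a suffix,
// or an unknown character appears (including a decimal point).
bool DurationToMinutes(const std::string& text, uint64_t& minutes) {
    if (text.empty()) return false;

    uint64_t total = 0;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Each component must start with at least one digit.
        if (!std::isdigit(static_cast<unsigned char>(text[pos]))) return false;

        uint64_t number = 0;
        while (pos < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[pos]))) {
            number = number * 10 + static_cast<uint64_t>(text[pos] - '0');
            ++pos;
        }
        if (pos == text.size()) return false;  // number without a suffix

        uint64_t unit = 0;  // minutes per unit
        switch (text[pos]) {
            case 'f': unit = 14 * 24 * 60; break;  // fortnight
            case 'w': unit = 7 * 24 * 60; break;   // week
            case 'd': unit = 24 * 60; break;       // day
            case 'h': unit = 60; break;            // hour
            case 'm': unit = 1; break;             // minute
            default: return false;
        }
        total += number * unit;
        ++pos;
    }
    minutes = total;
    return true;
}
```

With this sketch, "2d7h32m" yields 3,332 minutes and "1f" yields 20,160 minutes, while "1.5d" is rejected because of the decimal point.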

Branch description

leveldb's enterprise edition requires the following command to retrieve the leveldb_ee repository:

cd leveldb
git submodule update --init

then:

make clean
make

eleveldb automates the above based upon setting the BASHO_EE environment variable to 1 (important: no spaces between BASHO_EE, the equal sign, and the number 1):

cd eleveldb
export BASHO_EE=1

then:

make clean
make

Also, the dr-th/service-poc branch within basho/riak_core is needed to activate the bucket expiry feature.

leveldb_ee: cache_warm.cc

Cleared a signed versus unsigned comparison warning.

leveldb_ee: cuttlefish.cc / cuttlefish_test.cc

This source file contains the CuttlefishDurationMinutes() function. This function duplicates most of the capabilities found within:

https://github.com/basho/cuttlefish/blob/develop/src/cuttlefish_duration.erl

The notable exception is that this function does not support decimal values. Example: "1.5d" is the equivalent of "36h" in cuttlefish_duration.erl, but parses as "5d" (that is, "120h") in cuttlefish.cc.

leveldb_ee: expiry_ee.cc / expiry_ee.h

gUserExpirySample is a local static variable that holds a pointer to a user created ExpiryModuleEE object. The pointer used is a reference counted pointer. The user object is automatically freed upon destruction of its last pointer.

CreateExpiryModule() is a factory function. Exactly one definition exists in any leveldb build: either the routine in this source file or the function of the same name in leveldb/leveldb_os/expiry_os_stub.cc. Riak's EE build compiles and uses the one in expiry_ee.cc. CreateExpiryModule() has Router as a function parameter. Router is the address of eleveldb's callback router that gives the expiry functions access to Riak KV/EE bucket property information. Router is passed from CreateExpiryModule() to InitPropertyCache(). InitPropertyCache() saves the Router address for property cache lookup operations. Only the very first call to CreateExpiryModule() actually uses the Router parameter.

Each call to CreateExpiryModule() uses gUserExpirySample to initialize the new expiry object with the same settings as the most recently stored user object. In case the newly created object is destined for the PropertyCache, it is given a five minute time limit setting. ExpiryModule objects used as part of leveldb's database Options structure ignore this time limit.

ExpiryModuleEE::operator=() is a common C++ assignment operator. It copies member variable settings from the "rhs" object into the member variables of "this" object. Note that the time limit setting, m_ExpiryModuleExpiry, is not copied.
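The pattern of an assignment operator that deliberately skips one member looks roughly like the following. ExpirySketch and its member names are hypothetical stand-ins; only the idea that the time limit stays with the destination object comes from the description above.

```cpp
#include <cstdint>

// Hypothetical stand-in for an expiry settings object.
class ExpirySketch {
public:
    bool expiry_enabled;
    uint64_t expiry_minutes;
    bool whole_file_expiry;
    uint64_t object_time_limit;  // plays the role of m_ExpiryModuleExpiry

    ExpirySketch()
        : expiry_enabled(false), expiry_minutes(0),
          whole_file_expiry(false), object_time_limit(0) {}

    ExpirySketch& operator=(const ExpirySketch& rhs) {
        if (this != &rhs) {
            expiry_enabled = rhs.expiry_enabled;
            expiry_minutes = rhs.expiry_minutes;
            whole_file_expiry = rhs.whole_file_expiry;
            // object_time_limit is intentionally left unchanged:
            // it describes this cached object, not the user's settings.
        }
        return *this;
    }
};
```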

NoteUserExpirySettings() is a function that leveldb explicitly calls when opening a user database, not an internal database. It gives the expiry system a way of learning the user's global expiry configuration in riak.conf. The settings then become the default settings for bucket ExpiryModule objects.

ExpiryModuleEE::MemTableInserterCallback() is called by leveldb just before a new write operation stores the user's key/value pair in the write buffer. The role of ExpiryModuleEE is to look up the bucket's expiry properties and use those when calling the existing ExpiryModuleOS::MemTableInserterCallback(). It is important to note that this function never "expires" a key. It simply determines if the key should be augmented with a time stamp in the key. It is ok to add a time stamp to a key that is not intended for expiry. So the process proceeds even if there is a failure to retrieve bucket specific properties via the Lookup() call.

ExpiryModuleEE::KeyRetirementCallback() does expire keys, removing them from the system during compaction. Therefore, this function will "disable" itself if Lookup() fails when retrieving bucket specific expiry settings. The global expiry settings could be completely different from the bucket specific settings. The safest action is to leave the key alone if Lookup() fails.

ExpiryModuleEE::TableBuilderCallback() does not expire keys. It feeds leveldb's metadata generation for individual .sst table files. This routine will call the generic ExpiryModuleOS version even if Lookup() fails while attempting retrieval of the bucket specific expiry settings.

ExpiryModuleEE::IsFileExpired() indicates when leveldb may directly delete an entire file of expired keys without executing a compaction. Since it deletes data, it aborts if the bucket specific Lookup() fails. Abort in this case means that the target file is not acceptable for deletion.
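The four callbacks differ only in what they do when Lookup() fails. The rule described above can be summarized as a small decision function. The enum and the function name are hypothetical; the true/false values restate the behavior given in the preceding paragraphs.

```cpp
// Hypothetical labels for the four ExpiryModuleEE entry points.
enum class ExpiryCallback {
    MemTableInserter,
    KeyRetirement,
    TableBuilder,
    IsFileExpired
};

// True when the callback may continue with the generic (global)
// settings after the bucket specific Lookup() has failed.
bool MayProceedAfterLookupFailure(ExpiryCallback cb) {
    switch (cb) {
        case ExpiryCallback::MemTableInserter:
            return true;   // only adds a time stamp, never deletes
        case ExpiryCallback::TableBuilder:
            return true;   // only feeds .sst metadata
        case ExpiryCallback::KeyRetirement:
            return false;  // would delete keys: leave the key alone
        case ExpiryCallback::IsFileExpired:
            return false;  // would delete a whole file: refuse
    }
    return false;  // unknown callback: take the safe path
}
```

The design choice is that any path able to destroy data refuses to act on settings it could not confirm.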

leveldb_ee: prop_cache_ee.cc / prop_cache_test.cc

The property cache provides leveldb local storage of bucket specific expiry properties. Each entry in the cache is one bucket. PropertyCache is derived from leveldb's Cache object that also manages the file cache and block cache. Its role is to reduce the number of calls from leveldb to eleveldb's Router (and Router's calls into Riak).

The majority of the PropertyCache code is within the leveldb/util directory. This is to simplify compiling of eleveldb's Open Source version. Only PropertyCache::LookupWait() uses Riak enterprise edition routines. The KeyParseBucket() call decodes a Riak version 1 object and is therefore hidden from open source.

PropertyCache::LookupWait() creates two formats of the Riak bucket string. The first format has independent Bucket Type and Bucket strings that get passed to Riak's Erlang code. The second format is a sext() encoded version of the same two fields, which becomes the lookup key for the PropertyCache.

The function calls eleveldb's Router then waits up to one second to receive notification that the requested properties are now within the PropertyCache. It is possible that Riak is too busy to reply within that time frame. In which case, the routine returns a NULL handle.
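The "request, then wait up to one second" step is a standard condition variable pattern. Below is a minimal sketch assuming a plain std::map as the cache and an int as the stored properties. PropCacheSketch is a hypothetical class: the real PropertyCache is built on leveldb's Cache object, and the call to eleveldb's Router is omitted here.

```cpp
#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Hypothetical, simplified property cache.
class PropCacheSketch {
public:
    // Called when Riak's reply arrives: store the value and wake waiters.
    void Insert(const std::string& key, int props) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            entries_[key] = props;
        }
        cond_.notify_all();
    }

    // Wait until the key appears or the time limit passes.
    // Returns false on timeout (the analogue of a NULL handle).
    bool LookupWait(const std::string& key, int& props,
                    std::chrono::milliseconds limit = std::chrono::milliseconds(1000)) {
        std::unique_lock<std::mutex> lock(mutex_);
        bool found = cond_.wait_for(
            lock, limit, [&] { return entries_.count(key) != 0; });
        if (found) props = entries_[key];
        return found;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::map<std::string, int> entries_;
};
```

The predicate form of wait_for() rechecks the cache after every wakeup, so a notification for a different bucket does not end the wait early.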

leveldb_ee: riak_object.cc / riak_object.h / riak_object_test.cc

All changes in this file, except the new functions BuildRiakKey() and WriteSextString(), relate to a single issue. The previous release of this code always parsed a Riak key to return both a Bucket Type string and a Bucket string. Now the code can return either the two strings, or simply return the sext() encoded version of both. The sext() encoded version is a faster format for storing and finding entries within the PropertyCache. The two string format, Bucket Type and Bucket, is needed to request bucket properties from Riak.

BuildRiakKey() is currently used only within the unit tests. It is a sext() encoder for Riak keys. Using the existing Erlang or Erlang NIF code that does the same would create a very convoluted unit test within leveldb. WriteSextString() is a support function for BuildRiakKey(). It creates sext() encoded binaries and strings.

leveldb: db/db_impl.cc

ULONG_MAX and ULLONG_MAX are equivalent on 64 bit platforms, but not on 32 bit platforms. ULLONG_MAX yields the same 64 bit number on both 32 bit and 64 bit platforms.
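The difference is easy to check at compile time. The sketch below is illustrative only; kNoLimitSketch and UnsignedLongMatchesLongLong() are hypothetical names and do not appear in db_impl.cc.

```cpp
#include <climits>
#include <cstdint>

// unsigned long long is guaranteed to hold at least 64 bits.
static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
              "unsigned long long must hold 64 bits");

// Hypothetical "largest value" marker for a 64 bit field. Using
// ULLONG_MAX gives the same number on 32 bit and 64 bit builds.
const uint64_t kNoLimitSketch = ULLONG_MAX;

// True only where unsigned long is as wide as unsigned long long,
// which is typical of 64 bit builds and not of 32 bit builds.
bool UnsignedLongMatchesLongLong() {
    return ULONG_MAX == ULLONG_MAX;
}
```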

SanitizeOptions() is part of leveldb's Open database process. It now includes a call to NoteUserExpirySettings() if the user's Options structure includes an expiry object (and if the database being opened is not an internal database). NoteUserExpirySettings() does nothing in an open source build. In the enterprise edition, the call copies the pointer to the ExpiryModuleEE object. That pointer allows the ExpiryModuleEE object to be the property template for future ExpiryModuleEE objects created by the bucket processing code.

leveldb: db/table_cache.cc

Removed debug counters left over from some previous branch / investigation.

leveldb: db/version_edit.h

Clean up ULONG_MAX versus ULLONG_MAX as discussed previously.

leveldb: db/version_set.h

Added "virtual" prefix to the ~Version() destructor to clear up a warning on some platforms (warning appears in build of unit test code).

leveldb: include/leveldb/expiry.h

Create type definitions for the eleveldb Router callback function and its router actions.

Create an explicit value for expiry unlimited. Unlimited was previously implied by zero, but zero also implied that write time expiry was not active. Now the two definitions have unique, individual values.
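A sketch of why a distinct sentinel helps, assuming the retention time is carried as a 64 bit count of minutes. The constant kUnlimitedSketch and its value are hypothetical; the actual constant defined in expiry.h may use a different value.

```cpp
#include <cstdint>

// Hypothetical sentinel for "unlimited"; zero keeps its own meaning.
const uint64_t kUnlimitedSketch = UINT64_MAX;

enum class Retention {
    Disabled,   // 0: write time expiry is not active
    Unlimited,  // sentinel: record expiry information, no time limit
    Timed       // anything else: expire after this many minutes
};

// Classify a retention value now that zero and "unlimited" differ.
Retention Classify(uint64_t expiry_minutes) {
    if (expiry_minutes == 0) return Retention::Disabled;
    if (expiry_minutes == kUnlimitedSketch) return Retention::Unlimited;
    return Retention::Timed;
}
```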

leveldb: include/leveldb/perf_count.h & util/perf_count.cc

Add definitions for three property cache statistics: Hit, Miss, and Error.

leveldb: leveldb_os/expiry_os_stub.cc

This source file only compiles during the open source build. It is used to provide simple alternatives to more complex implementations of the same function within the enterprise edition build.

leveldb: leveldb_os/prop_cache_stub.cc

This source file only compiles during the open source build. It provides a simple, do nothing stand-in for the more complex, Riak specific implementation within the enterprise edition build.

leveldb: port/port_posix.cc

This Log() message will post pthread errors to the syslog. (NULL as the first parameter tells the logging code to use syslog.) This is to counter the error where thread creation was silently failing due to lack of memory in a customer's machine.

leveldb: tools/sst_scan.cc

Moved a line of inactive code to prevent a compile time warning. Code not removed since it has a planned, future usage.

leveldb: util/cache.cc

Removed comments that were once true but are no longer valid.

leveldb: util/env_posix.cc

eleveldb calls Env::Shutdown() during Riak's stop operation. Env::Shutdown() terminates threads and frees various memory resources. This shutdown code is essential so that the Valgrind tool can properly find new memory leaks. This branch adds a call within Env::Shutdown() to also shut down the ExpiryModule() so that expiry code releases its pointer to a user's ExpiryModuleEE object.

leveldb: util/expiry_os.cc

Many clarifying comments added.

ExpiryModuleOS::KeyRetirementCallback() updated to explicitly use the new kExpiryUnlimited value to disable evaluation of expiry retention period.

Logic for testing file level expiry moved out of ExpiryModuleOS::CompactionFinalizeCallback to new function IsFileExpired(). This change allows the ExpiryModuleEE version of this routine to first retrieve bucket specific properties and call IsFileExpired() with properties specific to the bucket.

leveldb: util/prop_cache.cc & prop_cache.h

The functions within the PropertyCache class are wrappers for the equivalent functions within leveldb's original Cache class. Part of the wrapping logic is to make callouts to the eleveldb Router to retrieve bucket property information from Riak.

A segfault would occur sporadically upon Riak shutdown during development of this branch. The crash was never caught via debugger. The bug is likely that Riak is still processing an eleveldb property request when eleveldb shuts down. That message begins to post to the property cache as the cache is destroyed. The lPropCacheLock and use of the reference counted lPropCache pointer counter the segfault race condition.

eleveldb: c_src/atoms.h

This file appears to parallel the Erlang atom definitions within c_src/eleveldb.cc. However, it has not been maintained. Updated it to the current state and added new atoms used for bucket expiry. Likely the list of atoms at the beginning and end of eleveldb.cc should become a single list of macros here that compile as declarations or definitions depending upon flags preceding the atoms.h #include directive ... just sayin'

eleveldb: c_src/eleveldb.cc

Add eleveldb::property_cache and eleveldb::set_metadata_pid functions to the nif_funcs[] array. This array allows routing of execution from same named functions within Erlang code to the replacement function here in C/C++ code.

Added gBucketPropCallback. This global structure holds the process identifier for the Erlang process that "services" leveldb's property requests. The structure includes a bool to indicate whether or not it was properly initialized by Riak. When the flag is not set, service request messages do not occur. This allows the eleveldb code to function safely with older Riak versions that do not contain the "service" feature.

CreateExpiryModule() calls now include the address of eleveldb's Router function. The Router function takes requests from leveldb and translates them to Erlang messages. The Router passes the messages to the Riak service via the pid within gBucketPropCallback.

on_unload() contains new lines to disable the service callback.

eleveldb: c_src/router.cc & router.h

leveldb_callback() receives calls from leveldb. In this branch, its only duty is to process requests to retrieve bucket properties. However, there are known extensions to this function on the horizon for other types of requests. The parameters of a bucket property request are converted to Erlang terms, then a message containing those terms is sent to the port driver (process gCallbackRouterPid). There is no direct response to this function from the port driver. The port driver sends its response, if any, by calling the property_cache() function.

property_cache() receives calls from the port driver process (gCallbackRouterPid). property_cache() converts the Erlang terms into C++ data, then stores that data within the leveldb PropertyCache via the Insert() call. PropertyCache::Insert() includes code to signal a condition variable that informs waiting PropertyCache::Lookup() operations that new data is in the cache.

parse_expiry_properties() selects only the bucket properties relating to expiry and converts them to C++ equivalents within the passed expiry object.
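The value conversion can be pictured with plain strings, using the bucket values from the cross reference tables earlier on this page. This is a sketch under assumptions: the real routine works on Erlang terms rather than std::string, ExpiryFields and both function names are hypothetical, and mapping "per_item" to whole_file_expiry = false is an inference from the riak.conf "normal" mode.

```cpp
#include <string>

// Hypothetical subset of the expiry object's fields.
struct ExpiryFields {
    bool expiry_enabled;
    bool whole_file_expiry;
};

// Apply the bucket property "expiration". Returns false if the
// value is not recognized, leaving the fields untouched.
bool ApplyExpiration(const std::string& value, ExpiryFields& fields) {
    if (value == "enabled") { fields.expiry_enabled = true; return true; }
    if (value == "disabled") { fields.expiry_enabled = false; return true; }
    return false;
}

// Apply the bucket property "expiration_mode".
bool ApplyExpirationMode(const std::string& value, ExpiryFields& fields) {
    if (value == "whole_file") { fields.whole_file_expiry = true; return true; }
    if (value == "per_item") { fields.whole_file_expiry = false; return true; }
    if (value == "use_global_config") return true;  // keep inherited value
    return false;
}
```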

set_metadata_pid() is used by Erlang process creators and rebuilder routines to keep eleveldb's port driver process id up to date.

eleveldb: private/eleveldb.schema

Removed the code that translated the atom "unlimited" to zero. Now the unlimited atom and zero have independent meanings and all dependent code is updated (unit tests too).

eleveldb: src/eleveldb_metadata.erl

This module routes the bucket property information from Riak to the callback_router in router.cc.
