mv tiered recovery log

Matthew Von-Maszewski edited this page Mar 3, 2016 · 5 revisions

Status

  • merged to master - February 29, 2016
  • code complete - February 29, 2016
  • development started - February 26, 2016

History / Context

@jolson7168 discovered and detailed a scenario in which data written to a Riak node using leveldb's tiered storage could be lost. His bug report is found here: https://github.com/basho/riak_kv/issues/1356

This issue is limited to customers using leveldb tiered storage. leveldb permanently writes user data in 30 to 60 megabyte chunks. Before the permanent write, it maintains a temporary “recovery log”. Currently, leveldb is placing the very first recovery log in an incorrect directory. This first recovery log, and only the first, is lost if leveldb restarts before the user has stored sufficient data to cause leveldb’s first permanent write. All subsequent data is immune to this bug.

The underlying problem is that Basho's modifications for tiered storage modify the database name variable, dbname_, of the DBImpl object (the database implementation object). The modified name is then used throughout Google's original code as if it were the name given by the user. Unfortunately, one essential routine, DB::Open(), was written to create the first recovery log file directly, without regard to the DBImpl object. DB::Open() was therefore blind to Basho's modification of the dbname_ variable in support of tiered storage. This created a disconnect between the directory path of the first recovery log file and the directory path of all subsequent recovery log files, and placed the first recovery log file in a location where the leveldb start-up logic would not expect to find it. Hence, the user loses the first recovery log file's data if leveldb is restarted before leveldb would normally convert that data to a permanent .sst table file.
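The path disconnect can be illustrated with a small, self-contained sketch. This is not leveldb code: ToyDb, TieredName(), ToyLogFileName(), and the "/fast" prefix are hypothetical names chosen for illustration, and the way the tier prefix is joined to the user's name is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the tiered storage rewrite of the database name.
// Assumption: the tier prefix is simply prepended to the user's name.
std::string TieredName(const std::string& fast_prefix,
                       const std::string& user_dbname) {
  return fast_prefix + "/" + user_dbname;
}

// Toy recovery log naming: <dbname>/<six digit file number>.log
std::string ToyLogFileName(const std::string& dbname, uint64_t number) {
  assert(!dbname.empty());
  char buf[32];
  std::snprintf(buf, sizeof(buf), "/%06llu.log",
                static_cast<unsigned long long>(number));
  return dbname + buf;
}

// Toy database object holding both views of the name.
struct ToyDb {
  std::string user_dbname;  // name exactly as the caller passed it
  std::string dbname_;      // name after the tiered storage modification
};

// Path the buggy open routine would use: built from the caller's name.
std::string FirstLogPathBuggy(const ToyDb& db, uint64_t number) {
  return ToyLogFileName(db.user_dbname, number);
}

// Path used for all later logs and by start-up recovery: built from dbname_.
std::string LogPathFromImpl(const ToyDb& db, uint64_t number) {
  return ToyLogFileName(db.dbname_, number);
}
```

With a ToyDb whose user_dbname is "data/db" and whose dbname_ is TieredName("/fast", "data/db"), the two functions return paths in different directories for the same file number, which is the mismatch described above.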

Original tiered storage discussion is here

Branch Description

db/db_impl.cc / db/db_impl.h

Google's original code has two routines that use the same sequence of commands to create a new recovery log file: DBImpl::MakeRoomForWrite() and DB::Open(). DB::Open() creates the first recovery log file for each database session. DBImpl::MakeRoomForWrite() creates all subsequent recovery log files. The sequence of commands is identical except for a couple of the parameters used.

This branch creates a new function, DBImpl::NewRecoveryLog(), that uses the exact same sequence of commands but unifies the parameters used. The parameters now come from within DBImpl and are therefore tiered storage aware. The resulting single code path is also safer for future maintenance.
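The shape of this refactoring can be sketched as follows. This is a simplified illustration, not the actual branch code: ToyDbImpl, ToyOpen(), and ToyMakeRoomForWrite() are hypothetical stand-ins, and the real function also has to create the file and the log writer, which the sketch omits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Toy implementation object. dbname_ is assumed to already carry the
// tiered storage modification.
class ToyDbImpl {
 public:
  explicit ToyDbImpl(std::string tiered_dbname)
      : dbname_(std::move(tiered_dbname)) {
    assert(!dbname_.empty());
  }

  // Single code path for creating a recovery log. Every input comes from
  // member state, so no caller can supply an unmodified database name.
  std::string NewRecoveryLog() {
    const uint64_t number = next_file_number_++;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "/%06llu.log",
                  static_cast<unsigned long long>(number));
    const std::string path = dbname_ + buf;
    created_logs_.push_back(path);  // stands in for creating the file
    logfile_number_ = number;
    return path;
  }

  const std::vector<std::string>& created_logs() const {
    return created_logs_;
  }

 private:
  std::string dbname_;
  uint64_t next_file_number_ = 1;
  uint64_t logfile_number_ = 0;
  std::vector<std::string> created_logs_;
};

// Stand-ins for the two callers. Both now delegate to the same member.
std::string ToyOpen(ToyDbImpl& impl) { return impl.NewRecoveryLog(); }
std::string ToyMakeRoomForWrite(ToyDbImpl& impl) {
  return impl.NewRecoveryLog();
}
```

Because neither caller passes a path or a database name, the first log and every subsequent log necessarily land in the same directory.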

db/db_test.cc

Added a new test, TieredRecoveryLog, that reproduces the failing scenario with the prior code and demonstrates proper behavior with this branch.
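The scenario such a test exercises can be modeled with an in-memory toy. The sketch below is not the TieredRecoveryLog test itself and does not use leveldb's test harness; ToyFs, ToySession, ToyOpen(), ToyPut(), and ToyRecover() are hypothetical, and the assumption that recovery scans only the tiered directory is a simplification of the start-up logic described earlier.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// In-memory stand-in for a file system: path -> records appended to it.
using ToyFs = std::map<std::string, std::vector<std::string>>;

struct ToySession {
  std::string user_dbname;    // name as given by the caller
  std::string tiered_dbname;  // name after the tiered storage modification
  std::string current_log;    // path of the active recovery log
};

// Opens a toy database and creates its first recovery log. With buggy set,
// the first log goes under the caller's name instead of the tiered name.
ToySession ToyOpen(ToyFs& fs, const std::string& user_dbname,
                   const std::string& fast_prefix, bool buggy) {
  ToySession s;
  s.user_dbname = user_dbname;
  s.tiered_dbname = fast_prefix + "/" + user_dbname;
  const std::string& dir = buggy ? s.user_dbname : s.tiered_dbname;
  s.current_log = dir + "/000001.log";
  fs[s.current_log];  // create the empty log file
  return s;
}

// Appends one record to the active recovery log (no permanent write yet).
void ToyPut(ToyFs& fs, const ToySession& s, const std::string& record) {
  assert(fs.count(s.current_log) == 1);
  fs[s.current_log].push_back(record);
}

// Restart: replay only the .log files found under the tiered directory.
std::vector<std::string> ToyRecover(const ToyFs& fs,
                                    const std::string& tiered_dbname) {
  std::vector<std::string> recovered;
  const std::string dir = tiered_dbname + "/";
  for (const auto& entry : fs) {
    const std::string& path = entry.first;
    const bool in_dir = path.compare(0, dir.size(), dir) == 0;
    const bool is_log = path.size() >= 4 &&
                        path.compare(path.size() - 4, 4, ".log") == 0;
    if (in_dir && is_log) {
      recovered.insert(recovered.end(), entry.second.begin(),
                       entry.second.end());
    }
  }
  return recovered;
}
```

Opening with buggy set to true, writing one record, and then calling ToyRecover() yields nothing, because the first log sits outside the directory that recovery scans. Repeating the steps with buggy set to false recovers the record.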