storage: build SSTs from KV_BATCH snapshot #38932

Merged: 4 commits, Aug 9, 2019

Commits on Aug 9, 2019

  1. storage: implement writer interface for RocksDBSstFileWriter

    Rename Add to Put and Delete to Clear. Additionally implement ClearRange
    using DBSstFileWriterDeleteRange and ClearRangeIter using a Clear on all
    iterated keys on the Go side.
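
The ClearRangeIter behavior described above can be sketched roughly as follows. This is an illustration only: `recordingWriter` and the sorted-slice "iterator" are hypothetical stand-ins, not the actual RocksDBSstFileWriter or iterator types.

```go
package main

import "fmt"

// recordingWriter is a hypothetical writer that records the operations it
// receives instead of producing an SST.
type recordingWriter struct {
	ops []string
}

// Put records a write of key (formerly named Add).
func (w *recordingWriter) Put(key, value string) {
	w.ops = append(w.ops, "put "+key)
}

// Clear records a point deletion of key (formerly named Delete).
func (w *recordingWriter) Clear(key string) {
	w.ops = append(w.ops, "clear "+key)
}

// ClearRange records a single range deletion over [start, end).
func (w *recordingWriter) ClearRange(start, end string) {
	w.ops = append(w.ops, "clearrange "+start+" "+end)
}

// ClearRangeIter issues one Clear per iterated key in [start, end).
// The iterator is modeled here as a sorted slice of keys (an assumption).
func (w *recordingWriter) ClearRangeIter(keys []string, start, end string) {
	for _, k := range keys {
		if k >= start && k < end {
			w.Clear(k)
		}
	}
}

func main() {
	w := &recordingWriter{}
	w.ClearRangeIter([]string{"a", "b", "c", "d"}, "b", "d")
	fmt.Println(w.ops)
}
```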
    
    Release note: None
    jeffrey-xiao committed Aug 9, 2019
    c9f88f1
  2. storage: add Truncate method to RocksDBSstFileWriter

    This method truncates the SST file being written and returns the data
    that was truncated. It can be called multiple times while writing an SST
    file and can be used to chunk an SST file into pieces. Since SSTs are
    built in an append-only manner, the concatenation of the chunks is
    equivalent to an SST built using only Finish, without Truncate.
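
A minimal sketch of the chunking property, assuming a simplified in-memory writer. `chunkWriter` and `buildChunked` are hypothetical names and the byte encoding is made up; a real SST also carries index and footer data that this model ignores.

```go
package main

import (
	"bytes"
	"fmt"
)

// chunkWriter is a hypothetical stand-in for an append-only SST writer.
// It models only the buffer behavior, not the real file format.
type chunkWriter struct {
	buf bytes.Buffer
}

// Put appends an encoded key/value pair to the in-memory buffer.
func (w *chunkWriter) Put(key, value string) {
	w.buf.WriteString(key)
	w.buf.WriteByte('=')
	w.buf.WriteString(value)
	w.buf.WriteByte(';')
}

// Truncate returns a copy of everything written since the previous call
// and resets the buffer.
func (w *chunkWriter) Truncate() []byte {
	chunk := append([]byte(nil), w.buf.Bytes()...)
	w.buf.Reset()
	return chunk
}

// Finish returns the remaining bytes. In this simplified model it behaves
// like a final Truncate.
func (w *chunkWriter) Finish() []byte {
	return w.Truncate()
}

// buildChunked writes kvs, truncating after every n entries (never, if
// n <= 0), and returns the concatenation of all chunks.
func buildChunked(kvs [][2]string, n int) []byte {
	var w chunkWriter
	var out []byte
	for i, kv := range kvs {
		w.Put(kv[0], kv[1])
		if n > 0 && (i+1)%n == 0 {
			out = append(out, w.Truncate()...)
		}
	}
	return append(out, w.Finish()...)
}

func main() {
	kvs := [][2]string{{"a", "1"}, {"b", "2"}, {"c", "3"}}
	whole := buildChunked(kvs, 0)   // Finish only
	chunked := buildChunked(kvs, 2) // Truncate every two entries
	fmt.Println(bytes.Equal(whole, chunked))
}
```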
    
    Release note: None
    jeffrey-xiao committed Aug 9, 2019
    a24b098
  3. storage: add SSTSnapshotStorage

    SSTSnapshotStorage is associated with a store and can be used to create
    SSTSnapshotStorageScratches. Each SSTSnapshotStorageScratch is
    associated with a snapshot and keeps track of the SSTs incrementally
    created when receiving a snapshot.
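
The relationship between the two types might look roughly like this. Everything here is hypothetical: the type names, method names, and the directory layout are assumptions for illustration and are not taken from the commit.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// snapshotStorage is a hypothetical per-store object that hands out
// scratch spaces rooted under a single directory.
type snapshotStorage struct {
	dir string
}

// snapshotScratch is a hypothetical per-snapshot object that tracks the
// SST files created while the snapshot is being received.
type snapshotScratch struct {
	dir  string
	ssts []string
}

// NewScratch creates the scratch space for one snapshot. The path layout
// (store dir / range ID / snapshot ID) is an assumption.
func (s *snapshotStorage) NewScratch(rangeID int, snapID string) *snapshotScratch {
	return &snapshotScratch{
		dir: filepath.Join(s.dir, strconv.Itoa(rangeID), snapID),
	}
}

// NewFile reserves the next SST path and records it in creation order.
func (sc *snapshotScratch) NewFile() string {
	name := filepath.Join(sc.dir, strconv.Itoa(len(sc.ssts))+".sst")
	sc.ssts = append(sc.ssts, name)
	return name
}

// SSTs returns the paths of all SSTs created so far.
func (sc *snapshotScratch) SSTs() []string {
	return sc.ssts
}

func main() {
	storage := &snapshotStorage{dir: "sstsnapshot"}
	scratch := storage.NewScratch(1, "snap-uuid")
	scratch.NewFile()
	scratch.NewFile()
	fmt.Println(scratch.SSTs())
}
```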
    
    Release note: None
    jeffrey-xiao committed Aug 9, 2019
    56c7a56
  4. storage: build SSTs from KV_BATCH snapshot

    Incrementally build SSTs from the batches sent in a KV_BATCH snapshot.
    This logic is only on the receiver side for ease of testing and
    compatibility.
    
    The complications of subsumed replicas that are not fully contained by
    the current replica are also handled. The following is an example of
    this case happening.
    
    a       b       c       d
    |---1---|-------2-------|  S1
    |---1-------------------|  S2
    |---1-----------|---3---|  S3
    
    Since the merge is the first operation to happen, a follower could be
    down before it completes. It is reasonable for r1-snapshot from S3 to
    subsume both r1 and r2 in S1. Note that it's impossible for a replica to
    subsume anything to its left.
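
To make the diagram concrete: on S1, the r1 snapshot covers [a, c) while r2 covers [b, d), so [c, d) is left over and must be cleared separately. A minimal sketch of that computation, assuming string keys and hypothetical names (`span`, `uncoveredRight`); only the right side is checked because nothing to the left can be subsumed.

```go
package main

import "fmt"

// span is a hypothetical half-open key range [start, end).
type span struct {
	start, end string
}

// uncoveredRight returns the key span of the subsumed replicas that extends
// past the end of the snapshot's span, and whether such a span exists.
func uncoveredRight(snapshot span, subsumed []span) (span, bool) {
	maxEnd := snapshot.end
	for _, s := range subsumed {
		if s.end > maxEnd {
			maxEnd = s.end
		}
	}
	if maxEnd == snapshot.end {
		// Every subsumed replica is fully contained by the snapshot.
		return span{}, false
	}
	return span{start: snapshot.end, end: maxEnd}, true
}

func main() {
	snapshot := span{"a", "c"}     // r1 as it exists on S3
	subsumed := []span{{"b", "d"}} // r2 as it exists on S1
	rest, ok := uncoveredRight(snapshot, subsumed)
	fmt.Println(rest, ok)
}
```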
    
    The maximum number of SSTs created using this strategy is 4 + SR + 2,
    where SR is the number of subsumed replicas:
    
    - Three SSTs get created when the snapshot is being received (range
      local keys, replicated range-id local keys, and user keys).
    - One SST is constructed for the unreplicated range-id local keys when
      the snapshot is being applied.
    - One SST is constructed for every subsumed replica to clear the
      range-id local keys. These SSTs consist of one range deletion
      tombstone and one RaftTombstoneKey.
    - At most two SSTs, shared across all subsumed replicas, are constructed
      to account for subsumed replicas that are not fully contained. They
      delete the key space of the subsumed replicas that the previous SSTs
      did not delete: one SST for the range-local keys and one for the user
      keys. Each of these SSTs consists of normal tombstones or one range
      deletion tombstone, or it may be empty.
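
The bound can be written out as simple arithmetic. The constant and function names below are hypothetical; the counts come straight from the list above.

```go
package main

import "fmt"

const (
	receivedSSTs       = 3 // range local, replicated range-id local, user keys
	unreplicatedSSTs   = 1 // unreplicated range-id local keys, built at apply time
	maxUncontainedSSTs = 2 // range-local and user keys of not fully contained replicas
)

// maxSnapshotSSTs returns the upper bound 4 + SR + 2 for SR subsumed replicas.
func maxSnapshotSSTs(subsumedReplicas int) int {
	return receivedSSTs + unreplicatedSSTs + subsumedReplicas + maxUncontainedSSTs
}

func main() {
	for _, sr := range []int{0, 1, 2} {
		fmt.Printf("SR=%d -> at most %d SSTs\n", sr, maxSnapshotSSTs(sr))
	}
}
```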
    
    This commit also introduces a cluster setting
    "kv.snapshot_sst.sync_size" which defines the maximum SST chunk size
    before fsync-ing. Fsync-ing is necessary to prevent the OS from
    accumulating such a large buffer that it blocks unrelated small/fast
    writes for a long time when it flushes.
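
The idea of bounding unsynced data can be sketched as below. This is not the commit's implementation: `syncWriter`, `memFile`, and `writeWithSync` are hypothetical names, and the 4-byte sync size is chosen only to keep the example small.

```go
package main

import (
	"fmt"
	"os"
)

// syncWriter is the subset of *os.File this sketch needs.
type syncWriter interface {
	Write(p []byte) (int, error)
	Sync() error
}

// memFile is an in-memory stand-in that counts Sync calls.
type memFile struct {
	data  []byte
	syncs int
}

func (m *memFile) Write(p []byte) (int, error) {
	m.data = append(m.data, p...)
	return len(p), nil
}

func (m *memFile) Sync() error {
	m.syncs++
	return nil
}

// writeWithSync writes data in pieces of at most syncSize bytes and syncs
// after each piece, so the OS never buffers more than syncSize unsynced
// bytes from this writer. It returns the number of syncs performed.
func writeWithSync(w syncWriter, data []byte, syncSize int) (int, error) {
	if syncSize <= 0 {
		return 0, fmt.Errorf("syncSize must be positive, got %d", syncSize)
	}
	syncs := 0
	for len(data) > 0 {
		n := syncSize
		if n > len(data) {
			n = len(data)
		}
		if _, err := w.Write(data[:n]); err != nil {
			return syncs, err
		}
		if err := w.Sync(); err != nil {
			return syncs, err
		}
		syncs++
		data = data[n:]
	}
	return syncs, nil
}

func main() {
	f, err := os.CreateTemp("", "sst-chunk-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	syncs, err := writeWithSync(f, []byte("abcdefghij"), 4)
	if err != nil {
		panic(err)
	}
	fmt.Println("syncs:", syncs)
}
```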
    
    Release note (performance improvement): Snapshots sent between replicas
    are now applied more performantly and use less memory.
    jeffrey-xiao committed Aug 9, 2019
    b320ff5