storage: Raft state corruption when receiving snapshot #39604

jeffrey-xiao · 2019-08-12T19:01:28Z

Currently on master, acceptance/bank/cluster-recovery would occasionally fail after (*Replica).applySnapshot’s call to r.store.engine.IngestExternalFiles.

Adding some logging statements around this area, it looks like the old Raft entries are not being deleted which causes the last index to diverge from the truncated index:

I190810 22:37:23.147087 398 storage/replica_raftstorage.go:931  [n2,s2,r19/2:/Table/2{3-4}] [jeffreyxiao][BEFORE INGESTION] LAST INDEX: 31; LAST TERM: 0; TRUNCATED STATE: index:31 term:7
W190810 22:37:23.147167 398 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190810 22:37:23.152563 17 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190810 22:37:23.153370 398 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190810 22:37:23.156833 398 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
I190810 22:37:23.162646 398 storage/replica_raftstorage.go:943  [n2,s2,r19/2:/Table/2{3-4}] [jeffreyxiao][AFTER INGESTION] LAST INDEX: 20; TRUNCATED STATE: {31 7}

The deletion of the Raft entries should be handled by the range deletion of the unreplicated range-id SST:

cockroach/pkg/storage/replica_raftstorage.go

Lines 847 to 853 in b320ff5

    
           // Clearing the unreplicated state. 
        
           unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.RangeID) 
        
           unreplicatedStart := engine.MakeMVCCMetadataKey(unreplicatedPrefixKey) 
        
           unreplicatedEnd := engine.MakeMVCCMetadataKey(unreplicatedPrefixKey.PrefixEnd()) 
        
           if err = unreplicatedSST.ClearRange(unreplicatedStart, unreplicatedEnd); err != nil { 
        
           	return errors.Wrapf(err, "error clearing range of unreplicated SST writer") 
        
           }

However, when I replace this range deletion tombstone with point deletions, I'm unable to reproduce the failure on acceptance/bank/cluster-recovery (~80 successful runs), which leads me to believe that it's some strange happening with the ingestion of range deletion tombstones.

CC @nvanbenschoten @tbg @ajkr

The text was updated successfully, but these errors were encountered:

tbg · 2019-08-12T19:58:58Z

👍 what are you trying next? I'd disable the deletion of the ingested SSTs (perhaps manually create an additional hardlink) and also create a Rocks checkpoint (CreateCheckpoint) before the call to IngestExternalFiles. With any luck, this will give you a repro which contains a rocks checkpoint that you can call ingest your SSTs into to reproduce the badness at which point it's just a matter of tracking down the bug.

tbg · 2019-08-12T20:01:24Z

You might also have luck w/ a targeted failpoint in an ingestion snap unit test where you run pid := os.Getpid(); p := os.FindProcess(pid); p.Kill() right after IngestExternalFiles. This might give you a very fast repro and then what I said in the previous comment is trivial.

This reverts commit b320ff5. In the above commit, we are starting both to ingest multiple SSTs at the same time, and additionally these SSTs contain range deletion tombstones. Both are firsts, and it turns out that there are some kinks to work out. The commit causes quite a number of failures, so reduce churn while we do so. See cockroachdb#39604. Release note: None

39619: Revert "storage: build SSTs from KV_BATCH snapshot" r=nvanbenschoten a=tbg This reverts commit b320ff5. In the above commit, we are starting both to ingest multiple SSTs at the same time, and additionally these SSTs contain range deletion tombstones. Both are firsts, and it turns out that there are some kinks to work out. The commit causes quite a number of failures, so reduce churn while we do so. See #39604. Release note: None Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

jeffrey-xiao · 2019-08-13T15:00:14Z

I (hopefully) found a pretty minimal unit test reproduction of this issue.

Changing TestStoreRangeMergeSnapshot to use on-disk RocksDB engines instead of in-memory engines causes the last index to not equal the truncated index after ingestion.

Running cockroach debug keys on the checkpoint of the engine immediately after ingestion shows that there are no Raft entries for the range in question. However, stateLoader.LoadLastIndex reads a Raft entry and therefore returns the incorrect last index.

Reproduction diff:

diff --git a/pkg/storage/client_merge_test.go b/pkg/storage/client_merge_test.go
index 98ddd75e01..0404646c49 100644
--- a/pkg/storage/client_merge_test.go
+++ b/pkg/storage/client_merge_test.go
@@ -3154,6 +3154,20 @@ func TestStoreRangeMergeRaftSnapshot(t *testing.T) {
 		// account.
 		startWithSingleRange: true,
 	}
+	for i := 0; i < 3; i++ {
+		tempDir, _ := testutils.TempDir(t)
+		// defer cleanup()
+		cache := engine.NewRocksDBCache(1 << 20)
+		defer cache.Release()
+		eng, err := engine.NewRocksDB(engine.RocksDBConfig{
+			Dir: tempDir,
+		}, cache)
+		if err != nil {
+			t.Fatal(err)
+		}
+		defer eng.Close()
+		mtc.engines = append(mtc.engines, eng)
+	}
 	mtc.Start(t, 3)
 	defer mtc.Stop()
 	store0, store2 := mtc.Store(0), mtc.Store(2)
diff --git a/pkg/storage/replica_raftstorage.go b/pkg/storage/replica_raftstorage.go
index cded380c95..f30fe06a3c 100644
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -932,6 +932,14 @@ func (r *Replica) applySnapshot(
 	}
 	stats.ingestion = timeutil.Now()
 
+	lastIndex, err := r.mu.stateLoader.LoadLastIndex(ctx, r.store.Engine())
+	if err != nil {
+		return err
+	}
+	if s.RaftAppliedIndex != lastIndex {
+		log.Fatalf(ctx, "applied index and last index does not match")
+	}
+
 	// The on-disk state is now committed, but the corresponding in-memory state
 	// has not yet been updated. Any errors past this point must therefore be
 	// treated as fatal.

tbg · 2019-08-13T15:02:03Z

Hmm, sounds weird. Does the problem disappear when you flush the memtable right after the ingestion (before checking LoadLastIndex)?

ajkr · 2019-08-13T17:34:09Z

maybe. I tried the repro and found the unexpected key is in WAL and not yet flushed to SST. I couldn't find a range tombstone in sstable that looks like it covers, however.

unexpected key: 0x01698a757266746c000000000000001000 (/Local/RangeID/2/u/RaftLog/logIndex:16)

found in WAL:

./cockroach debug rocksdb dump_wal --walfile=/tmp/TestStoreRangeMergeRaftSnapshot435829139/000003.log | grep -i 0x01698a757266746c000000000000001000

96,1,731,9303,PUT(0) : 0x01698A757266746C000000000000001000 
359,15,357,31944,DELETE(0) : 0x01698A757266746C000000000000000B00 DELETE(0) : 0x01698A757266746C000000000000000C00 DELETE(0) : 0x01698A757266746C000000000000000D00 DELETE(0) : 0x01698A757266746C000000000000000E00 DELETE(0) : 0x01698A757266746C000000000000000F00 DELETE(0) : 0x01698A757266746C000000000000001000 DELETE(0) : 0x01698A757266746C000000000000001100 DELETE(0) : 0x01698A757266746C000000000000001200 DELETE(0) : 0x01698A757266746C000000000000001300 DELETE(0) : 0x01698A757266746C000000000000001400 DELETE(0) : 0x01698A757266746C000000000000001500 DELETE(0) : 0x01698A757266746C000000000000001600 DELETE(0) : 0x01698A757266746C000000000000001700 PUT(0) : 0x01698A757266747400 PUT(0) : 0x01698A727261736B00

range tombstones:

$ awk '/Data Block/{rt=0}; rt && /HEX/ ; /Range deletions:/ {rt=1};' /tmp/TestStoreRangeMergeRaftSnapshot435829139/*dump.txt
    HEX    0169897200: 0169897300
    HEX    016B12000100: 016B12FFFF000100
    HEX    0200: FFFF00
    HEX    0169897500: 0169897600

ajkr · 2019-08-13T17:38:32Z

Although, considering there's a point deletion for this key in the WAL, I couldn't guess why it shows up in LoadLastIndex()'s scan.

ajkr · 2019-08-14T21:52:41Z

Root cause

We're not passing the global seqnum when reading the range tombstone block so they're all treated as having seqnum zero. During ingestion preparation the range tombstone meta-block gets accessed and (probably accidentally) put in block cache. At that time the global seqno has not been determined so the cached block has no global seqno. Later after the file's ingested and we read its range tombstone meta-block, it'll be retrieved from cache. That leads us to obtain a range tombstone meta-block with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all just zero, so the tombstones cover nothing.

Short-term plan

Ideally Jeffrey can re-land his feature prior to internship ending (end of next week) in order to (a) finish it, and (b) see if it turns up any other fun issues. To do this we are willing to apply some quick and specific fixes to our rocksdb fork even if they aren't easily upstreamable.

The original plan is to pass a flag all the way from file ingestion preparation to ReadRangeDelBlock() (https://github.com/facebook/rocksdb/blob/90cd6c2bb17a54e97c32856254fd666c9cab8dc5/table/block_based/block_based_table_reader.cc#L1398-L1402), indicating whether to set ReadOptions::fill_cache. For ingestion preparation only it'll be false; otherwise (and by default), it'll be true. (citation: @nvanbenschoten)

Alternatively, rather than plumbing through a flag to not cache the range-del block, I wonder if that block should ever be cached for an sstable which has global_seqno == 0 and the range tombstones all have a sequence number of zero. (citation: @petermattis)

Longer-term plan

We could move the range tombstone meta-block read during Open() to only happen if prefetch_index_and_filter_in_cache=true (that variable has already been repurposed for all meta-blocks, e.g., it applies to compression dictionary block). Then the meta-block will have to be lazily loaded in NewRangeTombstoneIterator() if not prefetched. But at that point we have the ability to pass it ReadOptions that tell it not to fill cache: https://github.com/facebook/rocksdb/blob/90cd6c2bb17a54e97c32856254fd666c9cab8dc5/db/external_sst_file_ingestion_job.cc#L409-L414. The final piece of the change would be passing prefetch_index_and_filter_in_cache=false to NewTableReader().

The objective here is for the caller (ingestion preparation) to be able to prevent caching of all blocks by (1) specifying prefetch_index_and_filter_in_cache=false when they open the table, and (2) specifying ReadOptions::fill_cache=false when they read from it.

petermattis · 2019-08-14T22:34:29Z

Root cause

As discussed offline, the original sin is including global_seqno in the Block structure which is cached. From what I've looked at so far, I'd prefer a long term option (and perhaps a short-term option), to remove Block::global_seqno_ and populate BlockIter::global_seqno_ from the table reader's value.

I believe the caching of the range-del block happens when ingestion determines the boundaries for the table. The ingestion code opens a range-del iterator on the sstable and then finds the lower and upper bounds. Note that it also finds the lower and upper bounds for point operations. So we're probably caching the first and last data block in an sstable as well. It should be straightforward to test this hypothesis. If it is true, there is a more serious bug here which could already be affecting CockroachDB. A key in the first or last data block of an ingested sstable which is overwriting an existing key will actually appear older than the existing key. @ajkr or @jeffrey-xiao can you verify this hypothesis?

ajkr · 2019-08-14T22:48:15Z

The ingestion code opens a range-del iterator on the sstable and then finds the lower and upper bounds. Note that it also finds the lower and upper bounds for point operations. So we're probably caching the first and last data block in an sstable as well. It should be straightforward to test this hypothesis. If it is true, there is a more serious bug here which could already be affecting CockroachDB. A key in the first or last data block of an ingested sstable which is overwriting an existing key will actually appear older than the existing key. @ajkr or @jeffrey-xiao can you verify this hypothesis?

The logic that reads the first/last point keys in the sstable during ingestion preparation is here: https://github.com/facebook/rocksdb/blob/4c70cb730614388041b97a31ae2e5addb1279284/db/external_sst_file_ingestion_job.cc#L404-L438. Note it sets ReadOptions::fill_cache=false. By passing those read options to NewIterator(), we prevent it from putting any data blocks it sees into block cache. Although we are passing those same read options to NewRangeTombstoneIterator(), this function unfortunately cannot respect ReadOptions::fill_cache=false since the range tombstone meta-block was already put into cache when the table was opened. So, the problem scope seems limited to range tombstone meta-block, at least according to my understanding.

ajkr · 2019-08-14T22:51:08Z

As discussed offline, the original sin is including global_seqno in the Block structure which is cached. From what I've looked at so far, I'd prefer a long term option (and perhaps a short-term option), to remove Block::global_seqno_ and populate BlockIter::global_seqno_ from the table reader's value.

I like this solution too as having a table-wide, mutable attribute in the Block object seems unnecessary. This solution should allow us to drop the fill_cache=false workaround linked above and let us freely put blocks into block cache during ingestion preparation. Will try it out.

petermattis · 2019-08-15T01:17:59Z

Note it sets ReadOptions::fill_cache=false. By passing those read options to NewIterator(), we prevent it from putting any data blocks it sees into block cache.

Thanks for tracking this down. It is reassuring that exactly this bug was already considered, though unfortunate that the solution didn't apply to the range-del block case.

jeffrey-xiao · 2019-08-15T15:25:54Z

cockroachdb/rocksdb#43 has a temporary fix for the underlying RocksDB bug.

tbg · 2019-08-15T15:51:08Z

I don't think we should close this issue until we have a non-temporary fix.

39689: storage: reintroduce building SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao The final commit of #38932 was previously reverted to due an underlying bug in RocksDB with ingesting range deletion tombstones with a global seqno. See #39604 for discussion on the bug and cockroachdb/rocksdb#43 for the temporary short-term resolution of the bug. Release note: None Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>

ajkr · 2019-08-15T21:40:47Z

As discussed offline, the original sin is including global_seqno in the Block structure which is cached. From what I've looked at so far, I'd prefer a long term option (and perhaps a short-term option), to remove Block::global_seqno_ and populate BlockIter::global_seqno_ from the table reader's value.

I like this solution too as having a table-wide, mutable attribute in the Block object seems unnecessary. This solution should allow us to drop the fill_cache=false workaround linked above and let us freely put blocks into block cache during ingestion preparation. Will try it out.

Actually @jeffrey-xiao might try it out. No worries if you run out of time - I can reclaim it later if needed.

jeffrey-xiao · 2019-08-19T19:02:29Z

Upstream fix is facebook/rocksdb#5719.

irfansharif · 2020-05-13T23:13:57Z

@petermattis: can this be closed out?

jeffrey-xiao added the A-kv Anything in KV that doesn't belong in a more specific category. label Aug 12, 2019

tbg mentioned this issue Aug 13, 2019

storage: avoid RaftGroupDeletedError from RHS in splitTrigger #39571

Merged

tbg mentioned this issue Aug 13, 2019

Revert "storage: build SSTs from KV_BATCH snapshot" #39619

Merged

jeffrey-xiao self-assigned this Aug 13, 2019

tbg mentioned this issue Aug 14, 2019

closedts: shorten target_duration from 30s to 5s #39643

Merged

petermattis mentioned this issue Aug 14, 2019

Fix range deletion tombstone ingestion with global seqno cockroachdb/rocksdb#43

Merged

jeffrey-xiao mentioned this issue Aug 15, 2019

storage: reintroduce building SSTs from KV_BATCH snapshot #39689

Merged

jeffrey-xiao closed this as completed Aug 15, 2019

tbg reopened this Aug 15, 2019

tbg assigned ajkr Aug 15, 2019

ajkr removed their assignment Aug 15, 2019

awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 30, 2019

irfansharif closed this as completed Jun 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: Raft state corruption when receiving snapshot #39604

storage: Raft state corruption when receiving snapshot #39604

jeffrey-xiao commented Aug 12, 2019

tbg commented Aug 12, 2019

tbg commented Aug 12, 2019 •

edited

Loading

jeffrey-xiao commented Aug 13, 2019 •

edited

Loading

tbg commented Aug 13, 2019

ajkr commented Aug 13, 2019 •

edited

Loading

ajkr commented Aug 13, 2019

ajkr commented Aug 14, 2019 •

edited

Loading

petermattis commented Aug 14, 2019

ajkr commented Aug 14, 2019 •

edited

Loading

ajkr commented Aug 14, 2019 •

edited

Loading

petermattis commented Aug 15, 2019

jeffrey-xiao commented Aug 15, 2019

tbg commented Aug 15, 2019

ajkr commented Aug 15, 2019

jeffrey-xiao commented Aug 19, 2019

irfansharif commented May 13, 2020

storage: Raft state corruption when receiving snapshot #39604

storage: Raft state corruption when receiving snapshot #39604

Comments

jeffrey-xiao commented Aug 12, 2019

tbg commented Aug 12, 2019

tbg commented Aug 12, 2019 • edited Loading

jeffrey-xiao commented Aug 13, 2019 • edited Loading

tbg commented Aug 13, 2019

ajkr commented Aug 13, 2019 • edited Loading

ajkr commented Aug 13, 2019

ajkr commented Aug 14, 2019 • edited Loading

petermattis commented Aug 14, 2019

ajkr commented Aug 14, 2019 • edited Loading

ajkr commented Aug 14, 2019 • edited Loading

petermattis commented Aug 15, 2019

jeffrey-xiao commented Aug 15, 2019

tbg commented Aug 15, 2019

ajkr commented Aug 15, 2019

jeffrey-xiao commented Aug 19, 2019

irfansharif commented May 13, 2020

tbg commented Aug 12, 2019 •

edited

Loading

jeffrey-xiao commented Aug 13, 2019 •

edited

Loading

ajkr commented Aug 13, 2019 •

edited

Loading

ajkr commented Aug 14, 2019 •

edited

Loading

ajkr commented Aug 14, 2019 •

edited

Loading

ajkr commented Aug 14, 2019 •

edited

Loading