
storage: Raft state corruption when receiving snapshot #39604

Closed
jeffrey-xiao opened this issue Aug 12, 2019 · 16 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@jeffrey-xiao
Contributor

Currently on master, acceptance/bank/cluster-recovery would occasionally fail after (*Replica).applySnapshot’s call to r.store.engine.IngestExternalFiles.

After adding some logging statements around this area, it looks like the old Raft entries are not being deleted, which causes the last index to diverge from the truncated index:

I190810 22:37:23.147087 398 storage/replica_raftstorage.go:931  [n2,s2,r19/2:/Table/2{3-4}] [jeffreyxiao][BEFORE INGESTION] LAST INDEX: 31; LAST TERM: 0; TRUNCATED STATE: index:31 term:7
W190810 22:37:23.147167 398 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190810 22:37:23.152563 17 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190810 22:37:23.153370 398 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190810 22:37:23.156833 398 storage/engine/rocksdb.go:116  [rocksdb] [db/version_set.cc:3086] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
I190810 22:37:23.162646 398 storage/replica_raftstorage.go:943  [n2,s2,r19/2:/Table/2{3-4}] [jeffreyxiao][AFTER INGESTION] LAST INDEX: 20; TRUNCATED STATE: {31 7}

The deletion of the Raft entries should be handled by the range deletion of the unreplicated range-id SST:

// Clearing the unreplicated state.
unreplicatedPrefixKey := keys.MakeRangeIDUnreplicatedPrefix(r.RangeID)
unreplicatedStart := engine.MakeMVCCMetadataKey(unreplicatedPrefixKey)
unreplicatedEnd := engine.MakeMVCCMetadataKey(unreplicatedPrefixKey.PrefixEnd())
if err = unreplicatedSST.ClearRange(unreplicatedStart, unreplicatedEnd); err != nil {
	return errors.Wrapf(err, "error clearing range of unreplicated SST writer")
}

However, when I replace this range deletion tombstone with point deletions, I'm unable to reproduce the failure on acceptance/bank/cluster-recovery (~80 successful runs), which leads me to believe that something strange is happening with the ingestion of range deletion tombstones.
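For concreteness, the point-deletion variant of the experiment looked roughly like the following sketch. It assumes the SST writer exposes a point Clear method and that oldTruncatedIndex/oldLastIndex are read from the existing replica state before ingestion; those names are illustrative, not the exact API.

// Rough sketch of the experiment: write point deletions for the Raft log
// entries known to exist instead of a single range deletion tombstone.
// unreplicatedSST.Clear, oldTruncatedIndex, and oldLastIndex are assumed
// names used here for illustration only.
for idx := oldTruncatedIndex + 1; idx <= oldLastIndex; idx++ {
	logKey := engine.MakeMVCCMetadataKey(keys.RaftLogKey(r.RangeID, idx))
	if err := unreplicatedSST.Clear(logKey); err != nil {
		return errors.Wrapf(err, "error clearing Raft log entry %d", idx)
	}
}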

CC @nvanbenschoten @tbg @ajkr

@tbg
Member

tbg commented Aug 12, 2019

👍 what are you trying next? I'd disable the deletion of the ingested SSTs (perhaps by manually creating an additional hardlink) and also create a Rocks checkpoint (CreateCheckpoint) before the call to IngestExternalFiles. With any luck, this will give you a repro containing a checkpoint into which you can ingest your SSTs to reproduce the badness, at which point it's just a matter of tracking down the bug.

@tbg
Member

tbg commented Aug 12, 2019

You might also have luck w/ a targeted failpoint in an ingestion snap unit test where you run pid := os.Getpid(); p, _ := os.FindProcess(pid); p.Kill() right after IngestExternalFiles. This might give you a very fast repro, and then what I said in the previous comment is trivial.
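Combined with the checkpoint suggestion above, the failpoint could look roughly like this (a sketch only; it assumes the engine exposes the CreateCheckpoint method mentioned earlier, and the checkpoint directory is arbitrary):

// Test-only failpoint sketch, dropped in right after the IngestExternalFiles
// call. Needs "os" and "path/filepath" imports; CreateCheckpoint on the
// engine is assumed per the suggestion above.
if err := r.store.Engine().CreateCheckpoint(filepath.Join(os.TempDir(), "ingest-checkpoint")); err != nil {
	panic(err)
}
// Kill the process hard so the on-disk state is frozen exactly as it was at
// ingestion time.
if p, err := os.FindProcess(os.Getpid()); err == nil {
	_ = p.Kill()
}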

@jeffrey-xiao jeffrey-xiao added the A-kv Anything in KV that doesn't belong in a more specific category. label Aug 12, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 13, 2019
This reverts commit b320ff5.

In the above commit, we are starting both to ingest multiple SSTs at the
same time, and additionally these SSTs contain range deletion
tombstones. Both are firsts, and it turns out that there are some kinks
to work out. The commit causes quite a number of failures, so reduce
churn while we do so.

See cockroachdb#39604.

Release note: None
craig bot pushed a commit that referenced this issue Aug 13, 2019
39619: Revert "storage: build SSTs from KV_BATCH snapshot" r=nvanbenschoten a=tbg

This reverts commit b320ff5.

In the above commit, we are starting both to ingest multiple SSTs at the
same time, and additionally these SSTs contain range deletion
tombstones. Both are firsts, and it turns out that there are some kinks
to work out. The commit causes quite a number of failures, so reduce
churn while we do so.

See #39604.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@jeffrey-xiao
Contributor Author

jeffrey-xiao commented Aug 13, 2019

I (hopefully) found a pretty minimal unit test reproduction of this issue.

Changing TestStoreRangeMergeRaftSnapshot to use on-disk RocksDB engines instead of in-memory engines causes the last index to not equal the truncated index after ingestion.

Running cockroach debug keys on the checkpoint of the engine immediately after ingestion shows that there are no Raft entries for the range in question. However, stateLoader.LoadLastIndex reads a Raft entry and therefore returns the incorrect last index.

Reproduction diff:

diff --git a/pkg/storage/client_merge_test.go b/pkg/storage/client_merge_test.go
index 98ddd75e01..0404646c49 100644
--- a/pkg/storage/client_merge_test.go
+++ b/pkg/storage/client_merge_test.go
@@ -3154,6 +3154,20 @@ func TestStoreRangeMergeRaftSnapshot(t *testing.T) {
 		// account.
 		startWithSingleRange: true,
 	}
+	for i := 0; i < 3; i++ {
+		tempDir, _ := testutils.TempDir(t)
+		// defer cleanup()
+		cache := engine.NewRocksDBCache(1 << 20)
+		defer cache.Release()
+		eng, err := engine.NewRocksDB(engine.RocksDBConfig{
+			Dir: tempDir,
+		}, cache)
+		if err != nil {
+			t.Fatal(err)
+		}
+		defer eng.Close()
+		mtc.engines = append(mtc.engines, eng)
+	}
 	mtc.Start(t, 3)
 	defer mtc.Stop()
 	store0, store2 := mtc.Store(0), mtc.Store(2)
diff --git a/pkg/storage/replica_raftstorage.go b/pkg/storage/replica_raftstorage.go
index cded380c95..f30fe06a3c 100644
--- a/pkg/storage/replica_raftstorage.go
+++ b/pkg/storage/replica_raftstorage.go
@@ -932,6 +932,14 @@ func (r *Replica) applySnapshot(
 	}
 	stats.ingestion = timeutil.Now()
 
+	lastIndex, err := r.mu.stateLoader.LoadLastIndex(ctx, r.store.Engine())
+	if err != nil {
+		return err
+	}
+	if s.RaftAppliedIndex != lastIndex {
+		log.Fatalf(ctx, "applied index and last index does not match")
+	}
+
 	// The on-disk state is now committed, but the corresponding in-memory state
 	// has not yet been updated. Any errors past this point must therefore be
 	// treated as fatal.

@tbg
Member

tbg commented Aug 13, 2019

Hmm, sounds weird. Does the problem disappear when you flush the memtable right after the ingestion (before checking LoadLastIndex)?
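(For concreteness, in the repro diff above that would amount to something like the following sketch; the Flush method on the engine is an assumption.)

// Sketch: force a memtable flush between ingestion and the last-index check.
// Engine().Flush() is assumed here to flush the memtable to an SST.
if err := r.store.Engine().Flush(); err != nil {
	return err
}
lastIndex, err := r.mu.stateLoader.LoadLastIndex(ctx, r.store.Engine())
if err != nil {
	return err
}
if s.RaftAppliedIndex != lastIndex {
	log.Fatalf(ctx, "applied index %d and last index %d do not match", s.RaftAppliedIndex, lastIndex)
}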

@jeffrey-xiao jeffrey-xiao self-assigned this Aug 13, 2019
@ajkr
Contributor

ajkr commented Aug 13, 2019

Maybe. I tried the repro and found that the unexpected key is in the WAL and not yet flushed to an SST. I couldn't find a range tombstone in any sstable that looks like it covers it, however.

unexpected key: 0x01698a757266746c000000000000001000 (/Local/RangeID/2/u/RaftLog/logIndex:16)

found in WAL:

./cockroach debug rocksdb dump_wal --walfile=/tmp/TestStoreRangeMergeRaftSnapshot435829139/000003.log | grep -i 0x01698a757266746c000000000000001000

96,1,731,9303,PUT(0) : 0x01698A757266746C000000000000001000 
359,15,357,31944,DELETE(0) : 0x01698A757266746C000000000000000B00 DELETE(0) : 0x01698A757266746C000000000000000C00 DELETE(0) : 0x01698A757266746C000000000000000D00 DELETE(0) : 0x01698A757266746C000000000000000E00 DELETE(0) : 0x01698A757266746C000000000000000F00 DELETE(0) : 0x01698A757266746C000000000000001000 DELETE(0) : 0x01698A757266746C000000000000001100 DELETE(0) : 0x01698A757266746C000000000000001200 DELETE(0) : 0x01698A757266746C000000000000001300 DELETE(0) : 0x01698A757266746C000000000000001400 DELETE(0) : 0x01698A757266746C000000000000001500 DELETE(0) : 0x01698A757266746C000000000000001600 DELETE(0) : 0x01698A757266746C000000000000001700 PUT(0) : 0x01698A757266747400 PUT(0) : 0x01698A727261736B00 

range tombstones:

$ awk '/Data Block/{rt=0}; rt && /HEX/ ; /Range deletions:/ {rt=1};' /tmp/TestStoreRangeMergeRaftSnapshot435829139/*dump.txt
    HEX    0169897200: 0169897300
    HEX    016B12000100: 016B12FFFF000100
    HEX    0200: FFFF00
    HEX    0169897500: 0169897600

@ajkr
Contributor

ajkr commented Aug 13, 2019

Although, considering there's a point deletion for this key in the WAL, I couldn't guess why it shows up in LoadLastIndex()'s scan.

@ajkr
Contributor

ajkr commented Aug 14, 2019

Root cause

We're not passing the global seqnum when reading the range tombstone block so they're all treated as having seqnum zero. During ingestion preparation the range tombstone meta-block gets accessed and (probably accidentally) put in block cache. At that time the global seqno has not been determined so the cached block has no global seqno. Later after the file's ingested and we read its range tombstone meta-block, it'll be retrieved from cache. That leads us to obtain a range tombstone meta-block with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all just zero, so the tombstones cover nothing.

Short-term plan

Ideally Jeffrey can re-land his feature before his internship ends (end of next week) in order to (a) finish it, and (b) see if it turns up any other fun issues. To do this, we are willing to apply some quick and specific fixes to our RocksDB fork even if they aren't easily upstreamable.

The original plan is to pass a flag all the way from file ingestion preparation to ReadRangeDelBlock() (https://github.com/facebook/rocksdb/blob/90cd6c2bb17a54e97c32856254fd666c9cab8dc5/table/block_based/block_based_table_reader.cc#L1398-L1402), indicating whether to set ReadOptions::fill_cache. For ingestion preparation only it'll be false; otherwise (and by default), it'll be true. (citation: @nvanbenschoten)

Alternatively, rather than plumbing through a flag to not cache the range-del block, I wonder if that block should ever be cached for an sstable which has global_seqno == 0 and the range tombstones all have a sequence number of zero. (citation: @petermattis)

Longer-term plan

We could move the range tombstone meta-block read during Open() to only happen if prefetch_index_and_filter_in_cache=true (that variable has already been repurposed to apply to all meta-blocks, e.g., the compression dictionary block). Then the meta-block will have to be lazily loaded in NewRangeTombstoneIterator() if not prefetched. But at that point we have the ability to pass it ReadOptions that tell it not to fill the cache: https://github.com/facebook/rocksdb/blob/90cd6c2bb17a54e97c32856254fd666c9cab8dc5/db/external_sst_file_ingestion_job.cc#L409-L414. The final piece of the change would be passing prefetch_index_and_filter_in_cache=false to NewTableReader().

The objective here is for the caller (ingestion preparation) to be able to prevent caching of all blocks by (1) specifying prefetch_index_and_filter_in_cache=false when they open the table, and (2) specifying ReadOptions::fill_cache=false when they read from it.

@petermattis
Collaborator

Root cause

As discussed offline, the original sin is including global_seqno in the Block structure which is cached. From what I've looked at so far, I'd prefer a long term option (and perhaps a short-term option), to remove Block::global_seqno_ and populate BlockIter::global_seqno_ from the table reader's value.

I believe the caching of the range-del block happens when ingestion determines the boundaries for the table. The ingestion code opens a range-del iterator on the sstable and then finds the lower and upper bounds. Note that it also finds the lower and upper bounds for point operations. So we're probably caching the first and last data block in an sstable as well. It should be straightforward to test this hypothesis. If it is true, there is a more serious bug here which could already be affecting CockroachDB. A key in the first or last data block of an ingested sstable which is overwriting an existing key will actually appear older than the existing key. @ajkr or @jeffrey-xiao can you verify this hypothesis?

@ajkr
Contributor

ajkr commented Aug 14, 2019

The ingestion code opens a range-del iterator on the sstable and then finds the lower and upper bounds. Note that it also finds the lower and upper bounds for point operations. So we're probably caching the first and last data block in an sstable as well. It should be straightforward to test this hypothesis. If it is true, there is a more serious bug here which could already be affecting CockroachDB. A key in the first or last data block of an ingested sstable which is overwriting an existing key will actually appear older than the existing key. @ajkr or @jeffrey-xiao can you verify this hypothesis?

The logic that reads the first/last point keys in the sstable during ingestion preparation is here: https://github.com/facebook/rocksdb/blob/4c70cb730614388041b97a31ae2e5addb1279284/db/external_sst_file_ingestion_job.cc#L404-L438. Note it sets ReadOptions::fill_cache=false. By passing those read options to NewIterator(), we prevent it from putting any data blocks it sees into block cache. Although we pass those same read options to NewRangeTombstoneIterator(), that function unfortunately cannot respect ReadOptions::fill_cache=false since the range tombstone meta-block was already put into cache when the table was opened. So the problem scope seems limited to the range tombstone meta-block, at least according to my understanding.
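As a toy model of why ReadOptions::fill_cache=false doesn't help once the block is already cached (illustrative Go only, not RocksDB's API; the names are made up): fill_cache only governs insertion on a miss, not lookup, so whatever was cached at open time keeps being served.

// Toy block cache: fill_cache controls whether a miss populates the cache,
// but a hit always returns whatever was cached earlier.
type cachedBlock struct{ globalSeqno uint64 }

type blockCache map[string]*cachedBlock

func readBlock(c blockCache, key string, fillCache bool, globalSeqno uint64) *cachedBlock {
	if b, ok := c[key]; ok {
		return b // hit: fillCache is irrelevant, the stale block wins
	}
	b := &cachedBlock{globalSeqno: globalSeqno}
	if fillCache {
		c[key] = b
	}
	return b
}

In this model, the open-time read during ingestion preparation is readBlock(c, "rangedel", true, 0), which caches a block with no global seqno; the post-ingestion read readBlock(c, "rangedel", false, 359) then hits the cache and gets that stale block back.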

@ajkr
Contributor

ajkr commented Aug 14, 2019

As discussed offline, the original sin is including global_seqno in the Block structure which is cached. From what I've looked at so far, I'd prefer a long term option (and perhaps a short-term option), to remove Block::global_seqno_ and populate BlockIter::global_seqno_ from the table reader's value.

I like this solution too as having a table-wide, mutable attribute in the Block object seems unnecessary. This solution should allow us to drop the fill_cache=false workaround linked above and let us freely put blocks into block cache during ingestion preparation. Will try it out.

@petermattis
Collaborator

Note it sets ReadOptions::fill_cache=false. By passing those read options to NewIterator(), we prevent it from putting any data blocks it sees into block cache.

Thanks for tracking this down. It is reassuring that exactly this bug was already considered, though unfortunate that the solution didn't apply to the range-del block case.

@jeffrey-xiao
Contributor Author

cockroachdb/rocksdb#43 has a temporary fix for the underlying RocksDB bug.

@tbg
Member

tbg commented Aug 15, 2019

I don't think we should close this issue until we have a non-temporary fix.

@tbg tbg reopened this Aug 15, 2019
@tbg tbg assigned ajkr Aug 15, 2019
craig bot pushed a commit that referenced this issue Aug 15, 2019
39689: storage: reintroduce building SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

The final commit of #38932 was previously reverted due to an underlying bug in RocksDB with ingesting range deletion tombstones with a global seqno. See #39604 for discussion on the bug and cockroachdb/rocksdb#43 for the temporary short-term resolution of the bug.

Release note: None


Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
@ajkr
Contributor

ajkr commented Aug 15, 2019

As discussed offline, the original sin is including global_seqno in the Block structure which is cached. From what I've looked at so far, I'd prefer a long term option (and perhaps a short-term option), to remove Block::global_seqno_ and populate BlockIter::global_seqno_ from the table reader's value.

I like this solution too as having a table-wide, mutable attribute in the Block object seems unnecessary. This solution should allow us to drop the fill_cache=false workaround linked above and let us freely put blocks into block cache during ingestion preparation. Will try it out.

Actually @jeffrey-xiao might try it out. No worries if you run out of time - I can reclaim it later if needed.

@ajkr ajkr removed their assignment Aug 15, 2019
@jeffrey-xiao
Contributor Author

Upstream fix is facebook/rocksdb#5719.

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Aug 30, 2019
@irfansharif
Contributor

@petermattis: can this be closed out?
