Fix IngestExternalFile overlapping check #5649
Conversation
Force-pushed from ad702ca to 111e91f
Force-pushed from c8774be to 4cad896
This looks good to me. @ajkr do you mind taking a look as well?
Previously, the end key of a range deletion tombstone was considered exclusive for the purposes of deletion, but considered inclusive when checking if two SSTables overlap. For example, an SSTable with a range deletion tombstone [a, b) would be considered overlapping with an SSTable with a range deletion tombstone [b, c). This commit fixes this check.
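As a minimal sketch of the intended semantics only (the actual RocksDB change works on internal keys via `SerializeEndKey()` and `sstableKeyCompare()` rather than raw user keys), the overlap test should treat a boundary derived from a range tombstone end key as exclusive. `FileRange` and its fields below are hypothetical stand-ins, not RocksDB types:

```cpp
// Hedged sketch: FilesOverlap() treats a range tombstone's end key as
// exclusive, so files covering [a, b) and [b, c) are reported as
// non-overlapping. FileRange is a hypothetical illustration type.
#include <string>

struct FileRange {
  std::string smallest_user_key;  // inclusive lower bound
  std::string largest_user_key;   // exclusive if derived from a tombstone end key
  bool largest_is_exclusive;
};

bool FilesOverlap(const FileRange& a, const FileRange& b) {
  // Does a end strictly before b begins?
  if (a.largest_user_key < b.smallest_user_key ||
      (a.largest_is_exclusive && a.largest_user_key == b.smallest_user_key)) {
    return false;
  }
  // Does b end strictly before a begins?
  if (b.largest_user_key < a.smallest_user_key ||
      (b.largest_is_exclusive && b.largest_user_key == a.smallest_user_key)) {
    return false;
  }
  return true;
}
```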
Force-pushed from 4cad896 to f24a7f7
Looks great, I can't think of any issues.
@anand1976 would you be interested in reviewing? Right now multi-file ingestion is unnecessarily rejecting certain inputs as "overlapping", while in fact they are non-overlapping once the exclusiveness of range tombstone end keys is considered.
38932: storage: build SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

Implements the SST snapshot strategy discussed in #16954 and partially implemented in #25134 and #38873, but only has the logic on the receiver side for ease of testing and compatibility. This PR also handles the complications of subsumed replicas that are not fully contained by the current replica.

The maximum number of SSTs created using this strategy is 4 + SR + 2, where SR is the number of subsumed replicas.

- Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys).
- One SST is constructed for the unreplicated range-id local keys.
- One SST is constructed for every subsumed replica to clear the range-id local keys. These SSTs consist of one range deletion tombstone and one `RaftTombstone` key.
- A maximum of two SSTs for all subsumed replicas to account for the case of not fully contained subsumed replicas. Note that currently, subsumed replicas can have keys right of the current replica, but not left of it, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.

This number can be further reduced to 3 + SR if we pass the file handles and sst writers from the receiving step to the application step. We can combine the SSTs of the unreplicated range id and replicated id, and the range-local SSTs of the subsumed replicas with the data SSTs of the subsumed replicas. We probably don't want to do this optimization since we'll have to undo it if we start constructing the SSTs from the sender or start chunking large SSTs into smaller SSTs.

Blocked by facebook/rocksdb#5649.

# Test Plan

- [x] Testing knob to inspect SSTs before ingestion. Ensure that expected SSTs for subsumed replicas are ingested.
- [x] Unit tests for `SSTSnapshotStorage`.

# Metrics and Evaluation

One way to evaluate this change is the following steps:

1. Set up a 3 node cluster.
2. Set the default Raft log truncation threshold to some low constant:
   ```go
   defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
       "COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 128<<10 /* 128 KB */)
   ```
3. Set `range_min_bytes` to 0 and `range_max_bytes` to some large number.
4. Increase `kv.snapshot_recovery.max_rate` and `kv.snapshot_rebalance.max_rate` to some large number.
5. Disable load-based splitting.
6. Stop node 2.
7. Run an insert-heavy workload (kv0) on the cluster.
8. Start node 2.
9. Time how long it takes for node 2 to have all the ranges.

Roachtest: https://gist.github.com/jeffrey-xiao/e69fcad04968822d603f6807ca77ef3b

We can have two independent variables:

1. Fixed total data size (4000000 ops; ~3.81 GiB), variable number of splits
   - 32 splits (~121 MiB ranges)
   - 64 splits (~61.0 MiB ranges)
   - 128 splits (~31.2 MiB ranges)
   - 256 splits (~15.7 MiB ranges)
   - 512 splits (~7.9 MiB ranges)
   - 1024 splits (~3.9 MiB ranges)
2. Fixed number of splits (32), variable total data size
   - 125000 (~3.7 MiB ranges)
   - 250000 (~7.5 MiB ranges)
   - 500000 (~15 MiB ranges)
   - 1000000 (~30 MiB ranges)
   - 2000000 (~60 MiB ranges)
   - 4000000 (~121 MiB ranges)

# Fsync Chunk Size

The size of the SST chunk that we write before fsync-ing impacts how fast node 2 has all the ranges. I've experimented with 32 splits and a median range size of 121 MB: no fsync-ing (~27s recovery), fsync-ing in 8 MB chunks (~30s recovery), fsync-ing in 2 MB chunks (~40s recovery), and fsync-ing in 256 KB chunks (~42s recovery).

The default bulk sst sync rate is 2 MB and #20352 sets `bytes_per_sync` to 512 KB, so something between those options is probably good. The reason we would want to fsync is to prevent the OS from accumulating such a large buffer that it blocks unrelated small/fast writes for a long time when it flushes.

# Impact on Foreground Traffic

For testing the impact on foreground traffic, I ran kv0 on a four node cluster with the merge queue and split queue disabled, starting with a constant number of splits. After 5 minutes, I decommissioned node 1 so its replicas would drain to other nodes using snapshots.

Roachtest: https://gist.github.com/jeffrey-xiao/5d9443a37b0929884aca927f9c320b6c

**Average Range Size of 3 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398633-41a2bb00-b547-11e9-9e3d-747ee724943b.png)
- [After](https://user-images.githubusercontent.com/8853434/62398634-41a2bb00-b547-11e9-85e7-445b7989d173.png)

**Average Range Size of 32 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398631-410a2480-b547-11e9-9019-86d3bd2e6f73.png)
- [After](https://user-images.githubusercontent.com/8853434/62398632-410a2480-b547-11e9-9513-8763e132e76b.png)

**Average Range Size of 128 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398558-15873a00-b547-11e9-8ab6-2e8e9bae658c.png)
- [After](https://user-images.githubusercontent.com/8853434/62398559-15873a00-b547-11e9-9c72-b3e90fce1acc.png)

We see p99 latency wins for larger range sizes and comparable performance for smaller range sizes.

Release note (performance improvement): Snapshots sent between replicas are now applied more performantly and use less memory.

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
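The chunked-fsync pattern described under "Fsync Chunk Size" in the quoted description above can be illustrated with a minimal sketch using plain POSIX file I/O. This is only an assumption-laden illustration, not the CockroachDB or RocksDB implementation; the function name, path handling, and chunk size are placeholders:

```cpp
// Hedged sketch: write a payload in fixed-size chunks and fsync after each
// chunk, so the OS never accumulates a large dirty buffer that could stall
// unrelated small writes when it flushes.
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <vector>

bool WriteWithChunkedFsync(const char* path, const std::vector<char>& data,
                           size_t chunk_bytes /* e.g. 2 << 20 for 2 MB */) {
  int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  size_t offset = 0;
  size_t written_since_sync = 0;
  while (offset < data.size()) {
    // Write at most up to the next chunk boundary.
    size_t n = std::min(chunk_bytes - written_since_sync, data.size() - offset);
    ssize_t w = ::write(fd, data.data() + offset, n);
    if (w < 0) { ::close(fd); return false; }
    offset += static_cast<size_t>(w);
    written_since_sync += static_cast<size_t>(w);
    if (written_since_sync >= chunk_bytes) {
      ::fsync(fd);  // bound the amount of un-synced data
      written_since_sync = 0;
    }
  }
  ::fsync(fd);  // final sync for the tail of the file
  ::close(fd);
  return true;
}
```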
Looks great! The exclusivity of the end key is encapsulated in the internal key returned by `SerializeEndKey()`, and `sstableKeyCompare()` hides the comparison details.
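For readers unfamiliar with those helpers, here is a simplified conceptual model of what "encapsulating exclusivity in the boundary key" means. The `BoundaryKey` type and `CompareBoundary()` function are assumptions for illustration only and do not reflect RocksDB's actual internal-key encoding or function signatures:

```cpp
// Hedged, simplified model: a boundary derived from a tombstone end key
// carries an "exclusive" flag, and the comparator orders it just before any
// real key with the same user key, so callers never reason about exclusivity.
#include <string>

struct BoundaryKey {
  std::string user_key;
  bool exclusive = false;  // true when derived from a range tombstone end key
};

// Returns <0, 0, or >0 like a comparator. An exclusive boundary at key k
// sorts before an inclusive key at k, so [a, b) does not overlap [b, c).
int CompareBoundary(const BoundaryKey& x, const BoundaryKey& y) {
  int c = x.user_key.compare(y.user_key);
  if (c != 0) return c;
  if (x.exclusive == y.exclusive) return 0;
  return x.exclusive ? -1 : 1;
}
```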
@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@anand1976 merged this pull request in d61d450.
Summary: Previously, the end key of a range deletion tombstone was considered exclusive for the purposes of deletion, but considered inclusive when checking if two SSTables overlap. For example, an SSTable with a range deletion tombstone [a, b) would be considered overlapping with an SSTable with a range deletion tombstone [b, c). This commit fixes this check.

Pull Request resolved: facebook#5649

Differential Revision: D16808765

Pulled By: anand1976

fbshipit-source-id: 5c7ad1c027e4f778d35070e5dae1b8e6037e0d68