backupccl: change default restore roachtest configuration #92699

msbutler · 2022-11-29T22:33:49Z

Our current restore roachtest suite should be updated to better reflect customer workloads/topologies. I propose creating a default topology/workload and refactoring our existing tests to be more intentional in how they branch from the default configuration. Ideally, each test that deviates from the default configuration should explicitly test how this deviation affects performance. The new default configuration is described in detail here.

This issue will track work to:

- create a new roachtest with the default configuration
- modify restore2TB/nodes=10 to use the new default workload
- modify restore2TB/nodes=32 to use the new default workload
- modify restore2TB/nodes=6/cpus=16/pd-volume=2500GB to use the default workload, 4 nodes, and the default pd-ssd size

Epic CRDB-20915

Jira issue: CRDB-21924


blathers-crl · 2022-11-29T22:33:51Z

cc @cockroachdb/disaster-recovery

This patch introduces a new framework for writing restore roachtests that minimizes code duplication and leverages our new backup fixture organization. The framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and introduces two new roachtests:
- restore/nodes=4: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS, restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/gce: same config as above, run on GCE.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None

This patch introduces a new framework for writing restore roachtests that minimizes code duplication and leverages our new backup fixture organization. The framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and introduces two new roachtests:
- restore/nodes=4: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS, restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/gce: same config as above, run on GCE.

Notice that this patch also introduces a new naming convention for restore tests. The default test is named `restore/nodes=4`, and each test that deviates from the default config highlights the deviation in its name. For example, `restore/gce` only switches the cloud provider and holds all other variables constant; thus only 'gce' is needed in the name.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None

This patch introduces a new framework for writing restore roachtests that minimizes code duplication and leverages our new backup fixture organization. The framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and introduces 3 new roachtests:
- restore/tpce/400GB: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS, restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/tpce/400GB/gce: same config as above, run on GCE.
- restore/tpce/8TB/nodes=10: the big one!

Notice that this patch also introduces a new naming convention for restore tests. The default test is named `restore/tpce/400GB` and only contains the basic workload. Every other test name contains the workload and any specs that deviate from the default config. For example, `restore/tpce/400GB/gce` only switches the cloud provider and holds all other variables constant; thus only the workload and 'gce' are needed in the name.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None

enforce naming convention

94143: backupccl: introduce new restore roachtest framework r=lidorcarmel a=msbutler

This patch introduces a new framework for writing restore roachtests that minimizes code duplication and leverages our new backup fixture organization. The framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and introduces 3 new roachtests:
- restore/tpce/400GB: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS, restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/tpce/400GB/gce: same config as above, run on GCE.
- restore/tpce/8TB/nodes=10: the big one!

Notice that this patch also introduces a new naming convention for restore tests. The default test is named `restore/tpce/400GB` and only contains the basic workload. Every other test name contains the workload and any specs that deviate from the default config. For example, `restore/tpce/400GB/gce` only switches the cloud provider and holds all other variables constant; thus only the workload and 'gce' are needed in the name.

Future patches will add more tests that use this framework.

Informs #92699

Release note: None

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
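To make the naming convention concrete, here is a minimal sketch of how a test name could be derived from a spec by listing only the fields that differ from the default. It is written in Python for brevity and is not the framework's actual Go code; `RestoreSpec`, its field names, and `restore_test_name` are hypothetical, and the defaults are taken from the default configuration described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoreSpec:
    """Hypothetical spec; defaults mirror the default configuration above."""
    workload: str = "tpce/400GB"
    cloud: str = "aws"        # the default config uses EBS volumes
    nodes: int = 4
    cpus: int = 8
    volume_gb: int = 1000


DEFAULT = RestoreSpec()


def restore_test_name(spec: RestoreSpec) -> str:
    """Build a name from the workload plus any deviation from DEFAULT."""
    parts = ["restore", spec.workload]
    if spec.cloud != DEFAULT.cloud:
        parts.append(spec.cloud)
    for key in ("nodes", "cpus", "volume_gb"):
        value = getattr(spec, key)
        if value != getattr(DEFAULT, key):
            parts.append(f"{key}={value}")
    return "/".join(parts)


print(restore_test_name(RestoreSpec()))
print(restore_test_name(RestoreSpec(cloud="gce")))
print(restore_test_name(RestoreSpec(workload="tpce/8TB", nodes=10)))
```

Deriving the name from the spec, rather than typing it by hand, is one way a convention like this could be enforced: a test cannot deviate from the default without the deviation showing up in its name.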

97587: allocator: check IO overload on lease transfer r=andrewbaptist a=kvoli

Previously, the allocator would return lease transfer targets without considering the IO overload of the stores involved. When leases transferred to IO overloaded stores, service latency tended to degrade.

This commit adds IO overload checks prior to lease transfers. The checks are similar to the IO overload checks for allocating replicas in #97142.

The checks work by comparing a candidate store's IO overload score against `kv.allocator.lease_io_overload_threshold` (default 0.5) and the mean of the other candidates. If the candidate's score is equal to or greater than both of these values, the store is considered IO overloaded.

The current leaseholder has to meet a higher bar to be considered IO overloaded: it must have an IO overload score greater than or equal to `kv.allocator.lease_shed_io_overload_threshold` (default 0.9).

`kv.allocator.lease_io_overload_threshold_enforcement` controls the action taken when a candidate store for a lease transfer is IO overloaded:
- `ignore`: ignore IO overload scores entirely during lease transfers (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't IO overloaded (existing leases on IO overloaded stores are left as is);
- `shed`: actively shed leases from IO overloaded stores to less IO overloaded stores (this is a superset of `block_transfer_to`).

The default is `block_transfer_to`.

This commit also updates the existing replica IO overload checks to be prefixed with `Replica`, to avoid confusion between lease and replica IO overload checks.

Resolves: #96508

Release note (ops change): Range leases will no longer be transferred to stores which are IO overloaded.

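The lease transfer check described in 97587 can be sketched roughly as follows. This is an illustration in Python, not the allocator's Go implementation: the function names and the per-store score dictionary are hypothetical, the mean is taken over all candidates passed in (an assumption about what "mean of other candidates" covers), and the leaseholder check uses only the shed threshold.

```python
from statistics import mean

# Defaults quoted in the commit message.
LEASE_IO_OVERLOAD_THRESHOLD = 0.5       # kv.allocator.lease_io_overload_threshold
LEASE_SHED_IO_OVERLOAD_THRESHOLD = 0.9  # kv.allocator.lease_shed_io_overload_threshold


def is_overloaded(score, candidate_mean, threshold=LEASE_IO_OVERLOAD_THRESHOLD):
    """A candidate is overloaded only if it meets BOTH the threshold and the mean."""
    return score >= threshold and score >= candidate_mean


def transfer_targets(scores, enforcement="block_transfer_to"):
    """Return the store IDs that may receive a lease, given {store: score}."""
    if not scores:
        return []
    if enforcement == "ignore":
        return sorted(scores)  # IO overload is not considered at all
    candidate_mean = mean(scores.values())
    # Both block_transfer_to and shed exclude overloaded candidates.
    return sorted(
        store for store, score in scores.items()
        if not is_overloaded(score, candidate_mean)
    )


def should_shed(leaseholder_score, enforcement="block_transfer_to"):
    """Only the shed mode moves leases off an overloaded leaseholder."""
    return (enforcement == "shed"
            and leaseholder_score >= LEASE_SHED_IO_OVERLOAD_THRESHOLD)


scores = {"s1": 0.1, "s2": 0.2, "s3": 0.8}
print(transfer_targets(scores))
print(transfer_targets(scores, enforcement="ignore"))
```

Requiring a score to meet both the absolute threshold and the mean means that a store is not excluded merely for being the busiest in an otherwise idle cluster, nor when every store is equally loaded below the threshold.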
98041: backupccl: fix off by one index in fileSSTSink file extension r=rhu713 a=rhu713

Currently, the logic in fileSSTSink that extends the last flushed file does not trigger if there is only one flushed file. This failure to extend the first flushed file can result in file entries in the backup manifest with duplicate start keys. For example, if the first export response written to the sink contains partial entries of a single key `a`, then the span of the first file will be `a-a`, and the span of the subsequent file will always be `a-<end_key>`. The presence of these duplicate start keys breaks the encoding of the external manifest files list SST, as the file path + start key combination in the manifest is assumed to be unique.

Fixes #97953

Release note: None

98072: backupccl: replace restore2TB and restoretpccInc tests r=lidorcarmel a=msbutler

This patch removes the restore2TB* roachtests, which ran a 2TB bank restore to benchmark restore performance across a few hardware configurations. This patch also replaces the `restoreTPCCInc/nodes=10` test, which tested our ability to handle a backup with a long chain.

This patch also adds:
1. `restore/tpce/400GB/aws/nodes=4/cpus=16` to measure how per-node throughput scales when the per-node vCPU count doubles relative to default.
2. `restore/tpce/400GB/aws/nodes=8/cpus=8` to measure how per-node throughput scales when the number of nodes doubles relative to default.
3. `restore/tpce/400GB/aws/backupsIncluded=48/nodes=4/cpus=8` to measure restore reliability and performance on a backup chain of length 48 relative to default.

A future patch will update the fixtures used in the restore node shutdown scripts, and add more perf based tests.

Fixes #92699

Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>

msbutler added the C-enhancement and T-disaster-recovery labels Nov 29, 2022

msbutler self-assigned this Nov 29, 2022

blathers-crl bot added the A-disaster-recovery label Nov 29, 2022

exalate-issue-sync bot removed the A-disaster-recovery label Nov 30, 2022

msbutler mentioned this issue Dec 22, 2022

backupccl: introduce new restore roachtest framework #94143

Merged

msbutler mentioned this issue Mar 6, 2023

backupccl: replace restore2TB and restoretpccInc tests #98072

Merged

craig bot closed this as completed in b588f1f Mar 7, 2023

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024
