backupccl: change default restore roachtest configuration #92699

Closed
1 of 4 tasks
msbutler opened this issue Nov 29, 2022 · 1 comment · Fixed by #98072
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

@msbutler
Collaborator

msbutler commented Nov 29, 2022

Our current restore roachtest suite should be updated to better reflect customer workloads/topologies. I propose creating a default topology/workload and refactoring our existing tests to be more intentional in how they branch from the default configuration. Ideally, each test that deviates from the default configuration should explicitly test how this deviation affects performance. The new default configuration is described in detail here.

This issue will track work to:

  • create a new roachtest with the default configuration
  • modify restore2TB/nodes=10 to use the new default workload
  • modify restore2TB/nodes=32 to use the new default workload
  • modify restore2TB/nodes=6/cpus=16/pd-volume=2500GB to use the default workload, 4 nodes, and the default pd-ssd size

Epic CRDB-20915

Jira issue: CRDB-21924

@msbutler msbutler added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery labels Nov 29, 2022
@msbutler msbutler self-assigned this Nov 29, 2022
@blathers-crl

blathers-crl bot commented Nov 29, 2022

cc @cockroachdb/disaster-recovery

msbutler added a commit to msbutler/cockroach that referenced this issue Dec 22, 2022
This patch introduces a new framework for writing restore roachtests that
minimizes code duplication and leverages our new backup fixture organization. The
framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and
introduces two new roachtests:
- restore/nodes=4: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS,
  restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/gce: same config as above, run on GCE.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Jan 4, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Jan 5, 2023
This patch introduces a new framework for writing restore roachtests that
minimizes code duplication and leverages our new backup fixture organization. The
framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and
introduces two new roachtests:
- restore/nodes=4: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS,
  restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/gce: same config as above, run on GCE.

Notice that this patch also introduces a new naming convention for restore
tests.  The default test is named `restore/nodes=4` and each test which
deviates from the config will highlight the deviation in the name. For example
`restore/gce` only switches the cloud provider and holds all other variables
constant; thus only 'gce' is needed in the name.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Jan 6, 2023
This patch introduces a new framework for writing restore roachtests that
minimizes code duplication and leverages our new backup fixture organization. The
framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and
introduces 3 new roachtests:
- restore/tpce/400GB: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS,
  restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/tpce/400GB/gce: same config as above, run on GCE.
- restore/tpce/8TB/nodes=10: the big one!

Notice that this patch also introduces a new naming convention for restore
tests.  The default test is named `restore/tpce/400GB` and only contains the
basic workload. Each other test name will contain the workload and any specs
which deviate from the default config. For example `restore/tpce/400GB/gce`
only switches the cloud provider and holds all other variables constant; thus
only the workload and 'gce' are needed in the name.

Future patches will add more tests that use this framework.

Informs cockroachdb#92699

Release note: None

enforce naming convention
craig bot pushed a commit that referenced this issue Jan 7, 2023
94143: backupccl: introduce new restore roachtest framework r=lidorcarmel a=msbutler

This patch introduces a new framework for writing restore roachtests that
minimizes code duplication and leverages our new backup fixture organization. The
framework makes it easy to write a new test using a variety of knobs like:
- hardware: cloud provider, disk volume, # of nodes, # of cpus
- backup fixture: workload, workload scale

The patch is the first in an ongoing effort to redo our roachtests, and
introduces 3 new roachtests:
- restore/tpce/400GB: the default configuration: 4 nodes, 8 vCPUs, 1000 GB EBS,
  restore a tpce backup fixture (25,000 customers, around 400 GB).
- restore/tpce/400GB/gce: same config as above, run on GCE.
- restore/tpce/8TB/nodes=10: the big one!

Notice that this patch also introduces a new naming convention for restore
tests.  The default test is named `restore/tpce/400GB` and only contains the
basic workload. Each other test name will contain the workload and any specs
which deviate from the default config. For example `restore/tpce/400GB/gce`
only switches the cloud provider and holds all other variables constant; thus
only the workload and 'gce' are needed in the name.
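
The naming convention above can be sketched roughly as follows. This is only an illustration, not the framework's code: `restoreSpec`, `defaultSpec`, and `testName` are hypothetical names, and the assumption that the default cloud is AWS is inferred from the "1000 GB EBS" default.

```go
package main

import (
	"fmt"
	"strings"
)

// restoreSpec is a hypothetical description of one restore roachtest.
type restoreSpec struct {
	workload string // backup fixture workload, e.g. "tpce"
	size     string // fixture size, e.g. "400GB"
	cloud    string // "aws" or "gce"
	nodes    int
	cpus     int
}

// defaultSpec mirrors the default configuration from the commit message:
// a 400 GB tpce fixture on 4 nodes with 8 vCPUs each (assumed to be AWS).
var defaultSpec = restoreSpec{workload: "tpce", size: "400GB", cloud: "aws", nodes: 4, cpus: 8}

// testName always includes the workload and size, then appends only the
// hardware fields that deviate from defaultSpec.
func testName(s restoreSpec) string {
	parts := []string{"restore", s.workload, s.size}
	if s.cloud != defaultSpec.cloud {
		parts = append(parts, s.cloud)
	}
	if s.nodes != defaultSpec.nodes {
		parts = append(parts, fmt.Sprintf("nodes=%d", s.nodes))
	}
	if s.cpus != defaultSpec.cpus {
		parts = append(parts, fmt.Sprintf("cpus=%d", s.cpus))
	}
	return strings.Join(parts, "/")
}

func main() {
	gce := defaultSpec
	gce.cloud = "gce"

	big := defaultSpec
	big.size, big.nodes = "8TB", 10

	fmt.Println(testName(defaultSpec)) // restore/tpce/400GB
	fmt.Println(testName(gce))         // restore/tpce/400GB/gce
	fmt.Println(testName(big))         // restore/tpce/8TB/nodes=10
}
```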

Future patches will add more tests that use this framework.

Informs #92699

Release note: None

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Mar 7, 2023
97587: allocator: check IO overload on lease transfer r=andrewbaptist a=kvoli

Previously, the allocator would return lease transfer targets without
considering the IO overload of stores involved. When leases would
transfer to the IO overloaded stores, service latency tended to degrade.

This commit adds IO overload checks prior to lease transfers. The IO
overload checks are similar to the IO overload checks for allocating
replicas in #97142.

The checks work by comparing a candidate store's IO overload score against
`kv.allocator.lease_io_overload_threshold` and the mean of other candidates.
If the candidate's score is equal to or greater than both of these values, the
store is considered IO overloaded. The default threshold is 0.5.

The current leaseholder has to meet a higher bar to be considered IO
overloaded. It must have an IO overload score greater than or equal to
`kv.allocator.lease_shed_io_overload_threshold`. The default value is
0.9.
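
A minimal sketch of the two checks described above, not the allocator's actual code. The function names are hypothetical, the thresholds are the defaults quoted in this message, and the mean is assumed to be taken over whatever slice of candidate scores the caller passes in.

```go
package main

import "fmt"

const (
	// Default of kv.allocator.lease_io_overload_threshold.
	leaseIOOverloadThreshold = 0.5
	// Default of kv.allocator.lease_shed_io_overload_threshold.
	leaseShedIOOverloadThreshold = 0.9
)

// mean returns the arithmetic mean of scores, or 0 for an empty slice.
func mean(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

// candidateOverloaded reports whether a lease transfer candidate counts as
// IO overloaded: its score must reach both the threshold and the mean.
func candidateOverloaded(score float64, candidateScores []float64) bool {
	return score >= leaseIOOverloadThreshold && score >= mean(candidateScores)
}

// leaseholderOverloaded applies the higher bar used for the current leaseholder.
func leaseholderOverloaded(score float64) bool {
	return score >= leaseShedIOOverloadThreshold
}

func main() {
	scores := []float64{0.6, 0.8, 0.9}
	// 0.6 is above the 0.5 threshold but below the mean, so it is not overloaded.
	fmt.Println(candidateOverloaded(0.6, scores)) // false
	fmt.Println(candidateOverloaded(0.9, scores)) // true
	fmt.Println(leaseholderOverloaded(0.8))       // false
}
```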

The level of enforcement for IO overload is controlled by
`kv.allocator.lease_io_overload_threshold_enforcement`, which determines the
action taken when a candidate store for a lease transfer is IO overloaded:

- `ignore`: ignore IO overload scores entirely during lease transfers
  (effectively disabling this mechanism);
- `block_transfer_to`: lease transfers only consider stores that aren't
  IO overloaded (existing leases on IO overloaded stores are left as
  is);
- `shed`: actively shed leases from IO overloaded stores to less IO
  overloaded stores (this is a super-set of block_transfer_to).

The default is `block_transfer_to`.
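
The three enforcement levels could be modeled along these lines. This is a hedged sketch: the type and function names are hypothetical and only encode the semantics listed above, including that `shed` is a superset of `block_transfer_to`.

```go
package main

import "fmt"

// enforcement is a hypothetical enum mirroring the three setting values.
type enforcement int

const (
	ignore enforcement = iota
	blockTransferTo
	shed
)

// canTransferTo reports whether a lease may move to a target store.
// Both blockTransferTo and shed refuse IO overloaded targets.
func canTransferTo(level enforcement, targetOverloaded bool) bool {
	if level == ignore {
		return true
	}
	return !targetOverloaded
}

// shouldShed reports whether an IO overloaded leaseholder should actively
// move its leases away; only the shed level does this.
func shouldShed(level enforcement, leaseholderOverloaded bool) bool {
	return level == shed && leaseholderOverloaded
}

func main() {
	fmt.Println(canTransferTo(ignore, true))          // true
	fmt.Println(canTransferTo(blockTransferTo, true)) // false
	fmt.Println(shouldShed(blockTransferTo, true))    // false
	fmt.Println(shouldShed(shed, true))               // true
}
```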

This commit also updates the existing replica IO overload checks to be
prefixed with `Replica`, to avoid confusion between lease and replica
IO overload checks.

Resolves: #96508

Release note (ops change): Range leases will no longer be transferred to
stores which are IO overloaded.

98041: backupccl: fix off by one index in fileSSTSink file extension r=rhu713 a=rhu713

Currently, the logic in fileSSTSink that extends the last flushed file does not trigger if there is only one flushed file. This failure to extend the first flushed file can result in file entries in the backup manifest with duplicate start keys. For example, if the first export response written to the sink contains partial entries of a single key `a`, then the span of the first file will be `a-a`, and the span of the subsequent file will always be `a-<end_key>`. The presence of these duplicate start keys breaks the encoding of the external manifest files list SST, as the file path + start key combination in the manifest is assumed to be unique.
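
The off-by-one can be illustrated with a simplified model. This is not the fileSSTSink code: `file`, `addSpan`, and the `fixed` flag are hypothetical, and the real extension condition likely checks more than the end key.

```go
package main

import "fmt"

// file is a hypothetical stand-in for a backup manifest file entry.
type file struct{ start, end string }

// addSpan records a new span. If the span begins where the last flushed file
// ends, that file is extended instead of adding a new entry. With fixed=false
// the guard skips index 0, so a lone first file is never extended.
func addSpan(files []file, next file, fixed bool) []file {
	last := len(files) - 1
	canExtend := last > 0 // buggy guard: false when exactly one file exists
	if fixed {
		canExtend = last >= 0
	}
	if canExtend && files[last].end == next.start {
		files[last].end = next.end
		return files
	}
	return append(files, next)
}

// hasDuplicateStarts reports whether two entries share a start key.
func hasDuplicateStarts(files []file) bool {
	seen := map[string]bool{}
	for _, f := range files {
		if seen[f.start] {
			return true
		}
		seen[f.start] = true
	}
	return false
}

func main() {
	// The first export response held only partial entries of key "a".
	buggy := addSpan([]file{{"a", "a"}}, file{"a", "c"}, false)
	fixed := addSpan([]file{{"a", "a"}}, file{"a", "c"}, true)
	fmt.Println(buggy, hasDuplicateStarts(buggy)) // [{a a} {a c}] true
	fmt.Println(fixed, hasDuplicateStarts(fixed)) // [{a c}] false
}
```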

Fixes #97953 

Release note: None

98072: backupccl: replace restore2TB and restoretpccInc tests r=lidorcarmel a=msbutler

This patch removes the restore2TB* roachtests which ran a 2TB bank restore to
benchmark restore performance across a few hardware configurations. This patch
also replaces the `restoreTPCCInc/nodes=10` test which tested our ability to
handle a backup with a long chain.

This patch also adds:
1. `restore/tpce/400GB/aws/nodes=4/cpus=16` to measure how per-node throughput
scales when the per node vcpu count doubles relative to default.
2. `restore/tpce/400GB/aws/nodes=8/cpus=8` to measure how per-node throughput
scales when the number of nodes doubles relative to default.
3. `restore/tpce/400GB/aws/backupsIncluded=48/nodes=4/cpus=8` to measure
restore reliability and performance on a backup chain of length 48 relative to
default.

A future patch will update the fixtures used in the restore node shutdown
scripts, and add more perf based tests.

Fixes #92699

Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in b588f1f Mar 7, 2023