roachtest: disk stalls on GCP / Azure lead to somewhat frequent test failures #97968

Open
renatolabs opened this issue Mar 3, 2023 · 8 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-storage Storage Team

Comments

@renatolabs
Contributor

renatolabs commented Mar 3, 2023

We see tests fail because of disk stalls with some regularity, and there's a sense that they happen more than is acceptable. We have a support ticket with GCP [1]. According to their support team, we are expected to see improvements in this area 4-6 weeks after Feb 16 -- in other words, we expect these issues to not come up nearly as often by early April at the latest.

This issue is for us to keep track of disk stall failures (by mentioning this issue on failures caused by disk stalls) and to make sure we check on the progress of the fix and close it when we think it's resolved.

[1] https://console.cloud.google.com/support/cases/detail/v2/42856817?project=cockroach-ephemeral

Jira issue: CRDB-24992

@renatolabs renatolabs added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure T-testeng TestEng Team labels Mar 3, 2023
@blathers-crl

blathers-crl bot commented Mar 3, 2023

cc @cockroachdb/test-eng

nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 29, 2023
The disk-stalled roachtests were updated in cockroachdb#99747 to use PDs in favor
of local SSDs. This change broke the `failover/*/disk-stall` tests,
which look for `/dev/sdb` on GCE (the device name used for GCE
Persistent Disks), but the tests still create clusters with local SSDs
(the roachtest default).

Fix cockroachdb#99902.
Fix cockroachdb#99926.
Fix cockroachdb#99930.

Touches cockroachdb#97968.

Release note: None.
nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 29, 2023
The disk-stalled roachtests were updated in cockroachdb#99747 to use PDs in favor
of local SSDs. This change broke the `failover/*/disk-stall` tests,
which look for `/dev/sdb` on GCE (the device name used for GCE
Persistent Disks), but the tests still create clusters with local SSDs
(the roachtest default).

Fix cockroachdb#99902.
Fix cockroachdb#99926.
Fix cockroachdb#99930.

Touches cockroachdb#97968.

Release note: None.
nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 29, 2023
The disk-stalled roachtests were updated in cockroachdb#99747 to use PDs in favor
of local SSDs. This change broke the `failover/*/disk-stall` tests,
which look for `/dev/sdb` on GCE (the device name used for GCE
Persistent Disks), but the tests still create clusters with local SSDs
(the roachtest default).

Fix cockroachdb#99902.
Fix cockroachdb#99926.
Fix cockroachdb#99930.

Touches cockroachdb#97968.

Release note: None.
craig bot pushed a commit that referenced this issue Mar 30, 2023
98899: feat: allow starting docker container via env variable r=rickystewart a=btkostner

Fixes #87043 by allowing you to specify args via the `COCKROACH_ARGS` env value instead of a command. This is required to be able to use the official cockroach image with GitHub actions via a service. More details in the issue.

Note: due to weirdness with merging commands and env values, I decided that setting `COCKROACH_ARGS` would ignore any command given. This should reduce issues with people trying to use both ways of specifying args and instead force them to pick one.

Release note (general change): Allow setting docker command args via the `COCKROACH_ARGS` environment variable.
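
The precedence rule described above can be sketched as follows. This is a minimal Python illustration, not the image's actual entrypoint: the helper name `resolve_args` is hypothetical, and treating an empty `COCKROACH_ARGS` as unset is an assumption.

```python
import shlex


def resolve_args(env, command):
    """Pick the arguments to run (hypothetical helper, for illustration only).

    If COCKROACH_ARGS is set to a non-empty value, it wins and the command
    is ignored entirely; otherwise the command is used unchanged.
    """
    raw = env.get("COCKROACH_ARGS", "")
    if raw.strip():
        # Split the env value the way a POSIX shell would split words.
        return shlex.split(raw)
    return list(command)
```

Because the two sources are never merged, a caller gets either the env value or the command, never a mix of both.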

99607: sql: block DROP TENANT based on a session var r=stevendanna a=knz

Fixes #97972.
Epic: CRDB-23559

In clusters where we will promote tenant management operations, we would like to ensure there is one extra step needed for administrators to drop a tenant (and thus irremediably lose data). Given that `sql_safe_updates` is not set automatically when users open their SQL session using their own client, we need another mechanism.

This change introduces the new (hidden) session var, `disable_drop_tenant`. When set, tenant deletion fails with the following error message:

```
demo@127.0.0.1:26257/movr> drop tenant foo;
ERROR: rejected (via sql_safe_updates or disable_drop_tenant): DROP TENANT causes irreversible data loss
SQLSTATE: 01000
```

(The session var `sql_safe_updates` is _also_ included as a blocker in the mechanism so that folk using `cockroach sql` get double protection).

The default value of this session var is `false` in single-tenant clusters, for compatibility with CC Serverless. It will be set to `true` via a config profile (#98466) when suitable.
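
As a rough sketch of the guard described above (Python for illustration only; the function and exception names are hypothetical and this is not the actual implementation), either session var being set is enough to reject the statement:

```python
class DropTenantRejected(Exception):
    """Raised when a session var blocks DROP TENANT (hypothetical type)."""


def check_drop_tenant(session_vars):
    # Either var blocks the drop: sql_safe_updates covers `cockroach sql`
    # users, disable_drop_tenant covers sessions opened by other clients.
    if session_vars.get("sql_safe_updates") or session_vars.get("disable_drop_tenant"):
        raise DropTenantRejected(
            "rejected (via sql_safe_updates or disable_drop_tenant): "
            "DROP TENANT causes irreversible data loss"
        )
```

With both vars unset or `false`, the check passes silently and the drop would proceed.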

Release note: None

99690: ui: drop index with space r=maryliag a=maryliag

Previously, if the index had a space in its name,
it would fail to drop.
This commit adds quotes so it can be executed.
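
A minimal sketch of the quoting idea, in Python for illustration (the UI code itself is not shown here; the helper names are hypothetical, and the table name is assumed to be unqualified):

```python
def quote_ident(name):
    """Wrap a SQL identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def drop_index_stmt(table, index):
    # Assumes an unqualified table name; a schema-qualified name would need
    # each part quoted separately.
    return f"DROP INDEX {quote_ident(table)}@{quote_ident(index)};"


print(drop_index_stmt("users", "my index"))
# prints: DROP INDEX "users"@"my index";
```

Without the quotes, the space would split the index name into two tokens and the statement would fail to parse.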

Fixes #97988

Schema Insights
https://www.loom.com/share/04363b7f83484b5da19c760eb8d0de21

Table Details page
https://www.loom.com/share/1519b897a14440ddb066fb2ab03feb2d

Release note (bug fix): Index recommendations to DROP an index that has a space in its name can now be executed properly.

99750: sql: remove no longer used channel in createStatsNode r=yuzefovich a=yuzefovich

This hasn't been used as of fe6377c. Also mark `create_stats.go` as owned by SQL Queries.

Epic: None

Release note: None

99962: ui: add checks for values r=maryliag a=maryliag

Fixes #99655
Fixes #99538
Fixes #99539

Add checks to usages that could cause
`Cannot read properties of undefined`.

Release note: None

99963: roachtest: use local SSDs for disk-stall failover tests r=andrewbaptist a=nicktrav

The disk-stalled roachtests were updated in #99747 to use PDs in favor of local SSDs. This change broke the `failover/*/disk-stall` tests, which look for `/dev/sdb` on GCE (the device name used for GCE Persistent Disks), but the tests still create clusters with local SSDs (the roachtest default).

Fix #99902.
Fix #99926.
Fix #99930.

Touches #97968.

Release note: None.

Co-authored-by: Blake Kostner <git@btkostner.io>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Mar 30, 2023
The disk-stalled roachtests were updated in #99747 to use PDs in favor
of local SSDs. This change broke the `failover/*/disk-stall` tests,
which look for `/dev/sdb` on GCE (the device name used for GCE
Persistent Disks), but the tests still create clusters with local SSDs
(the roachtest default).

Fix #99902.
Fix #99926.
Fix #99930.

Touches #97968.

Release note: None.
@andrewbaptist
Collaborator

@renatolabs - I wanted to check what the next step on this is. It appears to still be happening, but the case on the GCP side is closed.

@blathers-crl blathers-crl bot added the T-storage Storage Team label Oct 20, 2023
@jbowens jbowens added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Oct 20, 2023
@msbutler
Collaborator

msbutler commented Nov 6, 2023

Had another instance of this: #113823

Projects
Status: Backlog
Development

No branches or pull requests

7 participants