roachtest: disk stalls on GCP / Azure lead to somewhat frequent test failures #97968
renatolabs added the C-bug (code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), A-testing (testing tools and infrastructure), and T-testeng (TestEng Team) labels on Mar 3, 2023.
cc @cockroachdb/test-eng
This was referenced Mar 3, 2023
This was referenced Mar 21, 2023
This was referenced Mar 22, 2023
nicktrav added a commit to nicktrav/cockroach that referenced this issue on Mar 29, 2023:
The disk-stalled roachtests were updated in cockroachdb#99747 to use PDs in favor of local SSDs. This change broke the `failover/*/disk-stall` tests, which look for `/dev/sdb` on GCE (the device name used for GCE Persistent Disks), but the tests still create clusters with local SSDs (the roachtest default). Fix cockroachdb#99902. Fix cockroachdb#99926. Fix cockroachdb#99930. Touches cockroachdb#97968. Release note: None.
This was referenced Mar 29, 2023
nicktrav added a commit to nicktrav/cockroach that referenced this issue on Mar 29, 2023:
The disk-stalled roachtests were updated in cockroachdb#99747 to use PDs in favor of local SSDs. This change broke the `failover/*/disk-stall` tests, which look for `/dev/sdb` on GCE (the device name used for GCE Persistent Disks), but the tests still create clusters with local SSDs (the roachtest default). Fix cockroachdb#99902. Fix cockroachdb#99926. Fix cockroachdb#99930. Touches cockroachdb#97968. Release note: None.
nicktrav added a commit to nicktrav/cockroach that referenced this issue on Mar 29, 2023:
The disk-stalled roachtests were updated in cockroachdb#99747 to use PDs in favor of local SSDs. This change broke the `failover/*/disk-stall` tests, which look for `/dev/sdb` on GCE (the device name used for GCE Persistent Disks), but the tests still create clusters with local SSDs (the roachtest default). Fix cockroachdb#99902. Fix cockroachdb#99926. Fix cockroachdb#99930. Touches cockroachdb#97968. Release note: None.
craig bot pushed a commit that referenced this issue on Mar 30, 2023:

98899: feat: allow starting docker container via env variable r=rickystewart a=btkostner

Fixes #87043 by allowing you to specify args via the `COCKROACH_ARGS` env value instead of a command. This is required to be able to use the official cockroach image with GitHub Actions via a service. More details in the issue.

Note: due to weirdness with merging commands and env values, I decided that setting `COCKROACH_ARGS` would ignore any command given. This should reduce issues with people trying to use both ways of specifying args and instead force them to pick one.

Release note (general change): Allow setting docker command args via the `COCKROACH_ARGS` environment variable.

99607: sql: block DROP TENANT based on a session var r=stevendanna a=knz

Fixes #97972. Epic: CRDB-23559

In clusters where we will promote tenant management operations, we would like to ensure there is one extra step needed for administrators to drop a tenant (and thus irremediably lose data).

Given that `sql_safe_updates` is not set automatically when users open their SQL session using their own client, we need another mechanism.

This change introduces the new (hidden) session var, `disable_drop_tenant`. When set, tenant deletion fails with the following error message:

```
demo@127.0.0.1:26257/movr> drop tenant foo;
ERROR: rejected (via sql_safe_updates or disable_drop_tenant): DROP TENANT causes irreversible data loss
SQLSTATE: 01000
```

(The session var `sql_safe_updates` is _also_ included as a blocker in the mechanism so that folk using `cockroach sql` get double protection.)

The default value of this session var is `false` in single-tenant clusters, for compatibility with CC Serverless. It will be set to `true` via a config profile (#98466) when suitable.

Release note: None

99690: ui: drop index with space r=maryliag a=maryliag

Previously, if the index had a space in its name, it would fail to drop. This commit adds quotes so it can be executed.

Fixes #97988

Schema Insights https://www.loom.com/share/04363b7f83484b5da19c760eb8d0de21
Table Details page https://www.loom.com/share/1519b897a14440ddb066fb2ab03feb2d

Release note (bug fix): Index recommendation to DROP an index that has a space in its name can now be properly executed.

99750: sql: remove no longer used channel in createStatsNode r=yuzefovich a=yuzefovich

This hasn't been used as of fe6377c. Also mark `create_stats.go` as owned by SQL Queries.

Epic: None
Release note: None

99962: ui: add checks for values r=maryliag a=maryliag

Fixes #99655
Fixes #99538
Fixes #99539

Add checks to usages that could cause `Cannot read properties of undefined`.

Release note: None

99963: roachtest: use local SSDs for disk-stall failover tests r=andrewbaptist a=nicktrav

The disk-stalled roachtests were updated in #99747 to use PDs in favor of local SSDs. This change broke the `failover/*/disk-stall` tests, which look for `/dev/sdb` on GCE (the device name used for GCE Persistent Disks), but the tests still create clusters with local SSDs (the roachtest default).

Fix #99902. Fix #99926. Fix #99930. Touches #97968.

Release note: None.

Co-authored-by: Blake Kostner <git@btkostner.io>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue on Mar 30, 2023:
The disk-stalled roachtests were updated in #99747 to use PDs in favor of local SSDs. This change broke the `failover/*/disk-stall` tests, which look for `/dev/sdb` on GCE (the device name used for GCE Persistent Disks), but the tests still create clusters with local SSDs (the roachtest default). Fix #99902. Fix #99926. Fix #99930. Touches #97968. Release note: None.
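The mismatch described in this commit message (a test hard-coding the Persistent Disk device while the cluster is provisioned with local SSDs) can be illustrated with a minimal Go sketch. This is not the roachtest code: `diskKind`, `gceStallDevice`, and `checkDevice` are hypothetical names invented for illustration, and `/dev/nvme0n1` as the local SSD device is an assumption. Only `/dev/sdb` for Persistent Disks comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// diskKind is a hypothetical enum for this sketch; it is not a roachtest type.
type diskKind int

const (
	persistentDisk diskKind = iota
	localSSD
)

// gceStallDevice returns the block device a disk-stall test would target on GCE.
// "/dev/sdb" for Persistent Disks is taken from the commit message;
// "/dev/nvme0n1" for local SSDs is an assumption made for this example.
func gceStallDevice(kind diskKind) (string, error) {
	switch kind {
	case persistentDisk:
		return "/dev/sdb", nil
	case localSSD:
		return "/dev/nvme0n1", nil
	default:
		return "", errors.New("unknown disk kind")
	}
}

// checkDevice reports an error when the device a test expects does not match
// the device implied by how the cluster was provisioned.
func checkDevice(expected string, provisioned diskKind) error {
	actual, err := gceStallDevice(provisioned)
	if err != nil {
		return err
	}
	if actual != expected {
		return fmt.Errorf("test expects %s but cluster provides %s", expected, actual)
	}
	return nil
}

func main() {
	// The failure mode from the commit message: the test expects the PD
	// device, but the cluster was created with local SSDs.
	if err := checkDevice("/dev/sdb", localSSD); err != nil {
		fmt.Println("mismatch:", err)
	}
}
```

Deriving the device from the cluster's disk type, or failing fast on a mismatch, would surface this kind of breakage as a clear configuration error rather than as a confusing disk-stall test failure.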
This was referenced Jul 10, 2023
This was referenced Jul 18, 2023
@renatolabs - I wanted to check what the next step is on this. It appears to still be happening, but the case on the GCP side is closed.
jbowens added the A-storage label (relating to our storage engine (Pebble) on-disk storage) on Oct 20, 2023.
This was referenced Oct 20, 2023
This was referenced Oct 26, 2023
Had another instance of this: #113823
This was referenced May 13, 2024
renatolabs changed the title from "roachtest: disk stalls on GCP lead to somewhat frequent test failures" to "roachtest: disk stalls on GCP / Azure lead to somewhat frequent test failures" on May 15, 2024.
We see tests fail because of disk stalls with some regularity, and there's a sense that they happen more than is acceptable. We have a support ticket with GCP [1]. According to their support team, we are expected to see improvements in this area 4-6 weeks after Feb 16 -- in other words, we expect these issues to not come up nearly as often by early April at the latest.
This issue is for us to keep track of disk stall failures (by mentioning this issue on failures caused by disk stalls) and to make sure we check on the progress of the fix and close it when we think it's resolved.
[1] https://console.cloud.google.com/support/cases/detail/v2/42856817?project=cockroach-ephemeral
Jira issue: CRDB-24992