roachtest: failover/chaos/read-only failed #106013

Closed
cockroach-teamcity opened this issue Jul 2, 2023 · 10 comments
Labels
A-testing Testing tools and infrastructure branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 2, 2023

roachtest.failover/chaos/read-only failed with artifacts on release-23.1 @ e12e85479312972b551677203849d29aeb38ad5f:

(cluster.go:2247).Run: output in run_094944.124037951_n1-10_echo-0-sudo-blockdev: echo "0 $(sudo blockdev --getsz /dev/nvme1n1) linear /dev/nvme1n1 0" | sudo dmsetup create data1 returned: COMMAND_PROBLEM: exit status 1
(cluster.go:2247).Run: cluster.RunE: context canceled
(cluster.go:2247).Run: cluster.RunE: context canceled
(cluster.go:2247).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-29338

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jul 2, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 2, 2023
@erikgrinaker erikgrinaker self-assigned this Jul 3, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 3, 2023
@erikgrinaker
Contributor

device-mapper: reload ioctl on data1  failed: Device or resource busy

Unclear why, since we successfully unmounted the device just prior.
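One way to narrow down a "Device or resource busy" error from `dmsetup create` is to check whether another kernel driver (an md RAID array, LVM, another device-mapper target) is stacked on top of the disk, since unmounting a filesystem does not release those. The sketch below is only an illustration, not part of roachtest: it assumes the standard Linux sysfs layout `/sys/block/<device>/holders`, and the function name `device_holders` is hypothetical.

```python
import os


def device_holders(device, sysfs_root="/sys/block"):
    """Return the names of kernel devices stacked on top of `device`.

    Reads <sysfs_root>/<device>/holders, which lists devices (for example
    md0 or dm-0) that currently hold the block device open. An empty list
    means nothing is stacked on it, or the device does not exist.
    """
    holders_dir = os.path.join(sysfs_root, device, "holders")
    try:
        return sorted(os.listdir(holders_dir))
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    # Device name taken from the failing command in this issue.
    print(device_holders("nvme1n1"))
```

A non-empty result, such as an md array, would be one plausible explanation for the busy error even after a successful unmount.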

@erikgrinaker
Contributor

I notice that all of these failures are on AWS. I thought we only ran these on GCE -- since the failure modes are identical across clouds (they're synthetic), there is little benefit in doubling the costs by running on two clouds. I'll submit a PR to only run the nightly in GCE.

We do run separate disk-stall tests on both GCE and AWS though, which fail as well (see e.g. #106009). So we'll need to address the root problem here regardless.

@erikgrinaker
Contributor

The nodes use two NVMe drives in a software RAID:

nvme2n1      259:1    0 220.7G  0 disk  
└─md0          9:0    0 720.5G  0 raid0 
nvme1n1      259:4    0   500G  0 disk  
└─md0          9:0    0 720.5G  0 raid0 

Will see if we can disable that somehow.
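For spotting this automatically, the `lsblk` tree above can be parsed to list which disks belong to which RAID device. This is a rough sketch rather than anything roachtest does: it assumes the default `lsblk` column order (NAME, MAJ:MIN, RM, SIZE, RO, TYPE), handles only one level of nesting, and `raid_members` is a hypothetical helper name.

```python
def raid_members(lsblk_output):
    """Map each RAID device to the top-level disks it is listed under.

    Assumes default lsblk columns, so the device type is the sixth field,
    and that children are drawn with the usual tree characters.
    """
    members = {}
    parent = None
    for line in lsblk_output.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        name, dev_type = fields[0], fields[5]
        if name[0] in "└├":
            # Child row: strip the tree-drawing prefix to get the name.
            child = name.lstrip("└├─")
            if parent is not None and dev_type.startswith("raid"):
                members.setdefault(child, []).append(parent)
        else:
            parent = name  # top-level disk
    return members


SAMPLE = """\
nvme2n1      259:1    0 220.7G  0 disk
└─md0          9:0    0 720.5G  0 raid0
nvme1n1      259:4    0   500G  0 disk
└─md0          9:0    0 720.5G  0 raid0
"""

if __name__ == "__main__":
    print(raid_members(SAMPLE))
```

With the output pasted above, both NVMe devices are reported as members of `md0`.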

@cockroach-teamcity
Member Author

roachtest.failover/chaos/read-only failed with artifacts on release-23.1 @ e12e85479312972b551677203849d29aeb38ad5f:

(cluster.go:2247).Run: output in run_094718.359838974_n1-10_echo-0-sudo-blockdev: echo "0 $(sudo blockdev --getsz /dev/nvme1n1) linear /dev/nvme1n1 0" | sudo dmsetup create data1 returned: COMMAND_PROBLEM: exit status 1
(cluster.go:2247).Run: cluster.RunE: context canceled
(cluster.go:2247).Run: cluster.RunE: context canceled
(cluster.go:2247).Run: cluster.RunE: context canceled
test artifacts and logs in: /artifacts/failover/chaos/read-only/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=2 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)


@erikgrinaker
Contributor

These tests actually don't use local SSDs, but EBS volumes are exposed as NVMe devices.

ROACHTEST_localSSD=false , ROACHTEST_ssd=0

These tests were switched to use PDs rather than local SSDs in #99747 and #99963.

Don't know yet why we're seeing multiple devices and RAIDing here.

@erikgrinaker
Contributor

The problem seems to be that we're creating c6id.xlarge instances even though we haven't requested local SSDs, so we get a local SSD and an EBS volume that are then RAIDed together.
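A guard against this kind of mismatch could look roughly like the sketch below. It relies on the AWS naming convention in which a `d` among the letters after the generation number (as in `c6id`) indicates local instance store volumes. This is a heuristic, not a roachtest API: some families (for example `i3`) have instance storage without the `d`, and the function names `has_instance_store_suffix` and `check_machine_type` are hypothetical.

```python
import re

# family letters, generation digits, attribute letters, then the size
_INSTANCE_TYPE = re.compile(r"^([a-z]+)(\d+)([a-z]*)\.([a-z0-9]+)$")


def has_instance_store_suffix(instance_type):
    """Return True if the attribute letters of the type include 'd'."""
    match = _INSTANCE_TYPE.match(instance_type)
    if match is None:
        raise ValueError(f"unrecognized instance type: {instance_type!r}")
    return "d" in match.group(3)


def check_machine_type(instance_type, local_ssd):
    """Reject instance-store types when local SSDs were not requested."""
    if not local_ssd and has_instance_store_suffix(instance_type):
        raise ValueError(
            f"{instance_type} comes with local instance storage, "
            "but local SSDs were not requested"
        )


if __name__ == "__main__":
    try:
        check_machine_type("c6id.xlarge", local_ssd=False)
    except ValueError as err:
        print(err)
```

Whether to fail, as here, or to pick the corresponding type without local storage is a design choice for whoever fixes the underlying bug.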

@erikgrinaker
Contributor

FWIW, these started failing because of a recent TeamCity change that erroneously enabled all roachtests on AWS. I suspect these disk stall tests have been broken on AWS for some time, probably since #99747.

@erikgrinaker
Contributor

I wrote up #106058, handing this over to test-eng. The nightly failures will stop when we correctly omit these tests on AWS.

@erikgrinaker erikgrinaker added the T-testeng TestEng Team label Jul 3, 2023
@blathers-crl

blathers-crl bot commented Jul 3, 2023

cc @cockroachdb/test-eng

@erikgrinaker erikgrinaker removed the T-kv KV Team label Jul 3, 2023
@erikgrinaker
Contributor

The TeamCity param which ran these tests on AWS has been reverted. The roachtest bug that breaks this test on AWS is tracked in #106058.
