roachtest: gossip/chaos/nodes=9 failed #126077

Closed
cockroach-teamcity opened this issue Jun 23, 2024 · 2 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team X-duplicate Closed as a duplicate of another issue.

Comments

@cockroach-teamcity

cockroach-teamcity commented Jun 23, 2024

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ c0749a3e9d45fb57f8dfa0936626437b887cda56:

(gossip.go:81).2: gossip did not stabilize (dead node 4) in 42.7s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-39761

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Jun 23, 2024
@nvanbenschoten

This failure was with the new logging added in #125794.

2024/06/23 09:21:43 gossip.go:123: test status: waiting for gossip to exclude dead node %d4
2024/06/23 09:21:43 gossip.go:126: checking if gossip excludes dead node 4
2024/06/23 09:21:43 gossip.go:88: 1: checking gossip
2024/06/23 09:21:44 gossip.go:92: 1: gossip not ok (dead node 4 present) (0s)
2024/06/23 09:21:45 gossip.go:126: checking if gossip excludes dead node 4
2024/06/23 09:21:45 gossip.go:88: 1: checking gossip
2024/06/23 09:21:45 gossip.go:92: 1: gossip not ok (dead node 4 present) (1s)
2024/06/23 09:22:26 gossip.go:126: checking if gossip excludes dead node 4
2024/06/23 09:22:26 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (gossip.go:81).2: gossip did not stabilize (dead node 4) in 42.7s

We see a large jump between:

2024/06/23 09:21:45 gossip.go:92: 1: gossip not ok (dead node 4 present) (1s)

and

2024/06/23 09:22:26 gossip.go:126: checking if gossip excludes dead node 4

This is surprising, as the only logic that should be running between these two lines is a time.Sleep(time.Second). This lines up with the analysis in #124828 (comment), where we suspected a pause around this sleep, but couldn't prove it.

I'll add more logging and fold this issue in with that one.

@nvanbenschoten nvanbenschoten added X-duplicate Closed as a duplicate of another issue. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 24, 2024
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 24, 2024
@nvanbenschoten

I'll add more logging

Logging added in #126087.

@nvanbenschoten nvanbenschoten closed this as not planned Jun 24, 2024
craig bot pushed a commit that referenced this issue Jun 24, 2024
125865: cli: add storage engine metrics to debug zip r=itsbilal a=anish-shanbhag

Metrics from the storage engine are already exposed in the `/debug/lsm` HTTP endpoint. These can be useful when debugging storage issues, and so this change adds these metrics to the debug zip under `/nodes/$N/lsm.txt` in the same text format as the HTTP route. The previously unused `EngineStats` status endpoint was repurposed to serve these metrics from each node.

Fixes: #79518
Epic: none
Release note: none

126087: roachtest: improve logging in gossip/chaos/nodes=9 further r=nvanbenschoten a=nvanbenschoten

Informs #124828.
Informs #126077.

Release note: None

Co-authored-by: Anish Shanbhag <anish.shanbhag@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
asg0451 pushed a commit to asg0451/cockroach that referenced this issue Jun 25, 2024
@github-project-automation github-project-automation bot moved this to roachtest/unit test backlog in KV Aug 28, 2024