roachtest: tpcc/headroom/n4cpu16 failed #101425

Closed
cockroach-teamcity opened this issue Apr 13, 2023 · 4 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 13, 2023

roachtest.tpcc/headroom/n4cpu16 failed with artifacts on master @ 678cfd4cbebbf3cced16747baf42a4c54cb2a92d:

test artifacts and logs in: /artifacts/tpcc/headroom/n4cpu16/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_080354.574459943_n4_cockroach-workload-r: ./cockroach workload run tpcc --warehouses=1470 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-3} returned: parallel execution failure: COMMAND_PROBLEM: ssh verbose log retained in ssh_080355.894560061_n4_cockroach-workload-r.log: exit status 1

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-26943

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 13, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Apr 13, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Apr 13, 2023
@renatolabs
Contributor

It seems that around the time of the failure (08:35:07.939648), n1 could no longer connect to other nodes in the cluster, and shortly afterwards the test failed because the workload node couldn't connect to it either. No node crashed, which suggests there was perhaps a network issue during this test.

Something else I noticed is related to the code here:

	if err != io.EOF {
		log.Warningf(ctx, "closed timestamps side-transport connection dropped from node: %d", nodeID)
	} else {
		log.VEventf(ctx, 2, "closed timestamps side-transport connection dropped from node: %d (%s)", nodeID, err)
	}

Seems we're only logging the error when we know it's an EOF. In the case of this test, the error is not logged, so we don't know what error the receiver is seeing. Is this intentional?

@cockroachdb/replication could you double check the above and whether there's something to look into here?

@aliher1911 aliher1911 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 14, 2023
@aliher1911
Contributor

The same failure happened on all 3 nodes. At 08:35:07 they all had connection unavailable errors, and then they actually reconnected a moment before the test killed them. Maybe we could delay the cluster kill by a few seconds on failure? That could show whether failures are transient. I'm not sure what the monetary cost of that would be, but if most tests succeed it should be fine?

@aliher1911
Contributor

aliher1911 commented Apr 14, 2023

@renatolabs it looks like an oversight as io.EOF is a constant and it makes no sense to print it.

#101550

@aliher1911 aliher1911 added the X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue label Apr 14, 2023
@aliher1911
Contributor

Closing as a network issue.
