roachtest: cdc/sink-chaos failed #96419

Closed
cockroach-teamcity opened this issue Feb 2, 2023 · 21 comments · Fixed by #97571
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-roachtest O-robot Originated from a bot. T-cdc
Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Feb 2, 2023

roachtest.cdc/sink-chaos failed with artifacts on master @ 22244a780dcfaca48162dde8e0f90b5ba9b6bb9c:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_101232.441206973_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_101233.189661854_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-24114

Epic CRDB-11732

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 2, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Feb 2, 2023
@blathers-crl blathers-crl bot added the T-cdc label Feb 2, 2023
@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_102050.580219051_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_102051.370128293_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 8e24570fa366ed038c6ae65f50db5d8e22826db0:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_101856.333108523_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_101857.122870005_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ eb158026c50d8fa856e42f928d844831ea9e6b28:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_102342.926823441_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_102343.724324591_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@samiskin samiskin removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 8, 2023
@samiskin samiskin added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 8, 2023
@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ e51ffa013c81212870891001f0328912550fa75d:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_103131.063502119_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: context canceled
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 2a7edbeb0737b1309064c25c641a309c2980d9ba:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_100941.831818606_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: context canceled
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 31365e21dc606cdc1e4302c86192ffc5a6cf1255:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1937).Run: output in run_101924.591387929_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: context canceled
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 7e2df35a2f6bf7a859bb0539c8ca43c4e72ed260:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1940).Run: output in run_103323.114592951_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: context canceled
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ c95bef097bd4c213c6b5c0c125a9a846c4479d73:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1940).Run: output in run_103906.927230883_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_103907.684738529_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 3d054f37c7c87f53cb56fac4e5500f0d1130d09a:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1940).Run: output in run_102531.296808624_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_102532.100190027_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ e9c96e7179e19aae2f8d386f67eb950db8c3354b:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1940).Run: output in run_103203.909948525_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_103204.640858670_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@miretskiy
Contributor

@samiskin any updates on this issue?

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 286b3e235171a39b8f9910555affcc7ce310741a:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1956).Run: output in run_102934.007520384_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_102934.755935866_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@jayshrivastava
Contributor

jayshrivastava commented Feb 22, 2023

The latest 3 failures show an error while running TPCC:

I230222 10:40:35.105029 1786 workload/pgx_helpers.go:79  [T1] 4  pgx logger [error]: Query logParams=map[args:[25 1 2113] err:ERROR: rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.1.2:59786->10.142.1.4:26257: use of closed network connection (SQLSTATE XXUUU) pid:2383385 sql:
I230222 10:40:35.105029 1786 workload/pgx_helpers.go:79  [T1] 4 +		SELECT sum(ol_amount) FROM order_line
I230222 10:40:35.105029 1786 workload/pgx_helpers.go:79  [T1] 4 +		WHERE ol_w_id = $1 AND ol_d_id = $2 AND ol_o_id = $3]
Error: error in delivery: ERROR: rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.1.2:59786->10.142.1.4:26257: use of closed network connection (SQLSTATE XXUUU)

This is from failure_1.log
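
For anyone triaging similar workload logs: the `read tcp <local>-><remote>` fragment in the error names both ends of the dropped connection, so the remote address is a reasonable first candidate when deciding which node's stderr log to open. A minimal sketch of pulling those endpoints out of a log line follows. The helper name `endpoints` and the regular expression are illustrative assumptions (IPv4 only), not part of the workload or roachtest code.

```go
package main

import (
	"fmt"
	"regexp"
)

// connRe matches the "read tcp <local>-><remote>" fragment seen in the log
// line above. It only handles IPv4 host:port pairs (an assumption made to
// keep the sketch short).
var connRe = regexp.MustCompile(`read tcp ([0-9.]+:[0-9]+)->([0-9.]+:[0-9]+)`)

// endpoints is a hypothetical helper: it returns the local and remote
// addresses found in msg, or ok=false when msg has no such fragment.
func endpoints(msg string) (local, remote string, ok bool) {
	m := connRe.FindStringSubmatch(msg)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	msg := "ERROR: rpc error: code = Unavailable desc = error reading from server: " +
		"read tcp 10.142.1.2:59786->10.142.1.4:26257: use of closed network connection (SQLSTATE XXUUU)"
	if local, remote, ok := endpoints(msg); ok {
		fmt.Printf("local=%s remote=%s\n", local, remote)
	}
}
```

The address alone does not say why the peer went away; it only narrows down where to look next.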

@miretskiy
Contributor

Perhaps the node crashed? This started happening about 3 weeks ago and keeps happening consistently.
I don't think it's a one-off issue, and we have it marked as a release blocker.

@jayshrivastava
Contributor

jayshrivastava commented Feb 22, 2023

Finally found it. Node 3 panicked:
https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_RoachtestNightlyGceBazel/8785686:id/cdc/sink-chaos/run_1/artifacts.zip!/logs/3.unredacted/cockroach-stderr.log

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x3b6adc9]

goroutine 242783 [running]:
panic({0x5002fc0, 0x9ce4030})
	GOROOT/src/runtime/panic.go:987 +0x3ba fp=0xc00d13be20 sp=0xc00d13bd60 pc=0x49dd5a
runtime.panicmem(...)
	GOROOT/src/runtime/panic.go:260
runtime.sigpanic()
	GOROOT/src/runtime/signal_unix.go:835 +0x2f6 fp=0xc00d13be70 sp=0xc00d13be20 pc=0x4b4c16
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc009b62de0, 0x1)
	github.com/Shopify/sarama/external/com_github_shopify_sarama/async_producer.go:620 +0x1a9 fp=0xc00d13bed0 sp=0xc00d13be70 pc=0x3b6adc9
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc009b62de0)
	github.com/Shopify/sarama/external/com_github_shopify_sarama/async_producer.go:564 +0x537 fp=0xc00d13bf90 sp=0xc00d13bed0 pc=0x3b6a937
github.com/Shopify/sarama.(*partitionProducer).dispatch-fm()
	<autogenerated>:1 +0x26 fp=0xc00d13bfa8 sp=0xc00d13bf90 pc=0x3bbca26
github.com/Shopify/sarama.withRecover(0x0?)
	github.com/Shopify/sarama/external/com_github_shopify_sarama/utils.go:43 +0x3e fp=0xc00d13bfc8 sp=0xc00d13bfa8 pc=0x3bb6f9e
github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer.func1()
	github.com/Shopify/sarama/external/com_github_shopify_sarama/async_producer.go:515 +0x26 fp=0xc00d13bfe0 sp=0xc00d13bfc8 pc=0x3b6a346
runtime.goexit()
	GOROOT/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc00d13bfe8 sp=0xc00d13bfe0 pc=0x4d2a41
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	github.com/Shopify/sarama/external/com_github_shopify_sarama/async_producer.go:515 +0x1ea
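
The trace passes through `sarama.withRecover`, yet the process still died. As far as I understand it, sarama only recovers internal panics when its package-level `PanicHandler` is set, and a panic in a goroutine the library spawned cannot be caught by the caller's own `recover`. The sketch below reproduces that pattern in isolation. It does not import sarama: `panicHandler`, `withRecover`, `partitionState`, and `runWorker` are stand-ins written for illustration, not the library's code.

```go
package main

import (
	"fmt"
	"sync"
)

// panicHandler is a stand-in for a library-level hook. When it is nil,
// withRecover does not recover and the panic terminates the process.
var panicHandler func(interface{})

// withRecover runs fn. If a handler is installed, a panic inside fn is
// recovered and passed to the handler; otherwise it keeps unwinding.
func withRecover(fn func()) {
	defer func() {
		if panicHandler == nil {
			return // no recover call: the panic continues and crashes the process
		}
		if r := recover(); r != nil {
			panicHandler(r)
		}
	}()
	fn()
}

// partitionState is a hypothetical struct used only to trigger a nil dereference.
type partitionState struct{ highWatermark int }

// runWorker starts a goroutine that dereferences a nil pointer and returns
// whatever value the installed handler received.
func runWorker() interface{} {
	var recovered interface{}
	panicHandler = func(r interface{}) { recovered = r }

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		withRecover(func() {
			var st *partitionState
			st.highWatermark++ // nil pointer dereference, as in the trace above
		})
	}()
	wg.Wait() // the handler's write happens before Wait returns
	return recovered
}

func main() {
	fmt.Println("recovered:", runWorker())
}
```

Installing a handler like this would only keep the node alive; it would not fix whatever left the pointer nil, so it is a mitigation at best.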

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ e028ce5b14505dfd17ef8b13001c0ab8ac811e3c:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1956).Run: output in run_101206.687098033_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_101207.492156179_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 0d3393b0623a5c258b25725f64f3689e2f54667b:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1956).Run: output in run_100636.525023948_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_100637.266464476_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 39c06b5a438c01c93ffbfeeefe702d3f9b620eaf:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1956).Run: output in run_100937.610495214_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_100938.380837809_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 13c58f621519794e775b7cfc4d8b557bc99eeca0:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 134)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ b0e5507f74c07e13cfda8cda8b9079b457a9f37d:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1956).Run: output in run_101305.020474857_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_101305.764062036_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.cdc/sink-chaos failed with artifacts on master @ 21786aa112e6b822858f281c1cc59608987c5c0a:

test artifacts and logs in: /artifacts/cdc/sink-chaos/run_1
(cluster.go:1956).Run: output in run_101708.818500757_n4_workload-run-tpcc-wa: ./workload run tpcc --warehouses=100 --duration=30m  {pgurl:1-3}  returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_101709.557595290_n4_workload-run-tpcc-wa.log: exit status 1
(cdc.go:283).Close: error shutting down prometheus/grafana: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

@miretskiy miretskiy added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 1, 2023
craig bot pushed a commit that referenced this issue Mar 1, 2023
97571: cdc: update sarama from 1.35.0 to 1.38.1 r=miretskiy a=jayshrivastava

A previous update (#95544), which updated sarama to 1.35.0, introduced a bug that resulted in nodes crashing. These failures are shown by #96419. The bug is described in detail in IBM/sarama#2322 and fixed by IBM/sarama@2379257, which is included in version 1.38.1.

Fixes: #96419
Release note: None
Epic: None


Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
@craig craig bot closed this as completed in 7ed45a6 Mar 1, 2023
4 participants