roachtest: sqlsmith: tracking issue for inbox communication errors #66174

Closed
cockroach-teamcity opened this issue Jun 8, 2021 · 12 comments · Fixed by #70280
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity

roachtest.sqlsmith/setup=seed-vec/setting=vec failed with artifacts on master @ cef0bd947590218ea6ac94f4f85dffc25e16fcd0:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/sqlsmith/setup=seed-vec/setting=vec/run_1
	sqlsmith.go:221,sqlsmith.go:251,test_runner.go:801: error: pq: inbox communication error: rpc error: code = Canceled desc = context canceled
		stmt:
		SELECT
			tab_7694._bool AS col_28205,
			tab_7694._timestamptz AS col_28206,
			tab_7695._bool AS col_28207,
			tab_7696._int4 AS col_28208,
			tab_7694._float8 AS col_28209,
			tab_7694._interval AS col_28210,
			tab_7693._timestamp AS col_28211,
			tab_7696._bool AS col_28212,
			tab_7694._string AS col_28213
		FROM
			defaultdb.public.seed_vec@seed_vec__int8__float8__date_idx AS tab_7693,
			defaultdb.public.seed_vec@[0] AS tab_7694,
			defaultdb.public.seed_vec@seed_vec__int8__float8__date_idx AS tab_7695
			JOIN defaultdb.public.seed_vec@seed_vec__int8__float8__date_idx AS tab_7696 ON
					(tab_7695._int2) = (tab_7696._int4)
					AND (tab_7695._uuid) = (tab_7696._uuid)
					AND (tab_7695._timestamp) = (tab_7696._timestamp)
					AND (tab_7695._int8) = (tab_7696._int8)
		WHERE
			tab_7696._bool
		ORDER BY
			tab_7693._int2
		LIMIT
			8:::INT8;
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh sqlsmith/setup=seed-vec/setting=vec

/cc @cockroachdb/sql-queries


@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 8, 2021
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@yuzefovich

Regarding these "inbox communication errors", my guess is that this is actually a test harness problem. For the sqlsmith roachtests we use SET testing_vectorize_inject_panics=true;, and my hypothesis is that an injected panic triggers cancellation of the flow context, which in turn shuts down the gRPC streams. At the moment I'm not certain why we would see the remote cancellation errors in such a scenario, but there might be a race between the injected error (which we swallow in sqlsmith) and the inbox communication error.

To test this hypothesis, I'll adjust the test harness to enable the setting in only 50% of runs.

craig bot pushed a commit that referenced this issue Aug 16, 2021
68812: randgen: generate random expression indexes r=mgartner a=mgartner

#### randgen: refactor random expression generation

This commit refactors the code that generates random computed columns so
that the logic for generating random expressions can be used in a future
commit to generate random expression indexes.

Release note: None

#### randgen: generate random expression indexes

The `randgen` package now generates schemas with random expression
indexes. This allows for random testing of expression indexes in
`sqlsmith` and ternary logic partitioning (TLP).

Fixes #68174

Release note: None


68918: Revert "streamingccl: hang processors on losing connection with sinkless stream client" r=arulajmani a=adityamaru

This reverts commit f5244f4.

68990: roachtest/tests: adjust sqlsmith slightly r=yuzefovich a=yuzefovich

This commit adjusts `sqlsmith` roachtest slightly so that vectorized
panic injection occurs with 50% probability (instead of 100%). This is
done to check whether the panic injection is the root cause of the inbox
communication errors we have been seeing sporadically.

Informs: #66174.

Release note: None

69003: backupccl: skip TestBackupRestoreSystemJobProgress under stressrace r=arulajmani a=adityamaru

The test times out under stressrace. It runs without flaking under `stress` after #68961.

Release note: None

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 18, 2021
@yuzefovich yuzefovich assigned yuzefovich and unassigned rytaft Aug 18, 2021
@yuzefovich yuzefovich changed the title roachtest: sqlsmith/setup=seed-vec/setting=vec failed roachtest: sqlsmith: tracking issue for inbox communication errors Aug 18, 2021
@yuzefovich

yuzefovich commented Aug 20, 2021

#69188 - vectorized panic injection is enabled.
#69228 - vectorized panic injection is enabled.
68702 - vectorized panic injection is enabled.
#69736 - vectorized panic injection is enabled.
#69831 - vectorized panic injection is enabled.
#70073 - vectorized panic injection is enabled.
#70074 - vectorized panic injection is enabled.
#70105 - vectorized panic injection is enabled.
#70246 - vectorized panic injection is enabled.

@yuzefovich

Alright, so it does look like the vectorized panic injection is the cause of these "inbox communication errors". My understanding is as follows: we have a distributed plan, and one of the flows gets the injected panic (most likely in Init), which is propagated either to the flow coordinator on the gateway or to the outbox on one of the nodes. That component then cancels its flow, which shuts down the gRPC stream, and the shutdown shows up as an "inbox communication error" on one of the inboxes. There is likely a race between propagating the original injected error and this "inbox communication error" to the DistSQLReceiver.

I have an idea of how to make things better.
