Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/democluster: TestTransientClusterMultitenant failed #96162

Closed
cockroach-teamcity opened this issue Jan 30, 2023 · 10 comments · Fixed by #106679
Closed

cli/democluster: TestTransientClusterMultitenant failed #96162

cockroach-teamcity opened this issue Jan 30, 2023 · 10 comments · Fixed by #106679
Assignees
Labels
A-cluster-upgrades A-jobs branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-jobs
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jan 30, 2023

Epic: CRDB-18499

cli/democluster.TestTransientClusterMultitenant failed with artifacts on master @ 69dd453d0e61e258f402c5751de310405743cd18:

        	            	  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).runMigration
        	            	  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:584
        	            	  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).RunPermanentUpgrades
        	            	  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:269
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServer).preStart
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_sql.go:1514
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).PreStart
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/tenant.go:642
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).Start
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/tenant.go:745
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*Server).startTenantServerInternal
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_new_server.go:153
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*Server).newTenantServer
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_new_server.go:82
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).startServerInternal
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:361
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).startControlledServer.func2
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:233
        	            	  | [...repeated from below...]
        	            	Wraps: (4) poll-show-jobs
        	            	Wraps: (5) attached stack trace
        	            	  -- stack trace:
        	            	  | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*leasedDescriptors).maybeUpdateDeadline
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/leased_descriptors.go:206
        	            	  | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).MaybeUpdateDeadline
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:160
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:275
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:130
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2530
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:129
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1943
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1948
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1866
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).runWithEx.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:171
        	            	  | runtime.goexit
        	            	  | 	GOROOT/src/runtime/asm_amd64.s:1594
        	            	Wraps: (6) liveness session expired 6.50590502s before transaction
        	            	Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
        	Test:       	TestTransientClusterMultitenant
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/1937dc802442f15720877ed362f2998f/logTestTransientClusterMultitenant1367242056
--- FAIL: TestTransientClusterMultitenant (165.95s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-sessions @cockroachdb/server @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-23974

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 30, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jan 30, 2023
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jan 30, 2023
@knz
Copy link
Contributor

knz commented Jan 30, 2023

full error trace below.

This is telling me that the execution of cluster upgrades when a tenant is first booted up is currently brittle - apparently it's possible for the SQL liveness session to expire before/during the migration.

What is a good answer here? Make the upgrade jobs retry after refreshing the SQL liveness session?

cc @ajwerner if you have ideas. I think @healthy-pod you'll find this interesting too.

    demo_cluster_test.go:302: 
        	Error Trace:	/home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/2426/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/cli/democluster/democluster_test_/democluster_test.runfiles/com_github_cockroachdb_cockroach/pkg/cli/democluster/demo_cluster_test.go:302
        	Error:      	Received unexpected error:
        	            	polling for queued jobs to complete: poll-show-jobs: liveness session expired 6.50590502s before transaction
        	            	(1) attached stack trace
        	            	  -- stack trace:
        	            	  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).waitForJobsToBeTerminalOrPaused
        	            	  | 	github.com/cockroachdb/cockroach/pkg/jobs/wait.go:118
        	            	  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).waitForJobs
        	            	  | 	github.com/cockroachdb/cockroach/pkg/jobs/wait.go:146
        	            	  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).Run
        	            	  | 	github.com/cockroachdb/cockroach/pkg/jobs/wait.go:202
        	            	  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).runMigration
        	            	  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:584
        	            	  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).RunPermanentUpgrades
        	            	  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:269
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServer).preStart
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_sql.go:1514
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).PreStart
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/tenant.go:642
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).Start
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/tenant.go:745
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*Server).startTenantServerInternal
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_new_server.go:153
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*Server).newTenantServer
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_new_server.go:82
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).startServerInternal
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:361
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).startControlledServer.func2
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:233
        	            	  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
        	            	  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
        	            	Wraps: (2) polling for queued jobs to complete
        	            	Wraps: (3) attached stack trace
        	            	  -- stack trace:
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).execInternal.func1.1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:844
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:424
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*rowsIterator).Next
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:475
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).queryInternalBuffered
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:585
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRowExWithCols
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:637
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).QueryRowEx
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:623
        	            	  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).waitForJobsToBeTerminalOrPaused
        	            	  | 	github.com/cockroachdb/cockroach/pkg/jobs/wait.go:110
        	            	  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).waitForJobs
        	            	  | 	github.com/cockroachdb/cockroach/pkg/jobs/wait.go:146
        	            	  | github.com/cockroachdb/cockroach/pkg/jobs.(*Registry).Run
        	            	  | 	github.com/cockroachdb/cockroach/pkg/jobs/wait.go:202
        	            	  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).runMigration
        	            	  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:584
        	            	  | github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager.(*Manager).RunPermanentUpgrades
        	            	  | 	github.com/cockroachdb/cockroach/pkg/upgrade/upgrademanager/manager.go:269
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServer).preStart
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_sql.go:1514
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).PreStart
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/tenant.go:642
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*SQLServerWrapper).Start
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/tenant.go:745
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*Server).startTenantServerInternal
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_new_server.go:153
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*Server).newTenantServer
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_new_server.go:82
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).startServerInternal
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:361
        	            	  | github.com/cockroachdb/cockroach/pkg/server.(*serverController).startControlledServer.func2
        	            	  | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_orchestration.go:233
        	            	  | [...repeated from below...]
        	            	Wraps: (4) poll-show-jobs
        	            	Wraps: (5) attached stack trace
        	            	  -- stack trace:
        	            	  | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*leasedDescriptors).maybeUpdateDeadline
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/leased_descriptors.go:206
        	            	  | github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).MaybeUpdateDeadline
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:160
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:275
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:130
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2530
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:129
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1943
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1948
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1866
        	            	  | github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).runWithEx.func1
        	            	  | 	github.com/cockroachdb/cockroach/pkg/sql/internal.go:171
        	            	  | runtime.goexit
        	            	  | 	GOROOT/src/runtime/asm_amd64.s:1594
        	            	Wraps: (6) liveness session expired 6.50590502s before transaction
        	            	Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
        	Test:       	TestTransientClusterMultitenant

@knz knz added the A-jobs label Jan 30, 2023
@blathers-crl blathers-crl bot added the T-jobs label Jan 30, 2023
@ajwerner
Copy link
Contributor

Yeah, I think in some sense we're missing logic to treat this as a retry. I don't think the session actually died. I think that it was so overloaded that the sql query thought the session died. I'm trying to repro and understand. The job itself did not fail -- and it if it had, it would be retried.

@ajwerner
Copy link
Contributor

There's some weird things going on here. For one, I see both T2,n2 and T2,nsql2. Should that happen?

@knz
Copy link
Contributor

knz commented Jan 30, 2023

I'd like to see the details.
It's possible that we're propagating the thread ID across the "internal client adapter" - recent work that Andrei did to avoid a network hop for local RPCs. I know Andrei has other in-flight work in that area. Definitely worth investigating further, if only to file an issue.

@ajwerner
Copy link
Contributor

@ajwerner
Copy link
Contributor

The bisect points conclusively at 3ff9bc9.

@knz
Copy link
Contributor

knz commented Jan 30, 2023

Cool - I know @stevendanna is investigating something about that already.

@ajwerner
Copy link
Contributor

Some clues about things being weird:

I230130 16:39:52.843456 5320 1@server/tenant.go:610 ⋮ [T2,n3] 394  starting grpc server at ‹127.0.0.1:40203›
I230130 16:39:52.843467 5320 1@server/tenant.go:611 ⋮ [T2,n3] 395  advertising SQL server node at ‹127.0.0.1:40203›
I230130 16:39:52.886553 5320 sql/sqlinstance/instancestorage/instancestorage.go:249 ⋮ [T2,nsql?] 406  assigning instance id to rpc addr ‹127.0.0.1:40203› and sql addr ‹127.0.0.1:40003›
I230130 16:39:52.900307 5320 server/server_sql.go:1429 ⋮ [T2,nsql?] 409  bound sqlinstance: Instance{RegionPrefix: gA==, InstanceID: ‹2›, SQLAddr: ‹127.0.0.1:40003›, RPCAddr: ‹127.0.0.1:40203›, SessionID: 01018072478359e8c84780a42cb4de9ee7012b, Locality: ‹prize-winner=otan›}
I230130 16:39:53.281360 7324 1@circuitbreaker/circuitbreaker.go:322 ⋮ [T2,nsql2] 463  circuitbreaker: ‹rpc 127.0.0.1:40203 [n3]› tripped: unable to look up descriptor for n3: non existent SQL instance
I230130 16:39:53.281412 7324 1@circuitbreaker/circuitbreaker.go:447 ⋮ [T2,nsql2] 464  circuitbreaker: ‹rpc 127.0.0.1:40203 [n3]› event: ‹BreakerTripped›

Notice that tenant nsql2 seems to think that its own rpc address is n3's RPC address?

@msbutler
Copy link
Collaborator

msbutler commented Feb 1, 2023

i've hit this twice on an unrelated PR. Going to skip

msbutler added a commit to msbutler/cockroach that referenced this issue Feb 1, 2023
craig bot pushed a commit that referenced this issue Feb 1, 2023
96360: multitenant: skip TestTransientClusterMultiTenant r=knz a=msbutler

Informs #96162

Release note: None

Epic: none

Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Feb 13, 2023
@rafiss rafiss added the T-multitenant Issues owned by the multi-tenant virtual team label Feb 13, 2023
@ajstorm ajstorm added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 7, 2023
@dhartunian dhartunian added GA-blocker branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 8, 2023
@knz knz removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. GA-blocker labels Mar 13, 2023
@knz
Copy link
Contributor

knz commented Jul 12, 2023

I found the likely cause of the flake: a cancellable context was passed to the Start() method.
I'll send a PR now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cluster-upgrades A-jobs branch-master Failures and bugs on the master branch. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-jobs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants