Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/streamingccl/streamingest: TestTenantStreamingPauseResumeIngestion failed #107434

Closed
cockroach-teamcity opened this issue Jul 24, 2023 · 5 comments · Fixed by #107666
Closed
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-disaster-recovery
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 24, 2023

ccl/streamingccl/streamingest.TestTenantStreamingPauseResumeIngestion failed with artifacts on release-23.1 @ 5fe2141d9f7c5e429e3e589a4dd722aa12840e53:

=== RUN   TestTenantStreamingPauseResumeIngestion
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestTenantStreamingPauseResumeIngestion2402613333
    test_log_scope.go:79: use -show-logs to present logs inline
    test_server_shim.go:439: server stop before successful start
        (1) attached stack trace
          -- stack trace:
          | github.com/cockroachdb/cockroach/pkg/server.(*channelOrchestrator).startControlledServer.func5.1
          | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_channel_orchestrator.go:307
          | github.com/cockroachdb/cockroach/pkg/server.(*channelOrchestrator).startControlledServer.func5
          | 	github.com/cockroachdb/cockroach/pkg/server/server_controller_channel_orchestrator.go:401
          | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
          | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
          | runtime.goexit
          | 	GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (2) server stop before successful start
        Error types: (1) *withstack.withStack (2) *errutil.leafError
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestTenantStreamingPauseResumeIngestion2402613333
--- FAIL: TestTenantStreamingPauseResumeIngestion (23.75s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-30041

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-disaster-recovery labels Jul 24, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jul 24, 2023
@stevendanna
Copy link
Collaborator

@lidorcarmel I think what is happening here is that the test runs to completion before the shared-process test tenant starts up on one of the nodes in the cluster. I know you are looking into another test in this same area so I've assigned to you for now.

@lidorcarmel
Copy link
Contributor

it doesn't repro locally, so I added a log fatal here (meaning failing without a retry):


and we crash with:

I230726 01:22:15.572018 15936 1@server/server_sql.go:1744  [T1,n1] 364  serving sql connections
I230726 01:22:15.576582 15936 testutils/testcluster/testcluster.go:1380  [-] 365  WaitForFullReplication
I230726 01:22:15.576732 15936 testutils/testcluster/testcluster.go:1384  [-] 366  WaitForFullReplication took: 335ns
I230726 01:22:15.856605 2088755 server/auto_upgrade.go:78  [T1,n1] 367  no need to upgrade, cluster already at the newest version
I230726 01:22:16.364270 2118800 jobs/registry.go:1588  [T1,n1] 368  AUTO UPDATE SQL ACTIVITY job 103: stepping through state running
I230726 01:22:16.455212 2118800 sql/sql_activity_update_job.go:81  [T1,n1,job=AUTO UPDATE SQL ACTIVITY id=103] 369  starting sql stats activity flush job
I230726 01:22:16.625763 2108282 kv/kvserver/replica_command.go:438  [T1,n1,s1,r63/1:/{Table/62-Max},*kvpb.AdminSplitRequest] 370  initiating a split of this range at key /Tenant/10 [r64] (manual)
I230726 01:22:16.662726 34667 kv/kvserver/tenantrate/factory.go:108  [T1,n1,s1,r64/1:{-}] 371  tenant 10 rate limiter initialized (rate: 3200 RU/s; burst: 32000 RU)
I230726 01:22:16.673817 2108282 kv/kvserver/replica_command.go:438  [T1,n1,s1,r64/1:/{Tenant/10-Max},*kvpb.AdminSplitRequest] 372  initiating a split of this range at key /Tenant/11 [r65] (manual)
I230726 01:22:16.700357 34805 kv/kvserver/tenantrate/factory.go:108  [T1,n1,s1,r65/1:{-}] 373  tenant 11 rate limiter initialized (rate: 3200 RU/s; burst: 32000 RU)
I230726 01:22:17.108966 2101005 server/server_controller_orchestration.go:118  [T1,n1] 374  tenant "source" has changed service mode, should now stop
I230726 01:22:17.109115 2179475 server/server_controller_channel_orchestrator.go:265  [-] 375  received request for tenant "source" to terminate gracefully
I230726 01:22:17.126748 2179492 server/server_controller_new_server.go:146  [n1] 376  creating tenant server
I230726 01:22:18.545169 2179492 server/server_controller_channel_orchestrator.go:365  [T10,tenant-orchestration,tenant=source,n1,start-server] 377  starting tenant server
I230726 01:22:18.545770 2179540 server/server_controller_channel_orchestrator.go:454  [tenant-orchestration,tenant=source] 378  propagate-ungraceful-stop task terminating
I230726 01:22:18.550965 2179475 server/server_controller_channel_orchestrator.go:284  [-] 379  propagate-close task terminating
I230726 01:22:18.571618 2179492 server/start_listen.go:103  [T10,n1] 380  server shutting down: instructing cmux to stop accepting
I230726 01:22:18.571754 2179492 1@util/log/event_log.go:32  [tenant-orchestration,tenant=source] 381 ={"Timestamp":1690334538571746000,"EventType":"tenant_shared_service_start","NodeID":1,"TenantName":"source","ErrorText":"while starting server: context canceled"}
W230726 01:22:18.571947 2179492 server/server_controller_channel_orchestrator.go:387  [tenant-orchestration,tenant=source] 382  unable to start server for tenant "source" (attempt 0, will retry): while starting server: context canceled
I230726 01:22:18.572185 2179492 1@server/server_controller_channel_orchestrator.go:391  [tenant-orchestration,tenant=source] 383  the server is terminating due to a fatal error (see the DEV channel for details)
I230726 01:22:18.764586 2269895 kv/kvserver/replica_consistency.go:233  [T1,n1,consistencyChecker,s1,r4/1:/System{/tsd-tse}] 385  triggering stats recomputation to resolve delta of {ContainsEstimates:4552 LastUpdateNanos:1690334535056401000 IntentAge:0 GCBytesAge:0 LiveBytes:-165602 LiveCount:-2276 KeyBytes:-129186 KeyCount:-2276 ValBytes:-36416 ValCount:-2276 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 RangeKeyCount:0 RangeKeyBytes:0 RangeValCount:0 RangeValBytes:0 SysBytes:0 SysCount:0 AbortSpanBytes:0}
F230726 01:22:18.572322 2179492 server/server_controller_channel_orchestrator.go:391  [tenant-orchestration,tenant=source] 384  FAILED HERE: while starting server: context canceled

are we trying to stop tenant source asynchronously and then quickly start it? and the stop races and cancels the context? I guess the race is one issue and the other question is why are we trying to stop anything, this should be startup code? I'll look tomorrow.
@knz & @stevendanna feel free to advise if you see anything here..

@knz
Copy link
Contributor

knz commented Jul 26, 2023

Fwiw i am.able to reproduce this locally under stress after about 10-15 runs

@lidorcarmel
Copy link
Contributor

I'm running:

./dev test pkg/ccl/streamingccl/streamingest:streamingest_test -f TestTenantStreamingPauseResumeIngestion --stress --cpus=8 -- --define gotags=bazel,gss,deadlock

on the gce worker, and I get:
152 runs so far, 0 failures, over 11m35s
same thing on the Mac, just slower.
:(
maybe you're running something else?

@lidorcarmel
Copy link
Contributor

ah! 400 runs didn't repro, but without the --cpus=8 I get a repro after 72 runs.
I guess we fail when things get slow because we're overloading cpus..

craig bot pushed a commit that referenced this issue Jul 27, 2023
107474: cli/zip: emit SQL table data using TSV by default r=abarganier a=knz

Fixes #107473.
Epic: CRDB-28893

This is a partial revert of 35738d4.

It changes the default value of the `--format` flag back from JSON to TSV.

Release note (backward-incompatible change):
THIS RELEASE NOTE CANCELS THE CORRESPONDING PREVIOUS BACKWARD-INCOMPATIBLE CHANGE. New behavior, compatible with previous versions of CockroachDB: the command `cockroach debug zip` stores data retrieved from SQL tables in the remote cluster using the TSV format by default.

Release note (cli change): The default value of the `--format` parameter to `cockroach debug zip` is `tsv`, like other CLI commands that can extract SQL data.

107489: sql: Delete invalid TestDropColumnAfterMutations test r=rafiss a=rimadeodhar

This test checks the functionality for the following sequence of events:
1. A txn adds a constraint to a column on a table.
2. A separate txn drops the column.

However, this interaction between the two txns has been made explicit by the PR #92289. Since this PR, step (2) will fail if the constraint in step (1) is in the process of being added. As a result, the elaborate set up and sequence of events being tested in TestDropColumnAfterMutations is no longer necessary.

Release note: none
Epic: none
Fixes: #76843

107664: authors: add Angela Dietz to authors r=angeladietz a=angeladietz

Release note: None
Epic: None

107666: server: fix a race in tenant creation r=knz a=lidorcarmel

Previously, scanTenantsForRunnableServices() was not holding the mutex when SELECTing for the existing tenant names, which means that the following may happen:
- scanTenantsForRunnableServices() sees that only the system tenant exists
- createServerEntryLocked() then adds another tenant while holding the mutex
- scanTenantsForRunnableServices() takes the lock and stops the tenant that was just created because only the system tenant should be alive (which is wrong)

This patch changes scanTenantsForRunnableServices() to take the mutex before SELECTing for the existing tenants in order to avoid the race.

Epic: none
Fixes: #107434
Fixes: #107343
Fixes: #107154

Release note: None

107673: opt: remove Metadata.AllUserDefinedFunctions r=mgartner a=mgartner

The metadata method `AllUserDefinedFunctions` has been replaced with a
new function `HasUserDefinedFunctions` which provides a simpler API
without exposing the underlying UDF dependency map. The map is still
available outside of the opt package via the `TestingUDFDeps` method
which is designed for testing use only.

Epic: None

Release note: None


107714: roachtest: add warning to redacted github issue r=mgartner a=mgartner

Epic: None

Release note: None

107716: ui: extend search logic on insights page r=koorosh a=koorosh

This change extends the number of fields where
search is applied (instead of single transaction/
statement execution ID field).
It makes possible to search for any available ID
in Txn or statement insight.

Release note (ui change): search is performed on all ID fields of transaction and statement insights.

Resolves: #107253

Demo:

https://github.com/cockroachdb/cockroach/assets/3106437/7fb56720-3ab2-4be4-9500-457707f6f01d



Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: rimadeodhar <rima@cockroachlabs.com>
Co-authored-by: Angela Dietz <dietz@cockroachlabs.com>
Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Andrii Vorobiov <and.vorobiov@gmail.com>
@craig craig bot closed this as completed in 5ca5703 Jul 27, 2023
lidorcarmel added a commit that referenced this issue Jul 28, 2023
Previously, scanTenantsForRunnableServices() was not holding the mutex when
SELECTing for the existing tenant names, which means that the following may
happen:
- scanTenantsForRunnableServices() sees that only the system tenant exists
- createServerEntryLocked() then adds another tenant while holding the mutex
- scanTenantsForRunnableServices() takes the lock and stops the tenant that
  was just created because only the system tenant should be alive (which
  is wrong)

This patch changes scanTenantsForRunnableServices() to take the mutex before
SELECTing for the existing tenants in order to avoid the race.

Epic: none
Fixes: #107434

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-disaster-recovery
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants