multitenant: Investigate tests which fail when run using the default test tenant #76378

ajstorm · 2022-02-10T15:59:15Z

A change was made with #76582 to enable all tests which leverage testServer/testCluster to run with some probability in multi-tenant mode. In doing so, several tests were found to fail or hang when run under multi-tenant mode. This tracking issue enumerates all of the problematic test cases so that they can be investigated in the future.

Jira issue: CRDB-13093

Remaining tests that specify the parameter `TODOTestTenantDisabled`

By product area:

Disaster recovery

49 pkg/ccl/backupccl
14 pkg/ccl/streamingccl
2 pkg/ccl/jobsccl

KV

9 pkg/ccl/spanconfigccl
5 pkg/kv/kvserver
2 pkg/spanconfig/spanconfigkvaccessor
1 pkg/kv/kvclient/rangefeed/rangefeedcache

Mixed bag: Obs Inf - DB Server - Multi-tenant

Multi-tenant

9 pkg/ccl/multitenantccl
4 pkg/kv/kvclient/kvtenant
2 pkg/configprofiles
2 pkg/base
1 pkg/ccl/kvccl/kvtenantccl/upgradeinterlockccl
1 pkg/multitenant/tenantcapabilities
1 pkg/ccl/testutilsccl

SQL / Migrations

CDC

10 pkg/ccl/changefeedccl
9 pkg/ccl/changefeedccl that are disabled via feedTestNoTenants option

Serverless

11 pkg/ccl/sqlproxyccl

Remaining tests that were hidden due to the CCL import (#108761)

Disaster recovery / KV

34 kv/bulk

Disaster recovery

61 jobs

CDC

8 ccl/changefeedccl/cdceval
7 ccl/changefeedccl/cdcevent
3 kv/kvclient/rangefeed
1 kv/kvclient/rangefeed/rangefeedcache

KV

SQL Queries

SQL Foundations

SQL (mixed)

8 sql/logictest/tests/local-vec-off
8 sql/logictest/tests/local-legacy-schema-changer
7 sql/logictest/tests/fakedist
7 sql/logictest/tests/fakedist-disk
5 sql/logictest/tests/fakedist-vec-off
5 sql/logictest/tests/5node-disk
4 sql/logictest/tests/local-mixed-22.2-23.1
4 sql/logictest

Multi-tenant

4 multitenant/tenantcapabilities/tenantcapabilitieswatcher
2 server/settingswatcher
2 server/tenantsettingswatcher
1 server/systemconfigwatcher
1 server/autoconfig

Obs Infra

24 server
3 obsservice/obslib/ingest
2 util/log/logcrash

Clust Obs

10 server/application_api
1 server/storage_api

Misc

40 upgrade/upgrades
1 security

The text was updated successfully, but these errors were encountered:

Previously, testServer would run by default in single-tenant mode. This PR changes testServer to run by default in multi-tenant mode, and runs all statements through the default test tenant. Since we want the bulk of our testing to remain in single-tenant mode, we only probabilistically run with tenants (with probability 0.25). This commit also contains a few pieces of cleanup which were necessary to make the testing changes (and were difficult to split into a separate commit): - Change the "Existing" flag on tenant start to "DisableCreateTenant" and make the tenant start code more tolerant of existing tenants by checking to see if the tenant exists before trying to create it. The DisableCreateTenant flag is required for the cases where we want a test to fail due to the lack of a created tenant. - Changed the multi-region backup tests to use node 0 instead of node 1, to allow them to work within a tenant. Tenants aren't able to access remote nodes of nodelocal stoarge. - Cleaned up a couple of cases where our error handling was passing nil structs to be printed out. - Had to special case the setting of kv.range_merge.queue_enabled because it is only available to the system SQL server. As a result, all tests which check for its setting had to be modified to no longer check for it (since it will only be set in the case where we have no SQL server). Additionally, we've created a tracking issue for all tests which currently don't work under the default test tenant (cockroachdb#76378). Release note: None

104228: dev: replace `stress` with `--runs_per_test` r=rail a=rickystewart `--runs_per_test` does the same thing that `stress` does (runs a test many times to weed out flakes), but better, in that: 1. Better Bazel support -- we had to hack Bazel support into `stress` to keep it working 2. Bazel gives us statistics and control over how the tests are scheduled in a way that is standardized as opposed to the ad-hoc arguments one can pass to `stress` 3. Bazel can collect logs and artifacts for different test runs and expose them to the user in a way that `stress` cannot 4. Drop-in support for remote execution when we have access to it, obviating the need for `roachprod-stress` For now, the default implementation of `dev test --stress` is to run the test 1,000 times, stopping the build if any test fails. This is meant to replicate how people use `stress` (i.e., just start it running and stop if there are any failures). The 1,000 figure is arbitrary and meant to just be a "very high number" of times to run unit tests. The default figure of 1,000 can be adjusted with `--count`. Also update documentation with some new recipes and add extra logging to warn about the change in behavior. Closes #102879 Epic: none Release note: None 107765: sql: remove calls to CreateTestServerParams r=yuzefovich a=yuzefovich This PR removes calls to `tests.CreateTestServerParams` outside of `sql_test` tests as well as does some miscellaneous cleanups along the way. The rationale for doing this is that vast majority of callers didn't actually need the setup performed by the helper but inherited the disabling of test tenant. It seems more prudent for each test to be explicit about the kind of testing knobs and settings it needs. Informs: #76378. Epic: CRDB-18499. Release note: None 107889: concurrency: enable joint claims on the request scan path r=nvanbenschoten a=arulajmani This patch addresses a TODO that was blocking joint claims on a request's scan path and adds a test to show joint claims work. In particular, previously the lock mode of a queued request was hardcoded to use locking strength `lock.Intent`. We now use the correct lock mode by snapshotting an immutable copy of the request's lock strength on to the queuedGuard when inserting a it into a lock's wait queue. Informs #102275 Release note: None 108052: changefeedccl: Release allocation when skipping events r=miretskiy a=miretskiy The changefeed (or KV feed to be precise) may skip some events when "scan boundary" is reached. Scan boundary is a timestamp when certain event occurs -- usually a schema change. But, it may also occur when the `end_time` option is set. The KV feed ignores events that have MVCC timestamp greater or equal to the scan boundary event. Unfortunately, due to a long outstanding bug, the memory allocation associated with the event would not be released when KV feed decides to skip the event. Because of this, allocated memory was "leaked" and not reclaimed. If enough additional events arrive, those leaked events may account for all of the memory budget, thus leading to inability for additional events to be added. This bug impacts any changefeeds running with the `end_time` option set. It might also impact changefeeds that observe normal schema change event, though this situation is highly unlikely(the same transaction that perform schema change had to have modified sufficient number of rows in the table to fill up all of the memory budget). Fixes #108040 Release note (enterprise change): Fix a potential "deadlock" when running changefeed with `end_time` option set. 108066: dev: set `COCKROACH_DEV_LICENSE` for `compose` r=rail a=rickystewart This test won't run without the license key set. Epic: CRDB-17171 Release note: None Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Arul Ajmani <arulajmani@gmail.com> Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>

108628: ui: user cluster settings from redux r=maryliag a=maryliag Part Of #108373 https://www.loom.com/share/e8b2bc222db848f7a55d442c23c31fe6 Use the value of the cluster setting `sql.index_recommendation.drop_unused_duration` from redux, instead of adding as part of the select. With this change, now users with VIEWACTIVITY or VIEWACTIVITYREDACTED can see index recommendations on the console, without the need the view cluster settings permission. This commit fixes on api from Database pages (Database Details and Table Details). Release note: None 108636: tests: rename `StartNewTestCluster` to `StartCluster` r=yuzefovich a=knz Part of #107986. Epic: CRDB-18499 This follows for symmetry with `StartServer`. This is a simple searc-replace substitution: `serverutils.StartNewTestCluster` -> `serverutils.StartCluster` Release note: None 108721: sqlccl: remove base.TODOTestTenantDisabled r=yuzefovich a=yuzefovich In all of the touched tests we control the tenants explicitly. Also add some log scopes. Informs: #76378. Epic: CRDB-18499. Release note: None 108734: ui: fix filter font size r=maryliag a=maryliag On CC Console, the font size of the filter was using a wrong value inherited from another class. This commit makes the value explicit to be consistent. Before <img width="623" alt="Screenshot 2023-08-14 at 3 08 24 PM" src="https://github.com/cockroachdb/cockroach/assets/1017486/59bc6306-642e-4fbb-947f-500dd335ec10"> After <img width="1187" alt="Screenshot 2023-08-14 at 3 07 39 PM" src="https://github.com/cockroachdb/cockroach/assets/1017486/27034065-97a5-40a5-863d-231debc6faad"> Epic: none Release note: None Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

108739: roachtest/costfuzz: create failure.log file on demand r=yuzefovich a=yuzefovich This commit makes it so that `failure.log` files are created lazily on demand in `costfuzz` and `unoptimized_query_oracle` roachtests. It seems nicer this way since for successful runs previously we'd see empty failure log files which were a bit confusing. Epic: None Release note: None 108744: build: update aws-cli r=pavelkalinnikov,srosenberg,herkolategan,rail a=renatolabs The version being previously used was quite old, and didn't support setting the `Throughput` parameter on `gp3` volumes. See: #108629 (comment). Epic: none Release note: None 108748: CODEOWNERS: split sql-queries-prs from sql-queries r=knz,rytaft a=michae2 This allows us to add or remove people from the PR rotation while keeping them on the team. Epic: None Release note: None 108752: ui: fix errors thrown by 0.toNumber() r=sjbarag a=sjbarag When generating JS protobuf bindings, protobufjs's `pbjs` CLI accepts a --strict-long (or the more modern --force-long) flag with the following description: > Enfores the use of 'Long' for s-/u-/int64 and s-/fixed64 fields. This uses the `long` package on NPM[^1], which it expects to be present when one of those clients is used at runtime. If `long` isn't available when `pbjs` is executed, it falls back to generating number-based fields where possible. Long is available in all consumers of `crdb-protobuf-client`, and `protobufjs` declares `long` as one of its few build-time and run-time dependencies, so nothing looked alarming. For each field with a type that might be represented by a Long (a heuristic influenced by --strict-long), that field is marked as "long" if and only if the protobufjs-internal util.Long is non-null[^2]. That util.Long value is set via the use of the ``@protobufjs/inquire`` package instead of a standard `require`[^3] though, since the former silently returns `null` instead of throwing an error[^4]. This, again, is not alarming on its own because `protobufjs` depends directly on `long`. The complication comes in with rules_js' use of the pnpm visibility rules, which are very strict when it comes to JS package dependencies. Put simply: packages only have access to the dependencies they directly declare, not transitive dependencies ("phantom dependencies")[^5]. Since ``@protobufjs/inquire`` is a separate package with no direct dependencies of its own[^6], Bazel lays out a node_modules tree where ``@protobufjs/inquire`` has no access to the `long` package. Since inaccessible packages are indistinguishable from nonexistent packages and `inquire` is intended to be silent, it returns `null`. The `pbjs` CLI then assumes Long isn't installed and falls back to number-based fields where a Long isn't strictly required. When combined with protobuf's zero-value, this caused several cases of '0.toNumber is not a function' JS errors at runtime[^7]. We requested a Long, we support Longs, but a client was silently generated without respecting --strict-long. Like with most other phantom dependencies, use pnpm.packageOverrides to declare that `@protobufjs/inquire` does in fact depend on Long. [^1]: https://www.npmjs.com/package/long [^2]: https://github.com/protobufjs/protobuf.js/blob/918ff014efe19f3eb43195ae3d71f7aeb3fcdd73/src/field.js#L153-L157 [^3]: https://github.com/protobufjs/protobuf.js/blob/918ff014efe19f3eb43195ae3d71f7aeb3fcdd73/src/util/minimal.js#L163-L167 [^4]: https://github.com/protobufjs/protobuf.js/blob/918ff014efe19f3eb43195ae3d71f7aeb3fcdd73/lib/inquire/index.js#L4-L17 [^5]: https://docs.aspect.build/rules/aspect_rules_js/docs/pnpm/#hoisting [^6]: https://github.com/protobufjs/protobuf.js/blob/918ff014efe19f3eb43195ae3d71f7aeb3fcdd73/lib/inquire/package.json [^7]: #106318 Fixes: #106318 Release note: Observability pages no longer crash when they encounter zeros (e.g. a session with no memory allocated). 108764: sql/physicalplan: make more tests compatible with secondary tenants r=maryliag,yuzefovich a=knz Informs #76378. Links to #108763. Epic: CRDB-18499 108765: sql: make TestScatterResponse work with secondary tenants r=yuzefovich a=knz First two commits from #108764 Informs #76378. Epic: CRDB-18499 Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> Co-authored-by: Sean Barag <barag@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

…108925 107863: sql: add bitwise operation for varbit with variable length r=rharding6373 a=cty123 As raised in #107821, the existing bitwise operations in CRDB follow the same standard as Postgresql that require the two operands of varbit type to have the same length. The restriction makes it easier for the implementation but harder for users. As of today, we need to apply casting and left padding to make sure the operands have the same length. Instead, it might be useful to have CRDB built-in functions that can handle this. Here I have implemented 2 basic bitwise operation functions that are tailored for varbit typed data of arbitrary length. 108525: sql: TestRandomSyntaxFunctions: skip function call and fix CREATE STATS use of span after finish r=fqazi a=fqazi Previously, this test could depending on the concurrency, could spawn a large number of schema telemetry jobs via (ctdb_internal.create_sql_schema_telemetry_job). This could lead to contention that will eventually cause this test to time out. To address this, this patch limits calling the telemetry job creation function. Fixes: #108153 Release note: None 108807: server,log: report tenant names in logging output r=stevendanna,abargainier a=knz Epic: CRDB-27982 Fixes #103406. Prior to this change, the tenant details reported in logging output were limited to just the tenant ID. This is because the original tenant server initialization code only had access to the tenant ID (provided as CLI argument). Since recently, the tenant name is also becoming known during tenant server initialization. This is currently done via the tenant connector. This commit expands on this foundation by injecting the tenant name into the logging output as soon as it is available. For example: ``` I230815 19:31:07.290757 922 sql/temporary_schema.go:554 ⋮ [T4,Vblah,n1] 148 found 0 temporary schemas ^^^^^ here ``` or using JSON: ``` {"channel_numeric":0,"channel":"DEV",...,"tenant_id":4,"tenant_name":"blah",...} ^^^^^^^^^^^^^^^^^^^^ here ``` Note: we are choosing to report the tenant name *in addition* to the tenant ID to preserve compatibilit with automation (eg. logspy) which needs to filter entries based on ID. Release note (cluster virtualization): The name of the virtual cluster (tenant), when known, is now reported in logging events. Refer to the documentation of individual logging format for details. 108823: row: remove no longer nil txn check r=yuzefovich a=yuzefovich I believe this non-nil txn check is no longer needed as of 8f7f2f4 (which fixed the only known place where we forgot to set the txn field on the FlowCtx). Epic: None Release note: None 108858: *: bump some core dependencies r=knz a=rickystewart Unfortunately we're still getting `nogo` crashes. Try bumping these dependencies even further. Informs #99988. Epic: none Release note: None 108866: storage: Write sstables in v4 format for shared ingestions r=RaduBerinde a=itsbilal We require sstables to be written with obsolete bits for shared sstable ingestion to work correctly. This change moves to writing sstables with the new format if the cluster version is high enough. Unblocks #107394. Epic: none Release note: None 108886: kvstreamer, row: adjust all tests to work with the test tenant r=yuzefovich a=yuzefovich Epic: CRDB-18499 Informs #76378 Release note: None 108924: logictest: avoid FATAL during cleanup r=rafiss a=rafiss This prevents the test from failing if there is an error closing a connection at the end of the test. There is no strong need for us to assert that closing the connection worked. fixes #108833 Release note: None 108925: streamingest: speed-up c2c cutover r=lidorcarmel a=lidorcarmel Each c2c processor checks the job info table every 30 seconds to see whether a cutover has started. Normally a cutover should take about 10 seconds. This pr changes the default to 10 seconds, to reduce the total time for cutting over. Epic: none Release note: None Co-authored-by: cty123 <ctychen2216@gmail.com> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>

108597: metrics: assign histogram metric type on histogram construction r=ericharmeling a=ericharmeling This commit assigns prometheusgo.MetricType_HISTOGRAM to the Metadata.MetricType on histogram construction. Before this change, GetMetadata() was returning the Metadata.MetricType zero value (prometheusgo.MetricType_COUNTER) for all histograms that did not explicitly specify the prometheusgo.MetricType_HISTOGRAM for Metadata.MetricType in their Metadata definitions. This prevented checks on histogram Metadata.MetricType from properly evaluating the metrics as histograms. Fixes #106448. Fixes #107701. Releaes note: None 109345: changefeedccl: refactor kvfeed startup in changeaggregator processor r=miretskiy a=jayshrivastava changefeedccl: refactor kvfeed startup in changeaggregator processor This change cleans up the code used to start up the kv feed in change aggregator processors. This change removes uncessessary code, adds a better API, and makes code easier to reason about. Informs: #96953 Release note: None Epic: None 109386: sql: adjust many tests to work with test tenant r=yuzefovich a=yuzefovich Epic: CRDB-18499 Informs #76378 Release note: None 109476: dev: make `dev test --count 1` invalidate cached test results r=dt a=rickystewart This matches the behavior of `go test`. Epic: none Release note: None 109506: changefeedccl: ensure rangefeed setting is enabled in tests r=miretskiy a=jayshrivastava Previously, many tests which create rangefeeds would not explicitly set the `kv.rangefeed.enabled` setting to be true. These tests would still work because, by default, rangefeeds are enabled via span configs. However, it was observed that span configs are not immediately applied when range splits occur. This would cause the testing rangefeed reader to encounter errors and/or timeout on very rare occasions. See #109306 (comment) for more info. This change updates these tests to set the `kv.rangefeed.enabled` cluster setting to be true, which removes the dependency on span configs. Closes: #109306 Epic: None Release note: None 109511: concurrency: use generic lists in the lock table r=nvanbenschoten a=arulajmani Now that cda4fa2 has landed, we can make use of generic lists in a few places in the lock table. Epic: none Release note: None Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com> Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Arul Ajmani <arulajmani@gmail.com>

109329: *: make more tests compatible with cluster virtualization r=yuzefovich a=knz Epic: CRDB-18499 Informs #76378. Prerequisite to #108762. (I manually tested this works with #108762) Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

110518: row: adjust TestRowFetcherMVCCMetadata to work with secondary tenants r=yuzefovich a=yuzefovich The problem was that we previously iterated over the whole key space (beyond the tenant's boundaries) which then we couldn't correctly decode using the row fetcher. Fixes: #109396. Informs: #76378. Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>

Almost all streamingccl manage tenants explicitly. In a few cases the test technically works in a secondary tenant, but I haven't gone out of my way to find more that might. Informs cockroachdb#76378 Release note: None

112165: streamingccl: remove uses of base.TODOTestTenantDisabled r=msbutler a=stevendanna Almost all streamingccl manage tenants explicitly. In a few cases the test technically works in a secondary tenant, but I haven't gone out of my way to find more that might. Informs #76378 Release note: None Co-authored-by: Steven Danna <danna@cockroachlabs.com>

ajstorm added C-investigation Further steps needed to qualify. C-label will change. A-multitenancy Related to multi-tenancy labels Feb 10, 2022

ajstorm mentioned this issue Feb 11, 2022

multitenant: randomly run tests with SQL servers (formerly known as tenants) #73085

Closed

ajstorm mentioned this issue Feb 15, 2022

multitenant: randomly run tests with tenants #76580

Closed

ajstorm mentioned this issue Feb 15, 2022

multitenant: randomly run tests with tenants #76582

Merged

yuzefovich mentioned this issue Aug 14, 2023

sqlccl: remove base.TODOTestTenantDisabled #108721

Merged

This was referenced Aug 15, 2023

sql: figure out how to populate range cache in secondary tenants #108763

Open

sql/physicalplan: make more tests compatible with secondary tenants #108764

Merged

sql: make TestScatterResponse work with secondary tenants #108765

Merged

knz mentioned this issue Aug 16, 2023

testserver: do not verify enterprise license #108762

Merged

yuzefovich mentioned this issue Aug 17, 2023

kvstreamer, row: adjust all tests to work with the test tenant #108886

Merged

knz mentioned this issue Aug 23, 2023

*: make more tests compatible with cluster virtualization #109329

Merged

yuzefovich mentioned this issue Aug 24, 2023

sql: adjust many tests to work with test tenant #109386

Merged

yuzefovich mentioned this issue Sep 13, 2023

row: adjust TestRowFetcherMVCCMetadata to work with secondary tenants #110518

Merged

knz added T-multitenant Issues owned by the multi-tenant virtual team and removed T-testeng TestEng Team C-test-failure Broken test (automatically or manually discovered). labels Sep 13, 2023

stevendanna mentioned this issue Oct 11, 2023

streamingccl: remove uses of base.TODOTestTenantDisabled #112165

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

multitenant: Investigate tests which fail when run using the default test tenant #76378

multitenant: Investigate tests which fail when run using the default test tenant #76378

ajstorm commented Feb 10, 2022 •

edited by dhartunian

Loading

multitenant: Investigate tests which fail when run using the default test tenant #76378

multitenant: Investigate tests which fail when run using the default test tenant #76378

Comments

ajstorm commented Feb 10, 2022 • edited by dhartunian Loading

Remaining tests that specify the parameter TODOTestTenantDisabled

Remaining tests that were hidden due to the CCL import (#108761)

ajstorm commented Feb 10, 2022 •

edited by dhartunian

Loading

Remaining tests that specify the parameter `TODOTestTenantDisabled`