
tsdb: add tenant-level store metrics to tsdb #99860

Merged 1 commit into cockroachdb:master on Apr 4, 2023

Conversation

@abarganier (Contributor) commented Mar 28, 2023

Previously, while we had the ability to show tenant-level store metrics on the /_status/vars page, these metrics were never written to tsdb.

This is despite the changes in #98077, which did a great job of writing all the metrics in the tenant-specific metric registries, but didn't pull the tenant-specific store metrics out of the store registries. This is because these metrics exist as child metrics on the store registry metrics, and we did not previously have logic to individually pick these metrics out of their parent AggGauge/Counter metrics.

This patch adds that logic. Now, for each tenant ID that exists in the recorder's tenantRegistries map, we attempt to pick that tenant's individual child metric values out of all the metrics that exist in TenantsStorageMetrics. This limits the writing of these tenant-level metrics to deployments where multiple tenants are running in-process, as environments such as serverless clusters are expected to have an empty tenantRegistries map in this recorder.

This is necessary because if we're going to support multi-tenant tsdb, app tenants should be able to see core storage information about their logical cluster, such as livebytes, which indicates how much live, active data exists for the cluster.
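
A minimal, self-contained sketch of the per-tenant extraction described above. The names `tenantRegistries` and `livebytes` come from the PR text; the `aggGauge` type, its `childValue` method, and the printed "tsdb point" are illustrative stand-ins for the real AggGauge child-metric and tsdb APIs, not CockroachDB's actual code.

```go
package main

import "fmt"

// aggGauge stands in for an aggregate store metric that keeps per-tenant
// child values (e.g. an AggGauge).
type aggGauge struct {
	name     string
	children map[uint64]int64 // tenant ID -> child value
}

// childValue returns the tenant's child value and whether one exists.
func (g *aggGauge) childValue(tenantID uint64) (int64, bool) {
	v, ok := g.children[tenantID]
	return v, ok
}

func main() {
	// Aggregate store metrics with per-tenant children.
	storeMetrics := []*aggGauge{
		{name: "livebytes", children: map[uint64]int64{2: 1 << 20, 3: 4 << 20}},
	}
	// Tenants that run in-process and therefore have their own registry;
	// a serverless-style deployment would leave this map empty.
	tenantRegistries := map[uint64]string{2: "tenant-2", 3: "tenant-3"}

	// For each known tenant, pick its child value out of every aggregate
	// metric and record it as that tenant's own time-series point.
	for tenantID := range tenantRegistries {
		for _, m := range storeMetrics {
			if v, ok := m.childValue(tenantID); ok {
				fmt.Printf("write tsdb point: tenant=%d metric=%s value=%d\n", tenantID, m.name, v)
			}
		}
	}
}
```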

Release note: none

Fixes: #99228

@abarganier abarganier requested review from dhartunian and a team March 28, 2023 20:55
@abarganier abarganier requested review from a team as code owners March 28, 2023 20:55
@cockroach-teamcity (Member) commented Mar 28, 2023

CLA assistant check: all committers have signed the CLA.

@cockroach-teamcity (Member)

This change is Reviewable

@aadityasondhi (Collaborator) left a comment

:lgtm: this is actually pretty cool!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian)

@abarganier (Contributor, Author) left a comment

TFTR! I realized I was pulling the tenantMetricsSet with each store iteration, which was unnecessary, so I moved it outside the store registry loop.

CI was in a rough state when I originally put up this PR, so let's see if things are in a better state now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)
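
A minimal sketch of the hoisting mentioned in the comment above, using hypothetical names (`buildTenantMetricsSet`, `storeRegistries`) rather than the recorder's real code; the shape of the change is the point: the tenant metric set is computed once, not on every pass over the store registries.

```go
package main

import "fmt"

// buildTenantMetricsSet stands in for constructing the set of tenant-level
// metric names, which only needs to happen once.
func buildTenantMetricsSet() map[string]struct{} {
	return map[string]struct{}{"livebytes": {}, "keybytes": {}}
}

func main() {
	storeRegistries := []string{"store-1", "store-2", "store-3"}

	// Hoisted out of the loop: computed once, not once per store.
	tenantMetricsSet := buildTenantMetricsSet()

	for _, storeReg := range storeRegistries {
		// The per-store extraction reuses the precomputed set.
		fmt.Printf("%s: scanning %d tenant-level metrics\n", storeReg, len(tenantMetricsSet))
	}
}
```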

@abarganier (Contributor, Author) commented

(fixed an import cycle)

@abarganier (Contributor, Author) commented

bors r=aadityasondhi

@craig (bot) commented Apr 4, 2023

Build succeeded:

@craig craig bot merged commit 0d15e0d into cockroachdb:master Apr 4, 2023
@blathers-crl (bot) commented Apr 4, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-99860 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/100636/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

craig bot pushed a commit that referenced this pull request Apr 7, 2023
99663: sql: update connExecutor logic for pausable portals r=ZhouXing19 a=ZhouXing19

This PR replaces #96358 and is part of the initial implementation of multiple active portals.

----

This PR is to add limited support for multiple active portals. Now portals satisfying all of the following restrictions can be paused and resumed (i.e., with other queries interleaved with them):

1. Not an internal query;
2. Read-only query;
3. No sub-queries or post-queries.

And such a portal will only have the statement executed with a _non-distributed_ plan. 

This feature is gated by a session variable `multiple_active_portals_enabled`. When it's set `true`, all portals that satisfy the restrictions above will automatically become "pausable" when being created via the pgwire `Bind` stmt. 

The core idea of this implementation is:
1. Add a `switchToAnotherPortal` status to the result-consumption state machine. When we receive an `ExecPortal` message for a different portal, we simply return control to the connExecutor. (#99052)
2. Persist the `flow`, `queryID`, `span`, and `instrumentationHelper` for the portal, and reuse them when we re-execute the portal. This is to ensure we _continue_ the fetching rather than starting all over. (#99173)
3. To enable 2, we need to delay the cleanup of resources until we close the portal. For this we introduced stacks of cleanup functions, sketched after this list. (This PR)
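
A generic sketch of the cleanup-stack idea in item 3, with illustrative names rather than connExecutor's real types: cleanup functions are pushed as resources are acquired, and they run in LIFO order only when the portal is closed, not when it pauses.

```go
package main

import "fmt"

// cleanupStack collects deferred cleanup work for a portal.
type cleanupStack struct {
	fns []func()
}

func (s *cleanupStack) push(f func()) { s.fns = append(s.fns, f) }

// run executes the cleanups in reverse (LIFO) order and empties the stack.
func (s *cleanupStack) run() {
	for i := len(s.fns) - 1; i >= 0; i-- {
		s.fns[i]()
	}
	s.fns = nil
}

func main() {
	var portalCleanups cleanupStack
	portalCleanups.push(func() { fmt.Println("release flow") })
	portalCleanups.push(func() { fmt.Println("finish instrumentation span") })

	// Pausing the portal does NOT run the stack: the flow, span, etc. stay
	// alive so fetching can resume later. Only closing the portal runs it.
	fmt.Println("portal paused and resumed ...")
	portalCleanups.run() // portal closed: cleanups run newest-first
}
```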

Note that we kept the implementation of the original "un-pausable" portal, as we'd like to limit this new functionality to a small set of statements. Eventually some of it (e.g. the limitedCommandResult's lifecycle) should be replaced with the new code.

Also, we don't support distributed plans yet, as that involves much more complicated changes. See the `Start with an entirely local plan` section in the [design doc](https://docs.google.com/document/d/1SpKTrTqc4AlGWBqBNgmyXfTweUUsrlqIaSkmaXpznA8/edit). Support for this will come as a follow-up.

Epic: CRDB-17622

Release note (sql change): initial support for multiple active portals. Now with session variable `multiple_active_portals_enabled` set to true,  portals satisfying all following restrictions can be executed in an interleaving manner:  1. Not an internal query; 2. Read-only query; 3. No sub-queries or post-queries. And such a portal will only have the statement executed with an entirely local plan. 





99947: ui: small fixes to DB Console charts shown for secondary tenants r=dhartunian a=abarganier

#97995 updated the
DB Console to filter out KV-specific charts from the metrics page
when viewing DB Console as a secondary application tenant.

The PR missed a couple small details. This patch cleans those
up with the following:

- Removes KV latency charts for app tenants
- Adds a single storage graph for app tenants showing livebytes
- Removes the "Capacity" chart on the Overview dashboard for app
  tenants

Release note: none

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-12100

NB: Please only review the final commit. 1st commit is being reviewed separately @ #99860

100188: changefeedccl: pubsub sink refactor to batching sink r=rickystewart a=samiskin

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-13237

This change is a follow-up to #99086 which moves the Pubsub sink to the batching sink framework.

The changes involve:
1. Moving the Pubsub code to match the `SinkClient` interface, switching to the lower-level v1 pubsub API that lets us publish batches manually
2. Removing the extra call to json.Marshal
3. Moving to the `pstest` package for validating results in unit tests
4. Adding topic handling to the batching sink, where batches are created per-topic (see the sketch after this list)
5. Adding a pubsub_sink_config since the sink can now handle Retry and Flush config settings
6. Adding metrics to the old pubsub as well, for the purpose of comparing the two versions
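
A simplified sketch of the per-topic batching described above, with illustrative names (`batcher`, `maxBatchSize`) rather than the changefeed sink's real types: messages accumulate in a batch keyed by topic, and each batch flushes independently once it reaches a size limit (a real sink would also flush on a timer and propagate publish errors).

```go
package main

import "fmt"

const maxBatchSize = 3

// batcher groups messages into per-topic batches.
type batcher struct {
	batches map[string][]string // topic -> pending messages
}

// add appends a message to its topic's batch and flushes that batch once it
// reaches the size limit.
func (b *batcher) add(topic, msg string) {
	b.batches[topic] = append(b.batches[topic], msg)
	if len(b.batches[topic]) >= maxBatchSize {
		b.flush(topic)
	}
}

// flush publishes a single topic's pending batch.
func (b *batcher) flush(topic string) {
	fmt.Printf("publish %d messages to topic %q\n", len(b.batches[topic]), topic)
	b.batches[topic] = nil
}

func main() {
	b := &batcher{batches: map[string][]string{}}
	for i := 0; i < 4; i++ {
		b.add("orders", fmt.Sprintf("msg-%d", i)) // third message triggers a flush
	}
	b.add("customers", "msg-0")

	// Flush whatever remains (a real sink would do this on a timer or close).
	for topic, msgs := range b.batches {
		if len(msgs) > 0 {
			b.flush(topic)
		}
	}
}
```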

At default settings, this resulted in a peak throughput of 90k messages per second on a single node at 27.6% CPU usage, putting it at a similar level to Kafka.

Running pubsub v2 across all of TPCC (nodes ran out of ranges at different speeds):
<img width="637" alt="Screenshot 2023-03-30 at 3 38 25 PM" src="https://user-images.githubusercontent.com/6236424/229863386-edaee27d-9762-4806-bab6-e18b8a6169d6.png">

Running pubsub v1 (barely visible, 2k messages per second) followed by v2 on tpcc.order_line (in v2 only 2 nodes ended up having ranges assigned to them):
<img width="642" alt="Screenshot 2023-04-04 at 12 53 45 PM" src="https://user-images.githubusercontent.com/6236424/229863507-1883ea45-d8ce-437b-9b9c-550afec68752.png">

In the following graphs from the cloud console, where v1 was run followed by v2, you can see that the main reason v1 was slow was that it wasn't able to batch different keys together.
<img width="574" alt="Screenshot 2023-04-04 at 12 59 51 PM" src="https://user-images.githubusercontent.com/6236424/229864083-758c0814-d53c-447e-84c3-471cf5d56c44.png">

Publish requests remained the same despite way more messages in v2
<img width="1150" alt="Screenshot 2023-04-04 at 1 46 51 PM" src="https://user-images.githubusercontent.com/6236424/229875314-6e07177e-62c4-4c15-b13f-f75e8143e011.png">



Release note (performance improvement): pubsub sink changefeeds can now support higher throughputs by enabling the changefeed.new_pubsub_sink_enabled cluster setting.

100620: pkg/server: move DataDistribution to systemAdminServer r=dhartunian a=abarganier

The DataDistribution endpoint reports replica counts by database and table. When it was built, it operated under the assumption that a range would only ever contain a single table's data.

Now that we have coalesced ranges, a single range can span multiple tables. Unfortunately, the DataDistribution endpoint does not take this fact into account, meaning it reports garbled and inaccurate data, unless the `spanconfig.storage_coalesce_adjacent.enabled` setting is set to false (see #98820).

For secondary tenants, ranges are *always* coalesced, so this endpoint in its current state could never report meaningful data for a tenant.

Given all of this, we have decided to make this endpoint only available for the system tenant. This patch
accomplishes this by moving the endpoint away from the adminServer and into the systemAdminServer, making it effectively unimplemented for secondary tenants.
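
A generic sketch of the pattern described above, using hypothetical types rather than the actual pkg/server code: the real implementation lives only on the system-only server, so the server that secondary tenants reach returns an unimplemented error.

```go
package main

import (
	"errors"
	"fmt"
)

// adminServer is the server reachable by every tenant; here it only returns
// an "unimplemented" error for the endpoint.
type adminServer struct{}

// systemAdminServer embeds adminServer and is wired up only for the system
// tenant; it carries the real implementation.
type systemAdminServer struct {
	adminServer
}

func (s *adminServer) DataDistribution() (string, error) {
	return "", errors.New("DataDistribution is unimplemented for secondary tenants")
}

func (s *systemAdminServer) DataDistribution() (string, error) {
	return "replica counts by database and table", nil
}

func main() {
	var tenantSrv adminServer
	var systemSrv systemAdminServer

	if _, err := tenantSrv.DataDistribution(); err != nil {
		fmt.Println("secondary tenant:", err)
	}
	res, _ := systemSrv.DataDistribution()
	fmt.Println("system tenant:", res)
}
```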

Release note: none

Informs: #97942

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Labels: backport-23.1.x (Flags PRs that need to be backported to 23.1)

Successfully merging this pull request may close these issues:

tsdb: include TenantStorageMetrics per-tenant

3 participants