
tsdb: add tenant-level store metrics to tsdb #99860

Merged 1 commit into cockroachdb:master on Apr 4, 2023

Conversation

@abarganier (Contributor) commented Mar 28, 2023

Previously, while we had the ability to show tenant-level store metrics on the /_status/vars page, these metrics were never written to tsdb.

This is despite the changes in #98077, which did a great job of writing all the metrics in the tenant-specific metric registries, but didn't pull the tenant-specific store metrics out of the store registries. This is because these metrics exist as child metrics on the store registry metrics, and we did not previously have logic to individually pick these metrics out of their parent AggGauge/Counter metrics.

This patch adds that logic. Now, for each tenant ID that exists in the recorder's tenantRegistries map, we attempt to pick that tenant's individual child metric values out of all the metrics that exist in TenantsStorageMetrics. This limits the writing of these tenant-level metrics to deployments where multiple tenants are running in-process, as environments such as serverless clusters are expected to have an empty tenantRegistries map in this recorder.

This is necessary because if we're going to support multi-tenant tsdb, app tenants should be able to see core storage information about their logical cluster, such as livebytes, which indicates how much live, active data exists for the cluster.
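
A minimal, self-contained sketch of the per-tenant extraction described above. The names `tenantRegistries` and `livebytes` come from the PR text; the `aggGauge` type, its `childValue` method, and the printed "tsdb point" are illustrative stand-ins for the real AggGauge child-metric and tsdb APIs, not CockroachDB's actual code.

```go
package main

import "fmt"

// aggGauge stands in for an aggregate store metric that keeps per-tenant
// child values (e.g. an AggGauge).
type aggGauge struct {
	name     string
	children map[uint64]int64 // tenant ID -> child value
}

// childValue returns the tenant's child value and whether one exists.
func (g *aggGauge) childValue(tenantID uint64) (int64, bool) {
	v, ok := g.children[tenantID]
	return v, ok
}

func main() {
	// Aggregate store metrics with per-tenant children.
	storeMetrics := []*aggGauge{
		{name: "livebytes", children: map[uint64]int64{2: 1 << 20, 3: 4 << 20}},
	}
	// Tenants that run in-process and therefore have their own registry;
	// a serverless-style deployment would leave this map empty.
	tenantRegistries := map[uint64]string{2: "tenant-2", 3: "tenant-3"}

	// For each known tenant, pick its child value out of every aggregate
	// metric and record it as that tenant's own time-series point.
	for tenantID := range tenantRegistries {
		for _, m := range storeMetrics {
			if v, ok := m.childValue(tenantID); ok {
				fmt.Printf("write tsdb point: tenant=%d metric=%s value=%d\n", tenantID, m.name, v)
			}
		}
	}
}
```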

Release note: none

Fixes: #99228

@abarganier abarganier requested review from dhartunian and a team March 28, 2023 20:55
@abarganier abarganier requested review from a team as code owners March 28, 2023 20:55
@cockroach-teamcity (Member) commented Mar 28, 2023

CLA assistant check: all committers have signed the CLA.

@cockroach-teamcity (Member)

This change is Reviewable

@aadityasondhi (Collaborator) left a comment

:lgtm: this is actually pretty cool!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian)

@abarganier (Contributor, Author) left a comment

TFTR! I realized I was pulling the tenantMetricsSet with each store iteration, which was unnecessary, so I moved it outside the store registry loop.

CI was in a rough state when I originally put up this PR, so let's see if things are in a better state now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)
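
A minimal sketch of the hoisting mentioned in the comment above, using hypothetical names (`buildTenantMetricsSet`, `storeRegistries`) rather than the recorder's real code; the shape of the change is the point: the tenant metric set is computed once, not on every pass over the store registries.

```go
package main

import "fmt"

// buildTenantMetricsSet stands in for constructing the set of tenant-level
// metric names, which only needs to happen once.
func buildTenantMetricsSet() map[string]struct{} {
	return map[string]struct{}{"livebytes": {}, "keybytes": {}}
}

func main() {
	storeRegistries := []string{"store-1", "store-2", "store-3"}

	// Hoisted out of the loop: computed once, not once per store.
	tenantMetricsSet := buildTenantMetricsSet()

	for _, storeReg := range storeRegistries {
		// The per-store extraction reuses the precomputed set.
		fmt.Printf("%s: scanning %d tenant-level metrics\n", storeReg, len(tenantMetricsSet))
	}
}
```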

@abarganier (Contributor, Author) commented

(fixed an import cycle)

@abarganier (Contributor, Author) commented

bors r=aadityasondhi

@craig (bot) commented Apr 4, 2023

Build succeeded:

@craig craig bot merged commit 0d15e0d into cockroachdb:master Apr 4, 2023
@blathers-crl (bot) commented Apr 4, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-99860 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/100636/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

craig bot pushed a commit that referenced this pull request Apr 7, 2023
99663: sql: update connExecutor logic for pausable portals r=ZhouXing19 a=ZhouXing19

This PR replaces #96358 and is part of the initial implementation of multiple active portals.

----

This PR is to add limited support for multiple active portals. Now portals satisfying all of the following restrictions can be paused and resumed (i.e., with other queries interleaved with them):

1. Not an internal query;
2. Read-only query;
3. No sub-queries or post-queries.

And such a portal will only have the statement executed with a _non-distributed_ plan. 

This feature is gated by a session variable `multiple_active_portals_enabled`. When it's set `true`, all portals that satisfy the restrictions above will automatically become "pausable" when being created via the pgwire `Bind` stmt. 

The core idea of this implementation is:
1. Add a `switchToAnotherPortal` status to the result-consumption state machine. When we receive an `ExecPortal` message for a different portal, we simply return control to the connExecutor. (#99052)
2. Persist the `flow`, `queryID`, `span`, and `instrumentationHelper` for the portal, and reuse them when we re-execute the portal. This is to ensure we _continue_ the fetching rather than starting all over. (#99173)
3. To enable 2, we need to delay the cleanup of resources until we close the portal. For this we introduced stacks of cleanup functions, sketched after this list. (This PR)
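
A generic sketch of the cleanup-stack idea in item 3, with illustrative names rather than connExecutor's real types: cleanup functions are pushed as resources are acquired, and they run in LIFO order only when the portal is closed, not when it pauses.

```go
package main

import "fmt"

// cleanupStack collects deferred cleanup work for a portal.
type cleanupStack struct {
	fns []func()
}

func (s *cleanupStack) push(f func()) { s.fns = append(s.fns, f) }

// run executes the cleanups in reverse (LIFO) order and empties the stack.
func (s *cleanupStack) run() {
	for i := len(s.fns) - 1; i >= 0; i-- {
		s.fns[i]()
	}
	s.fns = nil
}

func main() {
	var portalCleanups cleanupStack
	portalCleanups.push(func() { fmt.Println("release flow") })
	portalCleanups.push(func() { fmt.Println("finish instrumentation span") })

	// Pausing the portal does NOT run the stack: the flow, span, etc. stay
	// alive so fetching can resume later. Only closing the portal runs it.
	fmt.Println("portal paused and resumed ...")
	portalCleanups.run() // portal closed: cleanups run newest-first
}
```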

Note that we kept the implementation of the original "un-pausable" portal, as we'd like to limit this new functionality to a small set of statements. Eventually some of it (e.g. the limitedCommandResult's lifecycle) should be replaced with the new code.

Also, we don't support distributed plans yet, as that involves much more complicated changes. See the `Start with an entirely local plan` section in the [design doc](https://docs.google.com/document/d/1SpKTrTqc4AlGWBqBNgmyXfTweUUsrlqIaSkmaXpznA8/edit). Support for this will come as a follow-up.

Epic: CRDB-17622

Release note (sql change): initial support for multiple active portals. Now with session variable `multiple_active_portals_enabled` set to true,  portals satisfying all following restrictions can be executed in an interleaving manner:  1. Not an internal query; 2. Read-only query; 3. No sub-queries or post-queries. And such a portal will only have the statement executed with an entirely local plan. 





99947: ui: small fixes to DB Console charts shown for secondary tenants r=dhartunian a=abarganier

#97995 updated the
DB Console to filter out KV-specific charts from the metrics page
when viewing DB Console as a secondary application tenant.

The PR missed a couple small details. This patch cleans those
up with the following:

- Removes KV latency charts for app tenants
- Adds a single storage graph for app tenants showing livebytes
- Removes the "Capacity" chart on the Overview dashboard for app
  tenants

Release note: none

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-12100

NB: Please only review the final commit. 1st commit is being reviewed separately @ #99860

100188: changefeedccl: pubsub sink refactor to batching sink r=rickystewart a=samiskin

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-13237

This change is a follow-up to #99086 which moves the Pubsub sink to the batching sink framework.

The changes involve:
1. Moving the Pubsub code to match the `SinkClient` interface, switching to the lower-level v1 pubsub API that lets us publish batches manually
2. Removing the extra call to json.Marshal
3. Moving to the `pstest` package for validating results in unit tests
4. Adding topic handling to the batching sink, where batches are created per-topic (see the sketch after this list)
5. Adding a pubsub_sink_config since the sink can now handle Retry and Flush config settings
6. Adding metrics to the old pubsub as well, for the purpose of comparing the two versions
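
A simplified sketch of the per-topic batching described above, with illustrative names (`batcher`, `maxBatchSize`) rather than the changefeed sink's real types: messages accumulate in a batch keyed by topic, and each batch flushes independently once it reaches a size limit (a real sink would also flush on a timer and propagate publish errors).

```go
package main

import "fmt"

const maxBatchSize = 3

// batcher groups messages into per-topic batches.
type batcher struct {
	batches map[string][]string // topic -> pending messages
}

// add appends a message to its topic's batch and flushes that batch once it
// reaches the size limit.
func (b *batcher) add(topic, msg string) {
	b.batches[topic] = append(b.batches[topic], msg)
	if len(b.batches[topic]) >= maxBatchSize {
		b.flush(topic)
	}
}

// flush publishes a single topic's pending batch.
func (b *batcher) flush(topic string) {
	fmt.Printf("publish %d messages to topic %q\n", len(b.batches[topic]), topic)
	b.batches[topic] = nil
}

func main() {
	b := &batcher{batches: map[string][]string{}}
	for i := 0; i < 4; i++ {
		b.add("orders", fmt.Sprintf("msg-%d", i)) // third message triggers a flush
	}
	b.add("customers", "msg-0")

	// Flush whatever remains (a real sink would do this on a timer or close).
	for topic, msgs := range b.batches {
		if len(msgs) > 0 {
			b.flush(topic)
		}
	}
}
```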

At default settings, this resulted in a peak throughput of 90k messages per second on a single node at 27.6% CPU usage, putting it at a similar level to Kafka.

Running pubsub v2 across all of TPCC (nodes ran out of ranges at different speeds):
<img width="637" alt="Screenshot 2023-03-30 at 3 38 25 PM" src="https://user-images.githubusercontent.com/6236424/229863386-edaee27d-9762-4806-bab6-e18b8a6169d6.png">

Running pubsub v1 (barely visible, 2k messages per second) followed by v2 on tpcc.order_line (in v2 only 2 nodes ended up having ranges assigned to them):
<img width="642" alt="Screenshot 2023-04-04 at 12 53 45 PM" src="https://user-images.githubusercontent.com/6236424/229863507-1883ea45-d8ce-437b-9b9c-550afec68752.png">

In the following graphs from the cloud console, where v1 was run followed by v2, you can see that the main reason v1 was slow was that it wasn't able to batch different keys together.
<img width="574" alt="Screenshot 2023-04-04 at 12 59 51 PM" src="https://user-images.githubusercontent.com/6236424/229864083-758c0814-d53c-447e-84c3-471cf5d56c44.png">

Publish requests remained the same despite way more messages in v2
<img width="1150" alt="Screenshot 2023-04-04 at 1 46 51 PM" src="https://user-images.githubusercontent.com/6236424/229875314-6e07177e-62c4-4c15-b13f-f75e8143e011.png">



Release note (performance improvement): pubsub sink changefeeds can now support higher throughputs by enabling the changefeed.new_pubsub_sink_enabled cluster setting.

100620: pkg/server: move DataDistribution to systemAdminServer r=dhartunian a=abarganier

The DataDistribution endpoint reports replica counts by database and table. When it was built, it operated under the assumption that a range would only ever contain a single table's data.

Now that we have coalesced ranges, a single range can span multiple tables. Unfortunately, the DataDistribution endpoint does not take this fact into account, meaning it reports garbled and inaccurate data, unless the `spanconfig.storage_coalesce_adjacent.enabled` setting is set to false (see #98820).

For secondary tenants, ranges are *always* coalesced, so this endpoint in its current state could never report meaningful data for a tenant.

Given all of this, we have decided to make this endpoint only available for the system tenant. This patch
accomplishes this by moving the endpoint away from the adminServer and into the systemAdminServer, making it effectively unimplemented for secondary tenants.
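
A generic sketch of the pattern described above, using hypothetical types rather than the actual pkg/server code: the real implementation lives only on the system-only server, so the server that secondary tenants reach returns an unimplemented error.

```go
package main

import (
	"errors"
	"fmt"
)

// adminServer is the server reachable by every tenant; here it only returns
// an "unimplemented" error for the endpoint.
type adminServer struct{}

// systemAdminServer embeds adminServer and is wired up only for the system
// tenant; it carries the real implementation.
type systemAdminServer struct {
	adminServer
}

func (s *adminServer) DataDistribution() (string, error) {
	return "", errors.New("DataDistribution is unimplemented for secondary tenants")
}

func (s *systemAdminServer) DataDistribution() (string, error) {
	return "replica counts by database and table", nil
}

func main() {
	var tenantSrv adminServer
	var systemSrv systemAdminServer

	if _, err := tenantSrv.DataDistribution(); err != nil {
		fmt.Println("secondary tenant:", err)
	}
	res, _ := systemSrv.DataDistribution()
	fmt.Println("system tenant:", res)
}
```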

Release note: none

Informs: #97942

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Labels: backport-23.1.x (Flags PRs that need to be backported to 23.1)

Successfully merging this pull request may close these issues:

tsdb: include TenantStorageMetrics per-tenant

3 participants