Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsdb: include TenantStorageMetrics per-tenant #99228

Closed
abarganier opened this issue Mar 22, 2023 · 4 comments · Fixed by #99860
Closed

tsdb: include TenantStorageMetrics per-tenant #99228

abarganier opened this issue Mar 22, 2023 · 4 comments · Fixed by #99860
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@abarganier
Copy link
Contributor

abarganier commented Mar 22, 2023

Is your feature request related to a problem? Please describe.
TenantStorageMetrics appears to record storage level metrics at the tenant level. However, our recent work to update the MetricsRecorder with tenant registries does not capture these metrics.

These metrics neither exist in the the store level registries within the recorder, because the TenantStorageMetrics exist outside of the StoreMetrics registry.

As it stands today, when we register store-level metrics with the MetricsRecorder, we don't include any of these TenantStorageMetrics from what I can tell.

Because of this, tenant-level storage metrics that a tenant would naturally want to see, such as livebytes, keybytes, etc. are not available in TSDB (and therefore, neither in DB Console).

An app tenant will naturally want to know how much live data they have in their tenant DB, so we will need to find a way to record this information into TSDB so it can be exposed in DB Console.

Describe the solution you'd like
Find a way to get these tenant-level storage metrics into TSDB, so that app tenants can view metrics such as livebytes in DB Console.

Jira issue: CRDB-25773

@abarganier abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Mar 22, 2023
@abarganier abarganier self-assigned this Mar 22, 2023
tbg added a commit to tbg/cockroach that referenced this issue Mar 23, 2023
TODO needs some sort of test, perhaps.

Closes cockroachdb#99228

Epic: none
Release note: None
@tbg
Copy link
Member

tbg commented Mar 23, 2023

Don't we just need #99353? I don't have all the internals paged in but it seems that once the registry knows about the metrics, everything else should "just work" (since we already know how to record tenants to tsdb, #98077)?

@tbg
Copy link
Member

tbg commented Mar 23, 2023

I take that back (and closed that PR). We already add the StoreMetrics to the registry, and since the TenantStoreMetrics is within, it is getting picked up:

storeRegistry.AddMetricStruct(sm)

It's clear that this is working, since we see all of these metrics exported on the prometheus endpoint.

@tbg
Copy link
Member

tbg commented Mar 23, 2023

and with set cluster setting server.child_metrics.enabled = true; we also see the actual per-tenant metrics:

# HELP rangekeycount Count of all range keys (e.g. MVCC range tombstones)
# TYPE rangekeycount gauge
rangekeycount{store="1",node_id="1"} 0
rangekeycount{store="1",node_id="1",tenant_id="123"} 0
rangekeycount{store="1",node_id="1",tenant_id="124"} 0
rangekeycount{store="1",node_id="1",tenant_id="system"} 0

@tbg
Copy link
Member

tbg commented Mar 23, 2023

So I think this is all working, it's just a different paradigm than the one here:

// Record time series from node-level registries for secondary tenants.
for tenantID, r := range mr.mu.tenantRegistries {
tenantRecorder := registryRecorder{
registry: r,
format: nodeTimeSeriesPrefix,
source: tsutil.MakeTenantSource(mr.mu.desc.NodeID.String(), tenantID.String()),
timestampNanos: now.UnixNano(),
}
tenantRecorder.record(&data)
}

The snippet above works for metrics that exist "only" for the tenant (i.e. are not aggregated across tenants). But we don't have a *Registry per tenant in the KV server, and the set of tenants is not static - we collect metrics for any tenant for which we have a range (and even some extra ones, for example if we have tenant 123 but not 124, we'll still track something for 124 because there is a range split at the end of 123's keyspace).

Instead, we use aggmetrics. You basically have a top-level metric which tracks the sum of all of its children, each child being a tenant that we collect metrics for. The top-level metrics are pretty fundamental KV metrics, like how many bytes are in the keyspace, etc.

These children are currently surfaced on the prometheus endpoint when the server.child_metrics.enabled cluster setting is set1.

I think we could still use the tenantRegistries here:

  • we update tenantStoreMetrics2 (i.e. the thing we have in kvserver for each tenant for which we have replicas) to also have a *Registry in it that has all of the metrics on the struct registered to it
  • any mutation to the map holding the tenantStoreMetrics is mirrored into the MetricsRecorder3 (which would need to be passed down via an interface)

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/2353a6a25a9750188a3a02dfee067fe2601b4278/pkg/server/status/recorder.go#L289-L296

  2. https://github.com/cockroachdb/cockroach/blob/f33f404ebaa6c6f092c4ef47a36380c00236f341/pkg/kv/kvserver/metrics.go#L2254

  3. https://github.com/cockroachdb/cockroach/blob/2353a6a25a9750188a3a02dfee067fe2601b4278/pkg/server/status/recorder.go#L185-L199

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
2 participants