Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: add storage engine metrics to debug zip #125865

Merged
merged 1 commit into from
Jun 24, 2024

Conversation

anish-shanbhag
Copy link
Contributor

Metrics from the storage engine are already exposed in the /debug/lsm HTTP endpoint. These can be useful when debugging storage issues, and so this change adds these metrics to the debug zip under /nodes/$N/lsm.txt in the same text format as the HTTP route. The previously unused EngineStats status endpoint was repurposed to serve these metrics from each node.

Fixes: #79518
Epic: none
Release note: none

@anish-shanbhag anish-shanbhag requested review from a team and sumeerbhola June 18, 2024 17:56
@anish-shanbhag anish-shanbhag requested review from a team as code owners June 18, 2024 17:56
@anish-shanbhag anish-shanbhag requested review from xinhaoz and removed request for a team June 18, 2024 17:56
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@anish-shanbhag anish-shanbhag requested review from itsbilal and removed request for sumeerbhola June 18, 2024 17:57
@anish-shanbhag anish-shanbhag force-pushed the lsm-debug-zip branch 3 times, most recently from fccece2 to de5b22f Compare June 18, 2024 21:16
Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Broadly I like the approach here

@@ -221,28 +220,33 @@ func (ds *Server) RegisterWorkloadCollector(stores *kvserver.Stores) error {
return nil
}

// GetEngineMetrics generates a string combining debug metrics for all of the provided storage engines.
func GetEngineMetrics(engines []storage.Engine) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we should call this GetLSMStats instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is fixed

message EngineStatsRequest {
// node_id is a string so that "local" can be used to specify that no
// forwarding is necessary.
string node_id = 1;
}

message EngineStatsResponse {
repeated EngineStatsInfo stats = 1 [ (gogoproto.nullable) = false ];
// contains LSM metrics for all stores on the requested node
string metrics = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're changing the protobuf anyway, I'd have a slight preference for repeated string metrics [(gogoproto.nullable=false)], that way we have more infrastructure ready for per-store responses if we ever move in that direction. You can move the concatenation logic into a helper that can then be used by debug.GetEngineStats.

Also convention with changing protobufs is that the new field doens't use the same argument ID (the = 1 at the end) as the old field. You might need to make this 2, and you might also need a reserved 1 above it for the missing 1, but maybe that last part isn't necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling out the argument ID convention - I've updated to use = 2.

I also opted to use a store ID -> string map rather than a list of strings so that we can filter by the store ID if needed in the future.

@anish-shanbhag
Copy link
Contributor Author

anish-shanbhag commented Jun 20, 2024

I did some more investigation into whether we can just make a request to /debug/lsm to avoid needing an EngineStats endpoint altogether, but I couldn't find anywhere in the zip creation flow where we directly make a request to the debug server to retrieve info. The reference to debug events which I think we saw here isn't actually a call to a debug endpoint, but rather just indicates that events should be stored in debug/events.json. The events themselves are retrieved using a call to the admin.Events gRPC endpoint.

Since the debug zip CLI command is likely to be run on a different host than the node itself, it seems that we would have to find the node's IP in order to make a request to /debug/lsm. I'm not sure how this would look, since it doesn't look like this pattern is used for the other files in the zip.

I also looked into whether we can just include these LSM stats as a field inside the StoreStatus object referenced in #100825, which would also eliminate the need for EngineStats. I think a possible approach could be adding a TODOEngine() storage.Engine field to the storeMetrics interface here which would let us generate the LSM stats via storeMetrics.TODOEngine().GetMetrics().String(). The downside is that these store statuses seem to only be updated once every few seconds rather than generated on demand. Given this, do we think it's best to stick with the current approach with EngineStats, or is it worth trying to consolidate these endpoints despite the slightly stale results?

@itsbilal
Copy link
Member

@anish-shanbhag Thanks for looking into the debug endpoints! Given all of what you pointed out, especially the ability to dial other nodes and the ability to call the status server over grpc in addition to http, it makes sense to just use EngineStats for this.

I'm broadly in support of the approach you have here, and we can scope this change to just 1) updating EngineStats to return the LSM stats for each store, 2) pulling EngineStats and adding them to the debug zip. Once this change is merged, we should update #100825 to reflect the update to EngineStats, but otherwise leave that as a standalone issue.

Metrics from the storage engine are already exposed in the
`/debug/lsm` HTTP endpoint. These can be useful when debugging storage
issues, and so this change adds these metrics to the debug zip under
`/nodes/$N/lsm.txt` in the same text format as the HTTP route. The
previously unused `EngineStats` status endpoint was repurposed to
serve these metrics from each node.

Fixes: cockroachdb#79518
Epic: none
Release note: none
Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work!

@anish-shanbhag
Copy link
Contributor Author

TFTR!

bors r=itsbilal

@craig craig bot merged commit 92b3c59 into cockroachdb:master Jun 24, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

debug: add LSM health explicitly to the debug zip
3 participants