kv,*:state inspection pages for a cluster node #66772
Labels
A-kv-observability
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
(This is a tracking issue for discussion of specific ideas that can be spun off into separate issues)
We lack inspectz-style pages (Google terminology) on a node, which would show a view of the current state of certain data structures within that node. These would be used when metrics or traces indicate that we need to look more closely at a particular node.
Possible examples: the state of explicit or implicit queues (e.g. the queues for latches and locks), including who is waiting and for how long; the current LSM state and ongoing compactions; etc. These pages don't need to be fast to generate since they would be used sparingly (in the worst case, generating one could take a few seconds if the internal structure is large, and cause a few ms of delay in running queries). Such pages can use filters to keep the inspected data manageable, e.g. filtering by range, txnid, or key range.
This was less important when debug.zip was the primary way to troubleshoot, but we now have direct access in CC and for important customers, for whom extremely short remediation time is critical.
Needless to say, deciding what state needs such a page is critical and needs to be informed by actual troubleshooting experience. The tooling around this should make it very easy to create a page (i.e., any complexity should be limited to constructing the view of the internal structure, not to passing in filtering parameters or displaying/formatting the output).
@jbowens raised the following in the internal Slack thread:
is there some risk in having separate observability regimes for clusters that we have direct access to versus not? maybe we should think of a debug.zip as just a response format. if interacting directly with an inspectz-style UI, the UI requests a thin, filtered debug.zip of just the requested data. otherwise, we can instruct customers to generate a debug.zip with the same information, which may be loaded into the same UI
Jira issue: CRDB-8222