kv,*:state inspection pages for a cluster node #66772

Open
sumeerbhola opened this issue Jun 23, 2021 · 3 comments
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Jun 23, 2021

(This is a tracking issue for discussion of specific ideas that can be spun off into separate issues)

We lack inspectz-style pages (Google terminology) on a node, which would show a view of the current state of certain data structures within that node. These would be used when metrics or traces have indicated that we need to look more closely at a particular node.

Possible examples: the state of explicit or implicit queues (e.g. the queues for latches and locks), including who is waiting and for how long; the current LSM state and ongoing compactions; etc. These pages don't need to be fast to generate since they would be used sparingly: in the worst case, if the internal structure is large, generating one could take a few seconds and add a few ms of delay to running queries. Such pages can use filters to keep the inspected data manageable, e.g. filtering by range, txnid, or key range.

This was less important when debug.zip was the primary way to troubleshoot, but we have direct access in CC and for important customers for whom extremely short remediation time is critical.

Needless to say, deciding what state needs such a page is critical and needs to be informed by actual troubleshooting experience. The tooling around this should make it very easy to create one (i.e., any complexity should be limited to how to construct the view of the internal structure and not on how to pass in filtering parameters or display/format the output).
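To make the "complexity limited to constructing the view" idea concrete, here is a minimal sketch in Go. Everything in it is hypothetical (the `Filter` fields, `ViewFunc`, `Registry`, the `range_id`/`txn_id` parameter names, and the toy `latchWaiters` data); it only illustrates one possible split where the tooling owns parameter parsing and output formatting, and a page author writes nothing but the view function.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Filter holds the generic filtering parameters, parsed once by the tooling.
// The field names are assumptions made for this sketch.
type Filter struct {
	RangeID string
	TxnID   string
}

// ViewFunc is the only thing a page author writes: it builds rows from
// internal state, honoring the filter.
type ViewFunc func(f Filter) [][]string

// Registry maps a page name to its view function.
type Registry map[string]ViewFunc

// Render parses filter parameters from a URL query string, runs the view,
// and formats the rows as tab-separated text.
func (r Registry) Render(page, rawQuery string) (string, error) {
	view, ok := r[page]
	if !ok {
		return "", fmt.Errorf("unknown page %q", page)
	}
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", err
	}
	f := Filter{RangeID: q.Get("range_id"), TxnID: q.Get("txn_id")}
	var sb strings.Builder
	for _, row := range view(f) {
		sb.WriteString(strings.Join(row, "\t"))
		sb.WriteByte('\n')
	}
	return sb.String(), nil
}

// latchWaiters is a toy view over made-up data: range, waiter, wait time.
func latchWaiters(f Filter) [][]string {
	all := [][]string{
		{"r1", "txn-a", "120ms"},
		{"r2", "txn-b", "3s"},
	}
	var out [][]string
	for _, row := range all {
		if f.RangeID != "" && row[0] != f.RangeID {
			continue
		}
		if f.TxnID != "" && row[1] != f.TxnID {
			continue
		}
		out = append(out, row)
	}
	return out
}

func main() {
	reg := Registry{"latches": latchWaiters}
	out, err := reg.Render("latches", "range_id=r2")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Adding a new page in this sketch is a single map entry plus a view function; the filter parsing and the formatting are shared.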

@jbowens raised the following in the internal slack thread:
is there some risk in having separate observability regimes for clusters that we have direct access to versus not? maybe we should think of a debug.zip as just a response format. if interacting directly with an inspectz-style UI, the UI requests a thin, filtered debug.zip of just the requested data. otherwise, we can instruct customers to generate a debug.zip with the same information, which may be loaded into the same UI

Jira issue: CRDB-8222

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-observability T-kv KV Team labels Jun 23, 2021
@jbowens
Collaborator

jbowens commented Jun 23, 2021

In some cases there are difficulties when the internal state is too big (the filter situation I mentioned above).

To clarify, I'm suggesting that debug.zip generation supports the same type of filtering we'd do in the inspectz-style page. The turnaround is longer if we're asking a customer to run a command, so maybe narrow filtering isn't that useful without the ability to iterate and investigate in real time.
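One way to read the "debug.zip as just a response format" suggestion is a single snapshot encoding that is either served directly or written into an archive. The sketch below assumes a hypothetical `Snapshot` type and a JSON-in-zip layout; neither is the actual debug.zip format, and the helper names are made up for illustration.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Snapshot is a hypothetical filtered view of one internal structure.
type Snapshot struct {
	Page   string            `json:"page"`
	Filter map[string]string `json:"filter"`
	Rows   [][]string        `json:"rows"`
}

// WriteJSON is the single encoding path, usable for a direct UI response.
func WriteJSON(w io.Writer, s Snapshot) error {
	return json.NewEncoder(w).Encode(s)
}

// WriteZip stores the same JSON as one entry in a zip archive.
func WriteZip(w io.Writer, s Snapshot) error {
	zw := zip.NewWriter(w)
	f, err := zw.Create(s.Page + ".json")
	if err != nil {
		return err
	}
	if err := WriteJSON(f, s); err != nil {
		return err
	}
	return zw.Close()
}

// ReadZip loads every snapshot from an archive, as a UI might after a
// customer uploads one.
func ReadZip(data []byte) ([]Snapshot, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var out []Snapshot
	for _, zf := range zr.File {
		rc, err := zf.Open()
		if err != nil {
			return nil, err
		}
		var s Snapshot
		err = json.NewDecoder(rc).Decode(&s)
		rc.Close()
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, nil
}

// RoundTrip writes a snapshot into an in-memory archive and reads it back.
func RoundTrip(s Snapshot) ([]Snapshot, error) {
	var buf bytes.Buffer
	if err := WriteZip(&buf, s); err != nil {
		return nil, err
	}
	return ReadZip(buf.Bytes())
}

func main() {
	s := Snapshot{
		Page:   "latches",
		Filter: map[string]string{"range_id": "r2"},
		Rows:   [][]string{{"r2", "txn-b", "3s"}},
	}
	loaded, err := RoundTrip(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(loaded[0].Page, len(loaded[0].Rows))
}
```

Because both paths go through `WriteJSON`, the same UI code could in principle consume a live response or an uploaded archive.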

@sumeerbhola
Collaborator Author

The tooling around this should make it very easy to create one (i.e., any complexity should be limited to how to construct the view of the internal structure and not on how to pass in filtering parameters or display/format the output).

Note that this issue is about the tooling, and not about constructing the actual pages, which, given the tooling, usually becomes trivial. One important part of the tooling is the ability to create a multi-column table with string and numeric types that can be re-sorted in the end-user's browser in descending/ascending order of any column. The end-user being able to filter the table using a regexp search on a string column would be an additional plus.
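As a rough illustration of the table semantics described above, here is a sketch in Go. The real sorting and filtering would happen in the browser; this only models the intended behavior (typed cells, sort by any column in either direction, regexp filter on a string column). The `Cell` and `Table` types and the sample rows are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Cell holds either a string or a numeric value; IsNum selects which.
type Cell struct {
	Str   string
	Num   float64
	IsNum bool
}

// S and N build string and numeric cells respectively.
func S(s string) Cell  { return Cell{Str: s} }
func N(n float64) Cell { return Cell{Num: n, IsNum: true} }

// Table is a multi-column table with a header row.
type Table struct {
	Columns []string
	Rows    [][]Cell
}

// less orders numeric cells numerically and all other cells lexically.
func less(a, b Cell) bool {
	if a.IsNum && b.IsNum {
		return a.Num < b.Num
	}
	return a.Str < b.Str
}

// SortBy re-sorts rows in place by column col, ascending or descending.
func (t *Table) SortBy(col int, desc bool) {
	sort.SliceStable(t.Rows, func(i, j int) bool {
		if desc {
			return less(t.Rows[j][col], t.Rows[i][col])
		}
		return less(t.Rows[i][col], t.Rows[j][col])
	})
}

// FilterRegexp returns a new table containing only the rows whose string
// column col matches pattern.
func (t *Table) FilterRegexp(col int, pattern string) (*Table, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	out := &Table{Columns: t.Columns}
	for _, row := range t.Rows {
		if re.MatchString(row[col].Str) {
			out.Rows = append(out.Rows, row)
		}
	}
	return out, nil
}

func main() {
	tbl := &Table{
		Columns: []string{"waiter", "wait_ms"},
		Rows: [][]Cell{
			{S("txn-a"), N(120)},
			{S("txn-b"), N(3000)},
			{S("req-c"), N(45)},
		},
	}
	tbl.SortBy(1, true) // longest waiters first
	txns, err := tbl.FilterRegexp(0, "^txn-")
	if err != nil {
		panic(err)
	}
	for _, row := range txns.Rows {
		fmt.Println(row[0].Str, row[1].Num)
	}
}
```

Keeping the numeric type on the cell, rather than formatting numbers to strings up front, is what lets a column like `wait_ms` sort as 45 < 120 < 3000 instead of lexically.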

@irfansharif
Contributor

We're introducing lots of high-cardinality state as part of #95563. Each node, for example, maintains token buckets per "replication stream", defined by the <tenant id,store id> it's issuing replication traffic on behalf of + to. Each proposer replica itself manages a range-oriented view of active replication streams. These are hard to observe through aggregate metrics, but inspectz-style pages seem a lot more appropriate for zooming into in-memory state: which replication streams are blocked (due to unavailable flow tokens), which ranges are blocked, and due to which replicas specifically.
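A sketch of the kind of query such a page might answer, in Go. The `Stream` type, the token map, and the range-to-stream map are hypothetical stand-ins and not the actual data structures from #95563; the sketch also treats "blocked" as "available tokens <= 0", which is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream identifies a replication stream by <tenant id, store id>.
type Stream struct {
	TenantID uint64
	StoreID  uint64
}

// BlockedStreams returns the streams with no available flow tokens,
// sorted by tenant and then store for stable output.
func BlockedStreams(tokens map[Stream]int64) []Stream {
	var out []Stream
	for s, avail := range tokens {
		if avail <= 0 {
			out = append(out, s)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].TenantID != out[j].TenantID {
			return out[i].TenantID < out[j].TenantID
		}
		return out[i].StoreID < out[j].StoreID
	})
	return out
}

// BlockedRanges maps each range ID to the blocked streams it replicates
// over. Ranges with no blocked stream are omitted, as are streams that
// have no token bucket.
func BlockedRanges(rangeStreams map[int64][]Stream, tokens map[Stream]int64) map[int64][]Stream {
	out := make(map[int64][]Stream)
	for rangeID, streams := range rangeStreams {
		for _, s := range streams {
			if avail, ok := tokens[s]; ok && avail <= 0 {
				out[rangeID] = append(out[rangeID], s)
			}
		}
	}
	return out
}

func main() {
	tokens := map[Stream]int64{
		{TenantID: 1, StoreID: 1}: 4096,
		{TenantID: 1, StoreID: 2}: 0,
		{TenantID: 2, StoreID: 3}: -128,
	}
	rangeStreams := map[int64][]Stream{
		10: {{TenantID: 1, StoreID: 1}, {TenantID: 1, StoreID: 2}},
		11: {{TenantID: 1, StoreID: 1}},
	}
	fmt.Println(BlockedStreams(tokens))
	fmt.Println(BlockedRanges(rangeStreams, tokens))
}
```

An aggregate metric would collapse this to a count of blocked streams; the per-stream and per-range views are what identify the specific store a range is waiting on.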
