debug: add LSM health explicitly to the debug zip #79518

Closed
nicktrav opened this issue Apr 6, 2022 · 3 comments · Fixed by #125865
Labels
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
E-quick-win: Likely to be a quick win for someone experienced.
E-starter: Might be suitable for a starter project for new employees or team members.
T-storage: Storage Team

Comments

@nicktrav
Collaborator

nicktrav commented Apr 6, 2022

Pebble exposes a view of the LSM, which we surface in the Cockroach console via the /debug/lsm endpoint.

When debugging storage-related issues, we often want to know the state of the LSM at the time the debug.zip was taken. To do this, we typically dig through the logs for the last instance of the LSM printout, which may be up to 10 minutes old.

To speed up the diagnosis of storage health issues, we should consider adding the LSM health to the debug zip as its own dedicated file for each node. For instance, under nodes/$N/lsm.txt, we can dump the same output that the /debug/lsm endpoint returns.
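The per-node layout proposed above can be sketched roughly as follows. This is an illustrative Python sketch, not the CockroachDB implementation: `add_lsm_files` and the `fetch_lsm` callback are hypothetical names, the callback stands in for whatever retrieves the /debug/lsm text from a node, and the `.err.txt` suffix for failed nodes is an assumption.

```python
import io
import zipfile
from typing import Callable, Iterable, List


def add_lsm_files(
    zf: zipfile.ZipFile,
    node_ids: Iterable[int],
    fetch_lsm: Callable[[int], str],
) -> List[str]:
    """Write one nodes/<id>/lsm.txt entry per node; return the entry names.

    fetch_lsm is a hypothetical callback that returns the same text a node
    would serve from /debug/lsm.
    """
    written = []
    for node_id in node_ids:
        name = f"nodes/{node_id}/lsm.txt"
        try:
            body = fetch_lsm(node_id)
        except Exception as err:
            # Assumed convention: record the failure next to the expected
            # file and keep collecting from the remaining nodes.
            name += ".err.txt"
            body = f"failed to fetch LSM state: {err}\n"
        zf.writestr(name, body)
        written.append(name)
    return written
```

With this shape, one unreachable node does not prevent the LSM state of the other nodes from landing in the archive.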

For historical LSM metrics, the logs can still be used, in addition to the metrics captured in the tsdump (L0 size, L0 file count, etc.).

Jira issue: CRDB-14895

@nicktrav nicktrav added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team labels Apr 6, 2022
@nicktrav nicktrav added the O-postmortem Originated from a Postmortem action item. label Apr 6, 2022
@nicktrav nicktrav self-assigned this Apr 6, 2022
@rail
Member

rail commented May 25, 2022

Manually synced with Jira

@jlinder jlinder removed the sync-me-3 label May 27, 2022
@nicktrav nicktrav added E-starter Might be suitable for a starter project for new employees or team members. E-quick-win Likely to be a quick win for someone experienced. labels Jun 10, 2022
@nicktrav
Collaborator Author

nicktrav commented Mar 7, 2023

This is under consideration for 23.2. Removing the O-postmortem label.

@nicktrav nicktrav removed the O-postmortem Originated from a Postmortem action item. label Mar 7, 2023
@nicktrav
Collaborator Author

One possible implementation:

  • The engine stats endpoint, a holdover from RocksDB, is already included in the debug.zip.
  • Engine stats currently returns nothing, and we were considering removing it.
  • Rather than removing the endpoint, we could repurpose it to return the LSM stats.
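As a rough illustration of the repurposing idea, the sketch below shows an endpoint whose previously empty response is filled with one formatted LSM summary per store. It is written in Python for brevity and every name in it (`EngineStatsResponse`, `engine_stats`, `render`, the per-store callbacks) is hypothetical; it does not reflect the actual status server API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class EngineStatsResponse:
    # Hypothetical response shape: formatted LSM summary keyed by store ID.
    lsm_by_store: Dict[int, str] = field(default_factory=dict)


def engine_stats(stores: Dict[int, Callable[[], str]]) -> EngineStatsResponse:
    """Fill the formerly empty response with each store's LSM summary.

    Each value in `stores` is a hypothetical callback that returns the
    store's formatted LSM metrics as text.
    """
    resp = EngineStatsResponse()
    for store_id in sorted(stores):
        resp.lsm_by_store[store_id] = stores[store_id]()
    return resp


def render(resp: EngineStatsResponse) -> str:
    """Join the per-store summaries into one text blob, e.g. for lsm.txt."""
    return "".join(
        f"Store {store_id}:\n{text}\n"
        for store_id, text in sorted(resp.lsm_by_store.items())
    )
```

Reusing the existing endpoint in this way would mean the debug zip collector needs no new route, only a handler for the now non-empty response.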

@jbowens jbowens added this to Storage Jun 4, 2024
@jbowens jbowens moved this to Tactical Wins in Storage Jun 4, 2024
anish-shanbhag added a commit to anish-shanbhag/cockroach that referenced this issue Jun 18, 2024
Metrics from the storage engine are already exposed in the
`/debug/lsm` HTTP endpoint. These can be useful when debugging storage
issues, and so this change adds these metrics to the debug zip under
`/nodes/$N/lsm.txt` in the same text format as the HTTP route. The
previously unused `EngineStats` status endpoint was repurposed to
serve these metrics from each node.

Fixes: cockroachdb#79518
Epic: none
Release note: none
craig bot pushed a commit that referenced this issue Jun 24, 2024
125865: cli: add storage engine metrics to debug zip r=itsbilal a=anish-shanbhag

Metrics from the storage engine are already exposed in the `/debug/lsm` HTTP endpoint. These can be useful when debugging storage issues, and so this change adds these metrics to the debug zip under `/nodes/$N/lsm.txt` in the same text format as the HTTP route. The previously unused `EngineStats` status endpoint was repurposed to serve these metrics from each node.

Fixes: #79518
Epic: none
Release note: none

126087: roachtest: improve logging in gossip/chaos/nodes=9 further r=nvanbenschoten a=nvanbenschoten

Informs #124828.
Informs #126077.

Release note: None

Co-authored-by: Anish Shanbhag <anish.shanbhag@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in cb2133f Jun 24, 2024
@github-project-automation github-project-automation bot moved this from Tactical Wins to Done in Storage Jun 24, 2024
asg0451 pushed a commit to asg0451/cockroach that referenced this issue Jun 25, 2024
3 participants