storage: time-series metrics for level size and score #88415

nicktrav · 2022-09-21T23:06:12Z

Is your feature request related to a problem? Please describe.

We've had some recent support cases where Pebble has appeared to be making suboptimal compaction picking decisions. Specifically, in the case of cockroachlabs/support#1788, compactions at lower levels were favored, despite a growing compaction debt in L0 over a period of ~30 mins, which resulted in an increasingly inverted LSM.

Knowing the size and score of each level would allow us to better identify suboptimal compaction picking and level scoring.
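As a rough illustration of how a score time series could be used for this, the sketch below flags a level whose score keeps rising over a window, as L0 did in the support case above. The `sample` type, the `sustainedGrowth` helper, the threshold, and the sample values are all hypothetical and are not part of Pebble or CockroachDB.

```go
package main

import "fmt"

// sample is a hypothetical point from a per-level score time series.
type sample struct {
	unixSec int64   // sample timestamp, in seconds
	score   float64 // compaction score of the level at that time
}

// sustainedGrowth reports whether the score never decreased between
// consecutive samples and rose by at least minIncrease from the first
// sample to the last. The threshold is an illustrative assumption.
func sustainedGrowth(samples []sample, minIncrease float64) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if samples[i].score < samples[i-1].score {
			return false
		}
	}
	return samples[len(samples)-1].score-samples[0].score >= minIncrease
}

func main() {
	// Made-up L0 scores sampled every 10 minutes over a 30 minute window.
	l0 := []sample{
		{unixSec: 0, score: 1.0},
		{unixSec: 600, score: 1.8},
		{unixSec: 1200, score: 2.9},
		{unixSec: 1800, score: 4.2},
	}
	fmt.Println("L0 score growing:", sustainedGrowth(l0, 2.0))
}
```

A check like this only becomes possible retroactively once the scores are recorded as time-series data rather than read from periodic log output.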

Describe the solution you'd like

Add new time-series metrics for level size and score.

Describe alternatives you've considered

Continue to depend on the LSM state printout (printed to the logs once every 10 minutes), or fetch it on demand via the `/debug/lsm` endpoint.

Jira issue: CRDB-19808

Epic CRDB-20293

Currently, the only way to infer the compaction score and heuristics is to use the LSM printout from the logs (emitted once every ten minutes), or to call the `/debug/lsm` endpoint manually, and track values over time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and level score. These new metrics have names of the form `storage.$LEVEL-level-{size,score}`.

Add an additional enum value for metrics that are "unitless", such as a score.

Closes cockroachdb#88415.

Release note (ops change): Adds two new sets of per-LSM-level time-series metrics, one for level size and another for level score. These metrics are of the form `storage.$LEVEL-level-{size,score}`.
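To make the naming pattern concrete, here is a minimal sketch that expands `storage.$LEVEL-level-{size,score}` for each level of a seven-level LSM (L0 through L6). It assumes `$LEVEL` expands to `l0`, `l1`, and so on; the commit message does not spell out the expansion, and `levelMetricNames` is a hypothetical helper, not a function from the CockroachDB codebase.

```go
package main

import "fmt"

// numLevels is the number of LSM levels assumed here (L0 through L6).
const numLevels = 7

// levelMetricNames returns the size and score metric names for one level.
// The "l<N>" expansion of $LEVEL is an assumption made for illustration.
func levelMetricNames(level int) (size, score string) {
	prefix := fmt.Sprintf("storage.l%d-level", level)
	return prefix + "-size", prefix + "-score"
}

func main() {
	// Print both metric names for every level.
	for level := 0; level < numLevels; level++ {
		size, score := levelMetricNames(level)
		fmt.Println(size, score)
	}
}
```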

88395: changefeedccl: Do not block on file size based flushes r=miretskiy a=miretskiy

Prior to this change, the cloud storage sink triggered a file-size-based flush whenever a new row would push the file size beyond the configured threshold. This significantly reduced throughput whenever such an event occurred: no additional events could be added to the cloud storage sink while the previous flush was active.

This is not necessary. The cloud storage sink can trigger file-size-based flushes asynchronously. The only requirement is that if a real, non-file-based flush arrives, or if we need to emit resolved timestamps, then we must wait for all of the active flush requests to complete.

In addition, because every event added to the cloud sink has an associated allocation, which is released when the file is written out, performing flushes asynchronously is safe with respect to memory usage and accounting.

Release note (enterprise change): Changefeeds, using cloud storage sink, now have better throughput.

Release justification: performance fix

88504: kvserver: add storage time-series metrics for level size and score r=sumeerbhola,jbowens a=nicktrav

Currently, the only way to infer the compaction score and heuristics is to use the LSM printout from the logs (emitted once every ten minutes), or to call the `/debug/lsm` endpoint manually, and track values over time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and level score. These new metrics have names of the form `storage.$LEVEL-level-{size,score}`.

Closes #88415.

Release note (ops change): Adds two new sets of per-LSM-level time-series metrics, one for level size and another for level score. These metrics are of the form `storage.$LEVEL-level-{size,score}`.

88509: ui: insights overview rename 'execution id' to 'latest execution id' r=j82w a=j82w

The insights overview only shows the latest execution id per fingerprint. Renaming the column will avoid confusion where a user might expect multiple execution ids for the same fingerprint.

After, showing the new column name: (two screenshots)

closes #88456

Release justification: Category 2: Bug fixes and low-risk updates to new functionality

Release note (ui change): Rename insights overview table column 'execution id' to 'latest execution id'. This will help avoid confusion since the ui only shows the latest id per fingerprint.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>

nicktrav added the C-enhancement, A-storage, and T-storage labels Sep 21, 2022

nicktrav self-assigned this Sep 22, 2022

nicktrav mentioned this issue Sep 22, 2022

kvserver: add storage time-series metrics for level size and score #88504

Merged

craig bot closed this as completed in d41cce0 Sep 23, 2022

blathers-crl bot mentioned this issue Sep 23, 2022

release-22.2: kvserver: add storage time-series metrics for level size and score #88592

Merged

nicktrav added the sync-me label Sep 26, 2022

exalate-issue-sync bot reopened this Oct 6, 2022

exalate-issue-sync bot removed the sync-me label Oct 6, 2022

nicktrav closed this as completed Oct 6, 2022
