storage: time-series metrics for level size and score #88415

Closed
nicktrav opened this issue Sep 21, 2022 · 0 comments · Fixed by #88504
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@nicktrav
Collaborator

nicktrav commented Sep 21, 2022

Is your feature request related to a problem? Please describe.

We've had some recent support cases where Pebble has appeared to be making suboptimal compaction picking decisions. Specifically, in the case of cockroachlabs/support#1788, compactions at lower levels were favored, despite a growing compaction debt in L0 over a period of ~30 mins, which resulted in an increasingly inverted LSM.

Knowing the size and score of each level would allow us to better identify suboptimal compaction picking and level scoring.
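As a rough illustration of how per-level sizes could flag the kind of inversion described above, here is a minimal sketch in Go. It assumes one simple reading of "inverted": a non-empty level holding more bytes than the next non-empty level below it. The function name `invertedLevels` and the sample sizes are hypothetical and are not taken from Pebble or CockroachDB.

```go
package main

import "fmt"

// invertedLevels returns the indexes of levels that hold more bytes than the
// next non-empty level below them. Empty levels are skipped, since an LSM
// commonly leaves some intermediate levels unused.
// This is an illustrative heuristic, not Pebble's compaction scoring.
func invertedLevels(sizes []int64) []int {
	var inverted []int
	prev := -1 // index of the most recent non-empty level
	for i, size := range sizes {
		if size == 0 {
			continue
		}
		if prev >= 0 && sizes[prev] > size {
			inverted = append(inverted, prev)
		}
		prev = i
	}
	return inverted
}

func main() {
	// Hypothetical sizes in bytes for L0..L5: L0 has grown past L3.
	sizes := []int64{500 << 20, 0, 0, 100 << 20, 1 << 30, 10 << 30}
	fmt.Println(invertedLevels(sizes)) // prints [0]
}
```

With a time series of level sizes, a check like this could be evaluated at each sample to see when the inversion began and how long it lasted.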

Describe the solution you'd like

Add new time-series metrics for level size and score.

Describe alternatives you've considered

Continue to depend on the LSM state printout in the logs (emitted once every 10 minutes), or fetch it on demand via the `/debug/lsm` endpoint.

Jira issue: CRDB-19808

Epic CRDB-20293

@nicktrav nicktrav added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Sep 21, 2022
@nicktrav nicktrav self-assigned this Sep 22, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue Sep 22, 2022
Currently, the only way to infer the compaction score and heuristics is
to use the LSM printout from the logs (emitted once every ten minutes),
or to call the `/debug/lsm` endpoint manually, and track values over
time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and
level score. These new metrics have names of the form
`storage.$LEVEL-level-{size,score}`.

Add an additional enum value for metrics that are "unitless". For
example, a "score".

Closes cockroachdb#88415.

Release note (ops change): Adds two new sets of per-LSM-level
time-series metrics, one for level size and another for level score.
These metrics are of the form `storage.$LEVEL-level-{size,score}`.
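To make the naming pattern concrete, the sketch below expands `storage.$LEVEL-level-{size,score}` into individual metric names. It assumes that `$LEVEL` is rendered as `l0` through `l6` (Pebble has seven levels); the helper `levelMetricNames` is hypothetical and is not the code from the commit.

```go
package main

import "fmt"

// numLevels is the number of LSM levels in Pebble (L0 through L6).
const numLevels = 7

// levelMetricNames expands the pattern storage.$LEVEL-level-{size,score}
// for the given number of levels. The "l<N>" spelling of $LEVEL is an
// assumption made for this sketch.
func levelMetricNames(levels int) []string {
	names := make([]string, 0, 2*levels)
	for level := 0; level < levels; level++ {
		for _, kind := range []string{"size", "score"} {
			names = append(names, fmt.Sprintf("storage.l%d-level-%s", level, kind))
		}
	}
	return names
}

func main() {
	for _, name := range levelMetricNames(numLevels) {
		fmt.Println(name)
	}
}
```

Under that assumption the pattern yields fourteen series per store, two for each level.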
craig bot pushed a commit that referenced this issue Sep 23, 2022
88395: changefeedccl: Do not block on file size based flushes r=miretskiy a=miretskiy

Prior to this change, the cloud storage sink triggered a
file-size-based flush whenever a new row
would push the file size beyond the configured threshold.

This had the effect of significantly reducing throughput whenever such an event occurred: no additional events could be added to the cloud storage sink while the previous flush was active.

This is not necessary. The cloud storage sink can trigger file-size-based flushes asynchronously. The only requirement is that if a real, non-file-based flush arrives, or if we need to emit resolved timestamps, then we must wait for all of the active flush requests to complete.

In addition, because every event added to the cloud storage sink has an associated allocation, which is released when the file is written out, performing flushes asynchronously is safe with respect to memory usage and accounting.

Release note (enterprise change): Changefeeds, using cloud storage sink, now have better throughput.
Release justification: performance fix

88504: kvserver: add storage time-series metrics for level size and score r=sumeerbhola,jbowens a=nicktrav

Currently, the only way to infer the compaction score and heuristics is to use the LSM printout from the logs (emitted once every ten minutes), or to call the `/debug/lsm` endpoint manually, and track values over time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and level score. These new metrics have names of the form `storage.$LEVEL-level-{size,score}`.

Closes #88415.

Release note (ops change): Adds two new sets of per-LSM-level time-series metrics, one for level size and another for level score. These metrics are of the form `storage.$LEVEL-level-{size,score}`.

88509: ui: insights overview rename 'execution id' to 'latest execution id' r=j82w a=j82w

The insights overview only shows the latest execution id per fingerprint. Renaming the column avoids confusion where a user might expect multiple execution ids for the same fingerprint.

[Two screenshots showing the insights overview table after the change, with the new column name.]

closes #88456

Release justification: Category 2: Bug fixes and
low-risk updates to new functionality

Release note (ui change): Rename insights overview table column 'execution id' to 'latest execution id'. This will help avoid confusion since the UI only shows the latest id per fingerprint.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>
@craig craig bot closed this as completed in d41cce0 Sep 23, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 23, 2022
@exalate-issue-sync exalate-issue-sync bot reopened this Oct 6, 2022
@nicktrav nicktrav closed this as completed Oct 6, 2022