storage: time-series metrics for level size and score #88415
Labels
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-storage
Storage Team
Comments
nicktrav added the C-enhancement, A-storage, and T-storage labels on Sep 21, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue on Sep 22, 2022
Currently, the only way to infer the compaction score and heuristics is to use the LSM printout from the logs (emitted once every ten minutes), or to call the `/debug/lsm` endpoint manually and track values over time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and level score. These new metrics have names of the form `storage.$LEVEL-level-{size,score}`.

Add an additional enum value for metrics that are "unitless" (for example, a "score").

Closes cockroachdb#88415.

Release note (ops change): Adds two new sets of per-LSM-level time-series metrics, one for level size and another for level score. These metrics are of the form `storage.$LEVEL-level-{size,score}`.
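To make the naming pattern concrete, here is a minimal sketch that expands `storage.$LEVEL-level-{size,score}` into individual metric names. It assumes that `$LEVEL` renders as `l0` through `l6` (Pebble has seven levels); the helper name `levelMetricNames` is hypothetical and is not taken from the CockroachDB code.

```go
package main

import "fmt"

// numLevels is the number of LSM levels in Pebble (L0 through L6).
const numLevels = 7

// levelMetricNames expands the pattern storage.$LEVEL-level-{size,score}.
// Assumption: $LEVEL is rendered as "l0" ... "l6".
func levelMetricNames() []string {
	names := make([]string, 0, 2*numLevels)
	for level := 0; level < numLevels; level++ {
		for _, kind := range []string{"size", "score"} {
			names = append(names, fmt.Sprintf("storage.l%d-level-%s", level, kind))
		}
	}
	return names
}

func main() {
	// Print one name per line, size before score for each level.
	for _, name := range levelMetricNames() {
		fmt.Println(name)
	}
}
```

Under that assumption the expansion yields 14 names, two per level.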
craig bot pushed a commit that referenced this issue on Sep 23, 2022
88395: changefeedccl: Do not block on file size based flushes r=miretskiy a=miretskiy

Prior to this change, the cloud storage sink triggered a file-size-based flush whenever a new row would push the file size beyond the configured threshold. This significantly reduced throughput whenever such an event occurred: no additional events could be added to the cloud storage sink while the previous flush was active.

This is not necessary. The cloud storage sink can trigger file-size-based flushes asynchronously. The only requirement is that if a real, non-file-based flush arrives, or if we need to emit resolved timestamps, then we must wait for all of the active flush requests to complete.

In addition, because every event added to the cloud sink has an associated allocation, which is released when the file is written out, performing flushes asynchronously is safe with respect to memory usage and accounting.

Release note (enterprise change): Changefeeds, using cloud storage sink, now have better throughput.
Release justification: performance fix

88504: kvserver: add storage time-series metrics for level size and score r=sumeerbhola,jbowens a=nicktrav

Currently, the only way to infer the compaction score and heuristics is to use the LSM printout from the logs (emitted once every ten minutes), or to call the `/debug/lsm` endpoint manually and track values over time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and level score. These new metrics have names of the form `storage.$LEVEL-level-{size,score}`.

Closes #88415.

Release note (ops change): Adds two new sets of per-LSM-level time-series metrics, one for level size and another for level score. These metrics are of the form `storage.$LEVEL-level-{size,score}`.

88509: ui: insights overview rename 'execution id' to 'latest execution id' r=j82w a=j82w

The insights overview only shows the latest execution id per fingerprint. Renaming the column avoids confusion where a user might expect multiple execution ids for the same fingerprint.

Screenshots after the change, showing the new column name:
<img width="1131" alt="Screen Shot 2022-09-22 at 4 15 12 PM" src="https://user-images.githubusercontent.com/8868107/191842420-92877707-ed8d-4eee-af00-052fd1544516.png">
<img width="1457" alt="Screen Shot 2022-09-22 at 4 14 22 PM" src="https://user-images.githubusercontent.com/8868107/191842732-8c69099d-f18a-434f-b6c5-4ce231f364cc.png">

closes #88456

Release justification: Category 2: Bug fixes and low-risk updates to new functionality
Release note (ui change): Rename insights overview table column 'execution id' to 'latest execution id'. This will help avoid confusion since the ui only shows the latest id per fingerprint.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: j82w <jwilley@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue on Sep 23, 2022
Currently, the only way to infer the compaction score and heuristics is to use the LSM printout from the logs (emitted once every ten minutes), or to call the `/debug/lsm` endpoint manually and track values over time. This makes it difficult to debug issues retroactively.

Add two new sets of per-LSM-level time-series metrics for level size and level score. These new metrics have names of the form `storage.$LEVEL-level-{size,score}`.

Closes #88415.

Release note (ops change): Adds two new sets of per-LSM-level time-series metrics, one for level size and another for level score. These metrics are of the form `storage.$LEVEL-level-{size,score}`.
Is your feature request related to a problem? Please describe.
We've had some recent support cases where Pebble has appeared to be making suboptimal compaction picking decisions. Specifically, in the case of cockroachlabs/support#1788, compactions at lower levels were favored, despite a growing compaction debt in L0 over a period of ~30 mins, which resulted in an increasingly inverted LSM.
Knowing the size and score of each level would allow us to better identify suboptimal compaction picking and level scoring.
Describe the solution you'd like
Add new time-series metrics for level size and score.
Describe alternatives you've considered
Continue to depend on the LSM state printout (prints once every 10 mins), or on demand via the `/debug/lsm` endpoint.
Jira issue: CRDB-19808
Epic CRDB-20293