admission: additional observability #82743
Comments
This has proven extremely valuable to do in internal AC-related experiments (re: #75066). https://github.com/irfansharif/cockroach/tree/220614.export-tracing is a prototype that grafts together the prometheus-compatible data from https://github.com/prometheus/client_golang/blob/main/prometheus/go_collector_latest.go. Through it we were able to correlate foreground latency spikes to Go scheduler latency spikes.
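As context, here is a minimal sketch (not the prototype linked above, and independent of client_golang) of pulling the same scheduler-latency data straight out of runtime/metrics, which is the source go_collector_latest.go reads from; the p99 walk is purely illustrative:

```go
// Minimal sketch (Go 1.17+): read the runtime's scheduler-latency histogram
// directly and report an approximate p99.
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		panic("unexpected kind for /sched/latencies:seconds")
	}
	h := samples[0].Value.Float64Histogram()

	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		fmt.Println("no scheduling-latency samples yet")
		return
	}
	// Walk buckets until 99% of observations are covered; that bucket's upper
	// boundary approximates the p99 scheduling latency, in seconds.
	target := uint64(float64(total) * 0.99)
	var cum uint64
	for i, c := range h.Counts {
		cum += c
		if cum >= target {
			fmt.Printf("approx p99 scheduling latency: %.6fs\n", h.Buckets[i+1])
			break
		}
	}
}
```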
From an internal doc, re: "Information needed from Go runtime":
IIUC, this is exactly the total sum of everything captured within.
We need this to make sense of mixed workload behavior (e.g. conversation in https://cockroachlabs.slack.com/archives/C038JEXC5AT/p1658247509643359?thread_ts=1657630075.576439&cid=C038JEXC5AT).
Adding Andrew here too to pick through the list within the next two weeks; it'll be a good way to get our feet wet.
@andrewbaptist: I'm working on exporting Go's /sched/latencies:seconds as a histogram. Want to take on the remaining items?
And record data into CRDB's internal time-series database. Informs cockroachdb#82743 and cockroachdb#87823. To export scheduling latencies to prometheus, we choose an exponential bucketing scheme with base multiple of 1.1, and the output range bounded to [50us, 100ms). This makes for ~70 buckets. It's worth noting that the default histogram buckets used in Go are not fit for our purposes. If we care about improving it, we could consider patching the runtime.

```
bucket[  0] width=0s boundary=[-Inf, 0s)
bucket[  1] width=1ns boundary=[0s, 1ns)
bucket[  2] width=1ns boundary=[1ns, 2ns)
bucket[  3] width=1ns boundary=[2ns, 3ns)
bucket[  4] width=1ns boundary=[3ns, 4ns)
...
bucket[270] width=16.384µs boundary=[737.28µs, 753.664µs)
bucket[271] width=16.384µs boundary=[753.664µs, 770.048µs)
bucket[272] width=278.528µs boundary=[770.048µs, 1.048576ms)
bucket[273] width=32.768µs boundary=[1.048576ms, 1.081344ms)
bucket[274] width=32.768µs boundary=[1.081344ms, 1.114112ms)
...
bucket[717] width=1h13m18.046511104s boundary=[53h45m14.046488576s, 54h58m32.09299968s)
bucket[718] width=1h13m18.046511104s boundary=[54h58m32.09299968s, 56h11m50.139510784s)
bucket[719] width=1h13m18.046511104s boundary=[56h11m50.139510784s, 57h25m8.186021888s)
bucket[720] width=57h25m8.186021888s boundary=[57h25m8.186021888s, +Inf)
```

Release note: None
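A minimal sketch of the bucketing scheme the commit message describes: exponential boundaries with a base multiple of 1.1, clamped to [50us, 100ms). The actual change re-buckets Go's runtime histogram rather than generating boundaries from scratch, so the exact bucket count there may differ from what this prints:

```go
// Minimal sketch: generate exponential bucket boundaries with base 1.1 over
// [50us, 100ms).
package main

import (
	"fmt"
	"time"
)

func main() {
	const base = 1.1
	lo, hi := 50*time.Microsecond, 100*time.Millisecond

	boundaries := []time.Duration{lo}
	for b := float64(lo) * base; time.Duration(b) < hi; b *= base {
		boundaries = append(boundaries, time.Duration(b))
	}
	boundaries = append(boundaries, hi) // final boundary closes the range

	fmt.Printf("%d buckets spanning [%s, %s)\n",
		len(boundaries)-1, boundaries[0], boundaries[len(boundaries)-1])
}
```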
86277: eventpb: add storage event types r=jbowens,sumeerbhola a=nicktrav

Add the `StoreStats` event type, a per-store event emitted to the `TELEMETRY` logging channel. This event type will be computed from the Pebble metrics for each store. Emit a `StoreStats` event periodically, by default once per hour, per store. Touches #85589. Release note: None. Release justification: low risk, high benefit changes to existing functionality.

87142: workload/mixed-version/schemachanger: re-enable mixed version workload r=fqazi a=fqazi

Fixes: #58489 #87477. Previously the mixed version schema changer workload was disabled because of the lack of version gates. These changes will do the following:
- Start reporting errors on this workload again.
- Disable trigrams in a mixed version state.
- Disable the insert part of the workload in a mixed version state (there is an optimizer on 22.1 that can cause some of the queries to fail).
Release justification: low risk, only extends test coverage.

87883: schedulerlatency: export Go scheduling latency metric r=irfansharif a=irfansharif

And record data into CRDB's internal time-series database. Informs #82743 and #87823. To export scheduling latencies to prometheus, we choose an exponential bucketing scheme with base multiple of 1.1, and the output range bounded to [50us, 100ms). This makes for ~70 buckets. It's worth noting that the default histogram buckets used in Go are not fit for our purposes. If we care about improving it, we could consider patching the runtime.

```
bucket[  0] width=0s boundary=[-Inf, 0s)
bucket[  1] width=1ns boundary=[0s, 1ns)
bucket[  2] width=1ns boundary=[1ns, 2ns)
bucket[  3] width=1ns boundary=[2ns, 3ns)
bucket[  4] width=1ns boundary=[3ns, 4ns)
...
bucket[270] width=16.384µs boundary=[737.28µs, 753.664µs)
bucket[271] width=16.384µs boundary=[753.664µs, 770.048µs)
bucket[272] width=278.528µs boundary=[770.048µs, 1.048576ms)
bucket[273] width=32.768µs boundary=[1.048576ms, 1.081344ms)
bucket[274] width=32.768µs boundary=[1.081344ms, 1.114112ms)
...
bucket[717] width=1h13m18.046511104s boundary=[53h45m14.046488576s, 54h58m32.09299968s)
bucket[718] width=1h13m18.046511104s boundary=[54h58m32.09299968s, 56h11m50.139510784s)
bucket[719] width=1h13m18.046511104s boundary=[56h11m50.139510784s, 57h25m8.186021888s)
bucket[720] width=57h25m8.186021888s boundary=[57h25m8.186021888s, +Inf)
```

Release note: None. Release justification: observability-only PR, low-risk high-benefit; would help understand admission control out in the wild.

88179: ui/cluster-ui: fix no most recent stmt for active txns r=xinhaoz a=xinhaoz

Fixes #87738. Previously, active txns could have an empty 'Most Recent Statement' column, even if their executed statement count was non-zero. This was due to the most recent query text being populated by the active stmt, which could be empty at the time of querying. This commit populates the last statement text for a txn even when it is not currently executing a query. This commit also removes the `isFullScan` field from active txn pages, as we cannot fill this field out without all stmts in the txn. Release note (ui change): Full scan field is removed from active txn details page. Release note (bug fix): active txns with non-zero executed statement count now always have populated stmt text, even when no stmt is being executed.

88334: kvserver: align Raft recv/send queue sizes r=erikgrinaker a=pavelkalinnikov

Fixes #87465. Release justification: performance fix. Release note: Made sending and receiving Raft queue sizes match. Previously the receiver could unnecessarily drop messages in situations when the sending queue is bigger than the receiving one.

Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
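For the 86277 change above, a minimal sketch of the periodic emission it describes (a per-store StoreStats event sent hourly by default); Store, StoreStatsEvent, and emit are hypothetical stand-ins, and log.Printf stands in for the TELEMETRY logging channel:

```go
// Minimal sketch of periodic per-store stats emission.
package main

import (
	"context"
	"log"
	"time"
)

type Store struct{ ID int }

type StoreStatsEvent struct {
	StoreID int
	// Fields computed from Pebble metrics would be populated here.
}

func emit(ev StoreStatsEvent) {
	log.Printf("telemetry: store stats for s%d", ev.StoreID)
}

// emitStoreStatsLoop emits one event per store on every tick (hourly by
// default in the change described above) until the context is canceled.
func emitStoreStatsLoop(ctx context.Context, stores []Store, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, s := range stores {
				emit(StoreStatsEvent{StoreID: s.ID})
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	emitStoreStatsLoop(ctx, []Store{{ID: 1}, {ID: 2}}, 100*time.Millisecond)
}
```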
This commit splits the WorkQueueMetric stats by priority. Because this splitting could generate a lot of new stats, only the "important" priorities are collected and considered for each queue. Informs cockroachdb#82743. Release note: None
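A minimal sketch of the per-priority segmentation this commit describes; the priority values and the metrics struct below are hypothetical. Only an "important" subset of priorities gets dedicated counters, everything else only moves the queue-wide totals:

```go
// Minimal sketch of per-priority work queue counters.
package main

import "fmt"

type Priority int

const (
	LowPri    Priority = -10
	NormalPri Priority = 0
	HighPri   Priority = 10
)

// Only these priorities get their own metric objects.
var importantPriorities = []Priority{NormalPri, HighPri}

type counters struct{ requested, admitted, errored int64 }

type workQueueMetrics struct {
	total counters
	byPri map[Priority]*counters
}

func newWorkQueueMetrics() *workQueueMetrics {
	m := &workQueueMetrics{byPri: map[Priority]*counters{}}
	for _, p := range importantPriorities {
		m.byPri[p] = &counters{}
	}
	return m
}

func (m *workQueueMetrics) recordAdmitted(p Priority) {
	m.total.admitted++
	if c, ok := m.byPri[p]; ok { // collected only for "important" priorities
		c.admitted++
	}
}

func main() {
	m := newWorkQueueMetrics()
	m.recordAdmitted(NormalPri)
	m.recordAdmitted(LowPri) // only the aggregate moves
	fmt.Println(m.total.admitted, m.byPri[NormalPri].admitted)
}
```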
Part of cockroachdb#82743. We add cluster settings to control:
- smoothing alpha for byte token computations;
- reduction factor for L0 compaction tokens, based on observed compactions.
We've found these to be useful in internal experiments, and also when looking to paper over L0 compaction variability effects up in AC. While here, print out observed smoothed compaction bytes in io_load_listener logging and introduce metrics for:
- l0 compacted bytes;
- generated l0 tokens;
- l0 tokens returned.
Release note: None
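A minimal sketch, with hypothetical names, of how the two settings above could interact: an exponential-smoothing alpha folded over observed L0 compaction bytes, and a reduction factor applied when converting the smoothed value into L0 compaction tokens:

```go
// Minimal sketch of smoothed L0 compaction bytes and derived tokens.
package main

import "fmt"

type ioTokenEstimator struct {
	alpha           float64 // smoothing alpha for byte token computations
	reductionFactor float64 // fraction of smoothed compaction bytes granted as tokens
	smoothedL0Bytes float64
}

// observe folds one interval's L0 compacted bytes into the smoothed estimate
// and returns the tokens to hand out over the next interval.
func (e *ioTokenEstimator) observe(compactedBytes float64) (tokens float64) {
	e.smoothedL0Bytes = e.alpha*compactedBytes + (1-e.alpha)*e.smoothedL0Bytes
	return e.reductionFactor * e.smoothedL0Bytes
}

func main() {
	e := &ioTokenEstimator{alpha: 0.5, reductionFactor: 0.75}
	for _, b := range []float64{100 << 20, 40 << 20, 200 << 20} {
		tokens := e.observe(b)
		fmt.Printf("compacted=%.0f smoothed=%.0f tokens=%.0f\n", b, e.smoothedL0Bytes, tokens)
	}
}
```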
Part of cockroachdb#82743. We introduce metrics for l0 compacted bytes, generated l0 tokens, and l0 tokens returned. Release note: None
We previously did not record anything into {IO,CPU} wait queue histograms when work either bypassed admission control (because of the nature of the work, or when certain admission queues were disabled through cluster settings) or used the fast path (i.e. didn't add itself to the tenant heaps). This meant that our histogram percentiles were not accurate. NB: This problem didn't exist at the flow control level, where work may not be subject to flow control depending on the mode selected ('apply_to_elastic', 'apply_to_all'); we'd still record a measured wait duration (~0ms), so we had accurate waiting-for-flow-tokens histograms. Part of cockroachdb#82743. Release note: None
110060: admission: fix wait queue histograms r=irfansharif a=irfansharif We previously did not record anything into {IO,CPU} wait queue histograms when work either bypassed admission control (because of the nature of the work, or when certain admission queues were disabled through cluster settings). This meant that our histogram percentiles were not accurate. This problem didn't exist at the flow control level where work may not be subject to flow control depending on the mode selected ('apply_to_elastic', 'apply_to_all'). We'd still record a measured wait duration (~0ms), so we had accurate waiting-for-flow-tokens histograms. Part of #82743. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
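A minimal sketch of the fix described above, with a hypothetical histogram type: work that bypasses admission (or takes the fast path) still records a ~0 wait duration, so the wait-queue percentiles reflect all work rather than only work that actually queued:

```go
// Minimal sketch: record into the wait-duration histogram even for bypassed work.
package main

import (
	"fmt"
	"time"
)

type histogram struct{ observations []time.Duration }

func (h *histogram) record(d time.Duration) { h.observations = append(h.observations, d) }

// admit records into the wait-duration histogram whether or not the work
// actually waited in the queue.
func admit(h *histogram, bypass bool, wait func() time.Duration) {
	if bypass {
		h.record(0) // previously: nothing was recorded, skewing percentiles
		return
	}
	h.record(wait())
}

func main() {
	var h histogram
	admit(&h, true, nil)
	admit(&h, false, func() time.Duration { return 3 * time.Millisecond })
	fmt.Println(h.observations) // [0s 3ms]
}
```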
Part of cockroachdb#82743. We introduce an admission.granter.io_tokens_bypassed.kv metric, that tracks the total number of tokens taken by work bypassing admission control. For example, follower writes without flow control. Aside: cockroachdb#109640 ripped out a tokens-taken-without-permission metric that was supposed to capture some of this, but even for standard admission work we'd routinely exercise that code path. When admitting work, we take 1 token, and later take the remaining without permission. Release note: None
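A minimal sketch, with hypothetical names, of an io_tokens_bypassed-style counter: tokens deducted by work bypassing admission control are tracked alongside the total tokens deducted:

```go
// Minimal sketch of a bypassed-tokens counter (Go 1.19+ for atomic.Int64).
package main

import (
	"fmt"
	"sync/atomic"
)

type granterMetrics struct {
	ioTokensTaken    atomic.Int64 // all tokens deducted
	ioTokensBypassed atomic.Int64 // subset deducted by work bypassing AC
}

func (m *granterMetrics) tookTokens(n int64, bypassed bool) {
	m.ioTokensTaken.Add(n)
	if bypassed {
		m.ioTokensBypassed.Add(n)
	}
}

func main() {
	var m granterMetrics
	m.tookTokens(512, false) // admitted work
	m.tookTokens(4096, true) // e.g. a follower write not subject to flow control
	fmt.Println(m.ioTokensTaken.Load(), m.ioTokensBypassed.Load())
}
```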
109833: kvflowcontrol: annotate/fix perf regressions r=irfansharif a=irfansharif

- Replace the flow controller level mutex-backed kvflowcontrol.Stream => token bucket map with sync.Map. On kv0/enc=false/nodes=3/cpu=96 accessing this map contributed to a high amount of mutex contention. We observe that this bucket is effectively read-only: entries for keys are written once (on creation) and read frequently after. We don't currently GC these buckets, but even if we did, the same access pattern would hold. We'll note that using a sync.Map is slightly more expensive CPU-wise.
- Replace various map accesses with individual variables. We were needlessly using maps to access one of two variables, keyed by work class, for example when maintaining metrics per work class, or tracking token adjustments. The map accesses appeared prominently in CPU profiles and were unnecessary overhead.
- Avoid using log.ExpensiveLogEnabled in hot code paths; it shows up in CPU profiles.
- Slightly reduce the surface area of kvflowhandle.Handle.mu when returning flow tokens.
- We also annotate various other points in the code where peep-hole optimizations exist, as surfaced by kv0/enc=false/nodes=3/cpu=96.
Part of #104154. Release note: None

110088: privilege: automate generation of ByName map r=ecwall a=andyyang890

This patch automates the process of generating the `ByName` map so that any newly added privileges will automatically be included. Epic: None. Release note: None

110110: admission: add metric for bypassed IO admission work r=irfansharif a=irfansharif

Part of #82743. We introduce an admission.granter.io_tokens_bypassed.kv metric that tracks the total number of tokens taken by work bypassing admission control, for example follower writes without flow control. Aside: #109640 ripped out a tokens-taken-without-permission metric that was supposed to capture some of this, but even for standard admission work we'd routinely exercise that code path. When admitting work, we take 1 token, and later take the remaining without permission. Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Andy Yang <yang@cockroachlabs.com>
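For the 109833 change above, a minimal sketch of the mutex-map to sync.Map swap, with hypothetical stand-ins for kvflowcontrol.Stream and its token bucket; the point is that entries are written once on creation and read frequently afterwards, the read-mostly pattern sync.Map's lock-free read path suits:

```go
// Minimal sketch of a read-mostly stream -> token bucket lookup via sync.Map.
package main

import (
	"fmt"
	"sync"
)

type stream struct{ tenantID, storeID uint64 } // stand-in for kvflowcontrol.Stream

type tokenBucket struct {
	mu     sync.Mutex
	tokens int64
}

var buckets sync.Map // stream -> *tokenBucket

// getBucket returns the stream's bucket, creating it on first use; later
// lookups hit sync.Map's lock-free read path.
func getBucket(s stream) *tokenBucket {
	if b, ok := buckets.Load(s); ok {
		return b.(*tokenBucket)
	}
	b, _ := buckets.LoadOrStore(s, &tokenBucket{tokens: 16 << 20})
	return b.(*tokenBucket)
}

func main() {
	b := getBucket(stream{tenantID: 1, storeID: 2})
	b.mu.Lock()
	b.tokens -= 1024
	b.mu.Unlock()
	fmt.Println(b.tokens)
}
```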
110135: ui: surface flow control metrics in overload dashboard r=irfansharif a=irfansharif

Some of this new flow control machinery changes the game for IO admission control. This commit surfaces relevant metrics to the overload dashboard:
- kvadmission.flow_controller.{regular,elastic}_wait_duration-p75
- kvadmission.flow_controller.{regular,elastic}_requests_waiting
- kvadmission.flow_controller.{regular,elastic}_blocked_stream_count
While here, we replace the storage.l0-{sublevels,num-files} metrics with admission.io.overload instead. The former showed raw counts instead of normalizing them against AC's target thresholds, and the y-axis scales for sublevels vs. files are an order of magnitude apart, so they're slightly more annoying to distinguish. Part of #82743. Release note: None

110924: release: remove SREOPS request, as part of deprecating CC qual clusters r=rail a=celiala

As per RE-462, we plan to deprecate the CC qualification clusters. In this commit, we remove the SREOPS request, which previously requested updates to the CC qualification clusters. Release note: None. Epic: RE-462

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Celia La <celia@cockroachlabs.com>
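A minimal sketch of the normalization admission.io.overload conveys, per the 110135 description above: L0 sublevel and file counts expressed as fractions of admission control's target thresholds, with the max taken as the overload score. The threshold values below are illustrative assumptions, not the shipped defaults:

```go
// Minimal sketch of a normalized IO overload score.
package main

import "fmt"

func ioOverloadScore(l0Sublevels, l0Files, sublevelThreshold, fileThreshold int) float64 {
	s := float64(l0Sublevels) / float64(sublevelThreshold)
	f := float64(l0Files) / float64(fileThreshold)
	if s > f {
		return s // a score >= 1.0 means the store is at or past the threshold
	}
	return f
}

func main() {
	// Assumed thresholds: 20 sublevels, 1000 L0 files.
	fmt.Println(ioOverloadScore(10, 400, 20, 1000)) // 0.5
}
```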
Discussed offline with @sumeerbhola; closing this issue, as the changes we still want to make already have separate issues tracked in the backlog with priorities assigned to them. The rest we don't want to invest time in.
In order of importance and/or done-ness:
* WorkQueues: segment errored, requested, and admitted counts by Pri (at least NormalPri) for relevant work queues. Since we have the histograms for the full work queue across all requests, this can tell us how foreground load behaves and how everything else does. Alternatively, add segmented histograms for only NormalPri, everything higher than NormalPri, and everything lower than NormalPri.

Jira issue: CRDB-16641
Epic CRDB-25469