Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outliers: configurable statistical detection per fingerprint #79451

Closed
matthewtodd opened this issue Apr 5, 2022 · 1 comment · Fixed by #82473
Closed

outliers: configurable statistical detection per fingerprint #79451

matthewtodd opened this issue Apr 5, 2022 · 1 comment · Fixed by #82473
Assignees
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@matthewtodd
Copy link
Contributor

matthewtodd commented Apr 5, 2022

Here we expand on the walking skeleton by developing a smarter detection algorithm for outliers, perhaps leaning on existing sql stats infrastructure.

Epic: CRDB-13544

Jira issue: CRDB-14871

@matthewtodd matthewtodd added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-observability A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. labels Apr 5, 2022
@matthewtodd matthewtodd self-assigned this Apr 5, 2022
craig bot pushed a commit that referenced this issue May 4, 2022
80929: authors: add pjtatlow to authors r=rickystewart a=pjtatlow

Release note: None

80975: outliers: extract a detector interface r=matthewtodd a=matthewtodd

This gives us a place to hang the upcoming work in #79451.

Release note: None

Co-authored-by: PJ Tatlow <pj@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
@matthewtodd matthewtodd changed the title Configurable (statistical?) detection for execution outliers outliers: configurable (statistical?) detection May 5, 2022
@matthewtodd matthewtodd added A-sql-observability Related to observability of the SQL layer and removed A-sql-cli-observability Issues related to surfacing SQL observability in SHOW, CRDB_INTERNAL, SYSTEM, etc. labels May 5, 2022
@matthewtodd matthewtodd changed the title outliers: configurable (statistical?) detection outliers: configurable statistical detection May 5, 2022
@matthewtodd matthewtodd changed the title outliers: configurable statistical detection outliers: configurable statistical detection per fingerprint May 5, 2022
craig bot pushed a commit that referenced this issue May 6, 2022
80422: sql: support virtual table indexes with generators r=AlexTalks a=AlexTalks

While previously virtual schema tables had support for defining tables
with a `populate` function, which eagerly loads all rows, or a
`generator` function, which lazy loads each row when called (possibly
running in a worker Goroutine), and also had support for virtual indexes
which would have their own `populate` functions, there was a subtle lack
of support for using virtual indexes with virtual tables that used a
`generator`, since the virtual index constraint logic would fall back to
a (possibly undefined) `populate` function in several cases. This change
fixes the nil pointer exception that could occur if using virtual
indexes with a table using a `generator`, and validates that the virtual
index is supported prior to use.

Release Note: None

Release Justification: Bug fix

80989: outliers: collect statement fingerprint id r=matthewtodd a=matthewtodd

This change also helps set us up for #79451, where we'll be working with
per-fingerprint statement latencies.

Release note: None

81076: Authors: add linville to authors r=davidwding a=mdlinville

Release note: None

81080: opt: calculate lookup join remaining filters more accurately r=mgartner a=mgartner

#### opt: return ordinals of equality filters from memo.ExtractJoinEqualityColumns

This is a prerequisite for future refactoring of
`lookupjoin.ConstraintBuilder`.

Release note: None

#### opt: calculate lookup join remaining filters more accurately

`lookupjoin.ConstraintBuilder` now determines the remaining filters for
a lookup join constraint more accurately by tracking the ordinals of
filters that are covered by the constraint, rather than trying to reduce
the filters original filters based on the key columns, lookup
expression, and constant expression.

Release note: None


81089: vendor: bump panicparse r=erikgrinaker,nvanbenschoten a=tbg

This picks up maruel/panicparse#74.
Without this commit, the `pkg/kv/kvserver/concurrency` datadriven
tests are flaky on Mac M1s since those tests rely on panicparse
to parse strack traces for goroutine monitoring.

Release note: None


Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Matt Linville <linville@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
craig bot pushed a commit that referenced this issue May 16, 2022
80991: backupccl: Cosmetic changes to backup_processor.go to improve clarity. r=benbardin a=benbardin

Release note: None

Renames a few things, and moves the sstsink into its own file.

81141: ui: add TTL metrics dashboard r=rafiss,matthewtodd a=otan

Release note: None

example:
<img width="981" alt="image" src="https://user-images.githubusercontent.com/3646147/167403258-c3d546ab-a6af-4088-a090-87cd0ed8bbe7.png">


81194: storage: remove `Engine.SetUpperBound` r=jbowens a=erikgrinaker

Partial resubmit of #79291, which was merged into the `mvcc-range-tombstones` branch.

---

**storage: make `Engine.ClearIterRange` construct iterator internally**

`Engine.ClearIterRange` now creates the iterator itself, instead of
having it passed in. This ensures the iterator is created with
appropriate bounds and options, which becomes even more important with
the introduction of range keys. It will also allow removal of
`Engine.SetUpperBound` in a subsequent commit.

If further flexibility is needed, we can consider passing an
`IterOptions` struct with appropriate bounds instead of a key span.

Release note: None

**storage: remove `Engine.SetUpperBound`**

This patch removes the `Engine.SetUpperBound` method, as it is no longer
in use.

Release note: None

81195: storage: allow multiple iterators in `pebbleReadOnly` and `pebbleBatch` r=jbowens a=erikgrinaker

Partial resubmit of #79291, which was merged into the mvcc-range-tombstones branch.

---

Previously, `pebbleReadOnly` and `pebbleBatch` only supported a single
concurrent iterator of a given type, as this iterator was reused between
`NewMVCCIterator` calls. Attempts to create an additional iterator would
panic with "iterator already in use".

Time-bound iterators were excluded from this reuse, and always created
from scratch, so they sidestepped this limitation. However, we will soon
make these iterators reusable too via a new `SetOptions()` method in
Pebble. When TBIs also become reusable, this will cause
`MVCCIncrementalIterator` to panic with "iterator already in use",
because it always creates two concurrent iterators: one regular and one
time-bound.

This patch therefore ensures all `Reader` implementations support
multiple concurrent iterators, by cloning the existing cached iterator
if it is already in use. This also preserves the pinned engine state for
the new iterator.

Release note: None

81295: outliers: introduce a composite "any" detector r=matthewtodd a=matthewtodd

This gives us a place to hang the upcoming histogram-based detector
coming in #79451.

Release note: None

Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
@kenliu-crl
Copy link
Contributor

manually reviewed and brought up to date

craig bot pushed a commit that referenced this issue May 31, 2022
81676: changefeedccl: Predicates and projections in CDC. r=miretskiy a=miretskiy

Introduce `cdceval` package -- a library for expression evaluation
for CDC.

Changefeed users for a long time requested ability to emit only a
subset of columns. They have also requested ability to filter
out unwanted events (for example, filter out deletions).

This library aims to accomplish those goals.  However, instead of
focusing on a narrow use cases, which would usually be addressed via
addition of new `WITH` option (as done in
#80499), this library
aims to provide support for general expression evaluation.

`cdceval` library provides the following functionality:
  * Ability to evaluate predicates (filters) so that events may be
    filtered.
  * Ability to evaluate projection expressions (`select *`, `select a,
    b,c`, or even `select a + b - c as math_column`)
  * Ability to evaluate virtual compute columns (currently not
    implemented in this PR).

`cdceval` library reuses existing parsing and evaluation libraries, but
adopts them for CDC use case.  CDC events are row level events, and as
such, CDC expressions only make sense in the context of a single
row/single table.  In addition, because CDC events are at least once
semantics, the emitted events must not depend on volatile state.
In summary, any expression is supported except:
  * Volatile functions -- not supported
  * Stable functions, such as `now()`, `current_timestamp()`, etc are
    modified so that they return stable values -- namely events MVCC
    timestamp.
  * Multi row functions (aggregates, windowing functions) are
    disallowed.

`cdceval` also defined few custom, CDC specific functions, such as:
  * `cdc_prev()`: Returns the previous row values as a JSONB object.
  * `cdc_is_delete()`: Returns true if the row was deleted.
  * Others -- see `functions.go`

The follow PRs will add a "front end" to this library to enable creation
and management of predicated changefeeds.

Release Notes: None

82146: util/quantile: import quantile library r=matthewtodd a=matthewtodd

For upcoming outliers work in #79451, we'll re-use the biased streaming
quantiles implementation underlying the prometheus client library's
[summary][1] type.

But in order to monitor and possibly constrain our memory usage, we'll
need a way to measure the size of each `quantile.Stream`. That
functionality is not available upstream, and contributions are
explicitly [not being accepted][2] (and the [original upstream][3], from
bmizerany, lacks further functionality and is similarly inactive), so we
vendor the library here, unmodified from its [v1.0.1][4], in advance of
adding the methods we need.

[1]: https://prometheus.io/docs/practices/histograms/
[2]: beorn7/perks#5 (comment)
[3]: https://github.com/bmizerany/perks
[4]: https://github.com/beorn7/perks/tree/v1.0.1

Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 5, 2022
82473: outliers: latency quantile detector r=matthewtodd a=matthewtodd

Closes #79451.

Note that this detector comes disabled by default. To enable, use the
`sql.stats.outliers.experimental.latency_quantile_detection.enabled`
cluster setting.

This new detection "algorithm" is a simple, perhaps naïve, heuristic,
though it may be good enough to be useful: we consider a statement an
outlier if its execution latency is > p99 latency for its fingerprint,
so long as it's also greater than twice the median latency and greater
than an arbitrary user-defined threshold.[^1]

We expect the operation of this detector to be memory constrained. To
avoid OOM errors, we offer a cluster setting[^2] to cap the overall
memory used, along with metrics reporting the number of fingerprints
being tracked, memory usage, and "evictions," or fingerprint histories
being dropped due to memory pressure. If a high rate of evictions is
observed, it will be necessary to raise either one or both of the memory
limit and the "interesting" threshold settings to regain a steady state.

Note also that this detector is not designed to be used concurrently by
multiple goroutines. Correctness currently relies on the global lock in
`outliers.Registry`, to be addressed next in #81021.

[^1]: This threshold is defined by a cluster setting,
`sql.stats.outliers.experimental.latency_quantile_detection.interesting_threshold`.
Without such a threshold, we would otherwise flag a statement that, say,
normally executes in 3ms taking 7ms; but 7ms is probably still fast
enough that there's nothing to be done.

[^2]: `sql.stats.outliers.experimental.latency_quantile_detection.memory_limit`

Release note: None

82741: obsservice: ui handler serves db console assets  r=sjbarag,abarganier a=dhartunian

This change enables the Observabilty Service to serve the DB Console UI
itself. Endpoints that CRDB can handle are proxied (`/_admin/`,
`/_status/`, `/ts/`) but the base URL will return a blank HTML page with
JS assets that load the DB Console.

In order to build with the ui code, the `--config=with_ui` flag must be
passed to bazel.

This commit also adds a shortcut to the `dev` command to build
`obsservice` which includes the `--config=with_ui` flag just as we do by
default in `cockroach` builds.

Release note: None

83717: changefeedccl: infer WITH diff from changefeed expression r=[miretskiy] a=HonoreDB

Informs #83322. Specifically this does the "infer whether
or not to set diff" part of it.

This went through a more ambitious phase
(https://github.com/cockroachdb/cockroach/compare/master...HonoreDB:cockroach:cdc_expressions_infer_diff?expand=1)
where it also handled user defined types, but I ended up
agreeing with wiser heads that this was adding too much complexity
for no known use case.

I also came really close to setting envelope=row as the
default, but envelope=row is not supported in all sinks
so again, it felt too messy to do without a little further
conversation.

Release note (sql change): CREATE CHANGEFEED AS statements no longer need to include WITH DIFF when using cdc_prev().

83772: storage: remove experimental markers for MVCC range tombstones r=aliher1911 a=erikgrinaker

GC is still under development, but I think it's fine to remove the marker before GC works to let people start building with it.

---

This removes the experimental markers for MVCC range tombstones, by
renaming functions, methods, and parameters to not include
`experimental`, as well as warning comments.

The changes are entirely mechanical.

Release note: None

83784: scexec: refactor schema changer job update r=postamar a=postamar

This PR rebases #83724 on top of a few commits which refactor the schema changer job update machinery.


Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Marius Posta <marius@cockroachlabs.com>
@craig craig bot closed this as completed in 68aa3a4 Jul 5, 2022
surajr10 pushed a commit to surajr10/cockroach that referenced this issue Jul 12, 2022
Closes cockroachdb#79451.

Note that this detector comes disabled by default. To enable, use the
`sql.stats.outliers.experimental.latency_quantile_detection.enabled`
cluster setting.

This new detection "algorithm" is a simple, perhaps naïve, heuristic,
though it may be good enough to be useful: we consider a statement an
outlier if its execution latency is > p99 latency for its fingerprint,
so long as it's also greater than twice the median latency and greater
than an arbitrary user-defined threshold.[^1]

We expect the operation of this detector to be memory constrained. To
avoid OOM errors, we offer a cluster setting[^2] to cap the overall
memory used, along with metrics reporting the number of fingerprints
being tracked, memory usage, and "evictions," or fingerprint histories
being dropped due to memory pressure. If a high rate of evictions is
observed, it will be necessary to raise either one or both of the memory
limit and the "interesting" threshold settings to regain a steady state.

Note also that this detector is not designed to be used concurrently by
multiple goroutines. Correctness currently relies on the global lock in
`outliers.Registry`, to be addressed next in cockroachdb#81021.

[^1]: This threshold is defined by a cluster setting,
`sql.stats.outliers.experimental.latency_quantile_detection.interesting_threshold`.
Without such a threshold, we would otherwise flag a statement that, say,
normally executes in 3ms taking 7ms; but 7ms is probably still fast
enough that there's nothing to be done.

[^2]: `sql.stats.outliers.experimental.latency_quantile_detection.memory_limit`

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants