T-Digest + Varopt sampling proof of concept #18

jmacd · 2021-08-12T15:58:01Z

T-digest can compute a digest from a set of weighted input points.
From the digest, we can estimate the weight of an unweighted input point.
Varopt produces a small set of weighted input points from a large set of weighted input points.

Take these properties together, and we have a potential feedback loop:

1. Start with the expectation of a uniform distribution; initially all points have identical weights.
2. Use varopt to compute a set of weighted points from a stream of observations.
3. Use the set of weighted points to calculate a T-digest.
4. Feed the T-digest back in at step (1) using the inverse weight function.
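
To make the second step of the loop concrete, here is a minimal sketch of weighted sampling that returns adjusted weights. It does not use varopt: it swaps in priority sampling, a simpler scheme, so that the example needs only the standard library. The names `Sampled`, `prioritySample`, and `sampleDemo`, and the demo weights, are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// Sampled is one retained observation (hypothetical type for this sketch).
type Sampled struct {
	Value    float64
	Adjusted float64 // max(input weight, threshold)
	Count    float64 // Adjusted / input weight: observations this sample stands for
}

// prioritySample keeps the k items with the highest priority weight/u,
// where u is uniform in (0, 1]. This is priority sampling, a stand-in
// for varopt. All weights are assumed to be positive and finite.
func prioritySample(values, weights []float64, k int, rnd *rand.Rand) []Sampled {
	type entry struct{ value, weight, priority float64 }
	entries := make([]entry, len(values))
	for i, v := range values {
		u := 1 - rnd.Float64() // Float64 is in [0, 1), so u is in (0, 1]
		entries[i] = entry{v, weights[i], weights[i] / u}
	}
	sort.Slice(entries, func(a, b int) bool {
		return entries[a].priority > entries[b].priority
	})
	tau := 0.0 // threshold: the (k+1)-th highest priority, if there is one
	if len(entries) > k {
		tau = entries[k].priority
		entries = entries[:k]
	}
	out := make([]Sampled, len(entries))
	for i, e := range entries {
		adj := math.Max(e.weight, tau)
		out[i] = Sampled{e.value, adj, adj / e.weight}
	}
	return out
}

// sampleDemo samples 10 of 100 points; values >= 90 get 10x the weight,
// as if the previous digest had rated that region as rare.
func sampleDemo() []Sampled {
	values := make([]float64, 100)
	weights := make([]float64, 100)
	for i := range values {
		values[i] = float64(i)
		weights[i] = 1
		if i >= 90 {
			weights[i] = 10
		}
	}
	return prioritySample(values, weights, 10, rand.New(rand.NewSource(1)))
}

func main() {
	for _, s := range sampleDemo() {
		fmt.Printf("value=%.0f adjusted=%.2f count=%.2f\n", s.Value, s.Adjusted, s.Count)
	}
}
```

The `Count` field is what would feed the digest in step 3 under this sketch: a point sampled with a large inverse-frequency weight is more likely to be kept but stands for fewer original observations.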

Using the inverse weight function leaves a single parameter: how much weight to assign to observations outside the previous digest's range. This code assigns points that lie outside the previous range a probability equal to half the probability of the adjacent extreme bucket.

Update the README

Clean up the godoc

Add circle config

Move simple into subpackage

Restore the simple test

Add simple/doc.go

Rephrase and fix typos

Add varopt benchmarks

Check for NaN values; return error instead of panicking

Inline the large-weight heap to avoid interface conversions

Memory optimization support

Pre-allocate main buffers

* Remove a test-only method
* update circle Go version
* simplify circleci
* mod update

jmacd · 2021-08-12T15:58:27Z

@oertl
I ❤️ T-digest.

jmacd · 2021-08-12T16:08:28Z

examples/tdigest/flat.go

+	if value <= digest[0].Mean {
+		return digest[0].Weight / (2 * sumw)
+	}
+	if value >= digest[len(digest)-1].Mean {


The challenge in this code is to estimate the density of buckets outside the range that was covered in the prior window. Here I make the extreme buckets have half the density of their neighbor, which is a bit arbitrary.

The idea is that in order to use inverse-frequency weighted sampling, you need an estimate for what you haven't seen before. For a numerical distribution, the approach here seems to work but isn't perfect.
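
As a rough illustration of the edge rule discussed here, the sketch below turns a digest into an inverse-frequency weight. The lower-edge branch follows the diff excerpt above; the upper-edge branch is assumed to mirror it, and the linear interpolation between interior centroids is purely an assumption made for this example, not necessarily what `flat.go` does. The names `Centroid`, `probability`, and `inverseWeight` are hypothetical.

```go
package main

import "fmt"

// Centroid carries the two fields the diff excerpt reads from each digest entry.
type Centroid struct {
	Mean   float64
	Weight float64
}

// probability estimates the share of total weight near value.
// digest must be non-empty and sorted by Mean.
func probability(digest []Centroid, value float64) float64 {
	sumw := 0.0
	for _, c := range digest {
		sumw += c.Weight
	}
	last := len(digest) - 1
	if value <= digest[0].Mean {
		// At or below the observed range: half the extreme bucket's share.
		return digest[0].Weight / (2 * sumw)
	}
	if value >= digest[last].Mean {
		// Assumed to mirror the lower edge.
		return digest[last].Weight / (2 * sumw)
	}
	// Interior (assumption): interpolate linearly between the two neighbors.
	for i := 0; i < last; i++ {
		lo, hi := digest[i], digest[i+1]
		if value < hi.Mean {
			frac := (value - lo.Mean) / (hi.Mean - lo.Mean)
			return (lo.Weight + frac*(hi.Weight-lo.Weight)) / sumw
		}
	}
	return digest[last].Weight / (2 * sumw) // not reached for sorted input
}

// inverseWeight is the sampling weight for a new observation. With no
// prior digest every point gets the same weight (the uniform start).
func inverseWeight(digest []Centroid, value float64) float64 {
	if len(digest) == 0 {
		return 1
	}
	return 1 / probability(digest, value)
}

func main() {
	digest := []Centroid{{1, 2}, {2, 4}, {3, 2}}
	for _, v := range []float64{0, 1.5, 2, 5} {
		fmt.Printf("value=%v weight=%v\n", v, inverseWeight(digest, v))
	}
}
```

Because the edge estimate is never zero, the weight stays finite for values that were never seen before, which is the reason an out-of-range estimate is needed at all.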

For a categorical distribution, I've looked into using a non-parametric estimate based on the theory of species-diversity estimation (see here), which derives academically from Good-Turing frequency estimation. This is a curiosity of mine.
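
For readers unfamiliar with it, the simplest Good-Turing result estimates the total probability of all categories not yet seen as N1/N, where N1 is the number of categories observed exactly once and N is the total number of observations. The sketch below shows only that basic estimate; it is not the species-diversity estimator mentioned above, and the function name `unseenMass` is hypothetical.

```go
package main

import "fmt"

// unseenMass returns the basic Good-Turing estimate N1/N of the
// probability that the next observation belongs to a category that
// has not been seen yet. With no observations it returns 1, which is
// a convention chosen for this sketch.
func unseenMass(counts map[string]int) float64 {
	total, singletons := 0, 0
	for _, c := range counts {
		total += c
		if c == 1 {
			singletons++
		}
	}
	if total == 0 {
		return 1
	}
	return float64(singletons) / float64(total)
}

func main() {
	counts := map[string]int{"a": 2, "b": 1, "c": 1}
	fmt.Println(unseenMass(counts))
}
```

An estimate like this plays the same role for categories that the half-density edge buckets play for numerical values: it gives never-seen inputs a nonzero probability, so their inverse-frequency weight stays finite.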

jmacd · 2023-11-27T17:01:40Z

The main branch has been rebased, so this PR can't be used except for reference. It is still useful.

jmacd and others added 30 commits November 3, 2019 09:22

Add go module, add simple sampler helper

c9d4254

NewVaropt->New

865dd6a

Add testable examples

3dfa2d1

Update README with examples and godoc link

2be048e

Update

08e548b

Merge pull request #1 from lightstep/jmacd/update_readme

0f1df4e

Update the README

Use a deterministic random source; name examples for godoc

43a69f4

Add comments

68f1fab

Merge pull request #2 from lightstep/jmacd/godocex

bef2465

Clean up the godoc

Add circle config

5c2a41c

Merge pull request #3 from lightstep/jmacd/add_circleci

d93f23f

Add circle config

Move simple into subpackage

cd73568

Merge pull request #6 from lightstep/jmacd/move_simple

8fa9c70

Move simple into subpackage

Restore the simple test

2afca79

Merge pull request #7 from lightstep/jmacd/typos_missing_files

08dc129

Restore the simple test

Add simple/doc.go

7ec92c7

Merge pull request #8 from lightstep/jmacd/more_docs

58f77e4

Add simple/doc.go

Rephrase and fix typos

4b69751

Merge pull request #9 from lightstep/jmacd/more_typos

9305cbe

Rephrase and fix typos

Add a benchmark

4062031

Add benchmark notes

ba028d8

Merge pull request #10 from lightstep/jmacd/benchmark

f865a35

Add varopt benchmarks

Check for NaN weight

db2f575

Return an error, don't panic

5cd2650

Inline the large-weight heap to avoid interface conversions

b92f087

Add a test

b00b2fa

Merge upstream

cd12c03

Merge pull request #12 from lightstep/jmacd/nan_check

358db24

Check for NaN values; return error instead of panicking

Merge upstream

b0dd5c7

Move heap code to internal for better testing

f53b1a8

jmacd and others added 11 commits November 15, 2019 22:04

Merge pull request #11 from lightstep/jmacd/inline_heap

879f6b8

Inline the large-weight heap to avoid interface conversions

Add a Reset method

4b2b9c3

Return the ejected sample from Add

d2bfbc8

Typo

be3c37a

Merge pull request #13 from lightstep/jmacd/optimize

fcf92a6

Memory optimization support

Pre-allocate main buffers

30965f6

Merge pull request #14 from lightstep/jmacd/pre_alloc

3a82c69

Pre-allocate main buffers

Check for infinity, it creates trouble (#15)

8de72e7

Use Apache Software Licence v2 (#17)

69a5c9f

Remove a test-only method (#16)

cc90130

* Remove a test-only method
* update circle Go version
* simplify circleci
* mod update

t-digest example

f3f0808

jmacd mentioned this pull request Aug 12, 2021

Metric Exemplars SDK Specification open-telemetry/opentelemetry-specification#1828

Merged

jmacd force-pushed the main branch from cc90130 to ab68206 Compare November 27, 2023 16:52

missed file

30ae4c3
