
T-Digest + Varopt sampling proof of concept #18

Draft
wants to merge 42 commits into main
Conversation

jmacd (Contributor) commented Aug 12, 2021

T-digest can compute a digest from a set of weighted input points.
From the digest, we can estimate the weight of an unweighted input point.
Varopt produces a small set of weighted input points from a large set of weighted input points.

Take these properties together, and we have a potential feedback loop:

  1. Start with the expectation of a uniform distribution; initially all points have identical weights.
  2. Use varopt to compute a set of weighted points from a stream of observations.
  3. Use the set of weighted points to calculate a T-digest.
  4. Feed the T-digest back into step (1) using the inverse weight function.

The use of the inverse weight function leaves a single free parameter: how much weight to assign to observations outside the previous digest's range. This code assigns points that lie outside the previous range a probability equal to half the probability of the adjacent extreme bucket.
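The feedback step can be sketched as follows. This is a minimal illustration, not the code in this branch: the `Centroid` type, the nearest-centroid lookup for interior points, and all names are assumptions; only the half-probability rule for out-of-range points is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// Centroid mirrors a t-digest bucket: a mean and an accumulated weight.
// (Hypothetical type for illustration; not the type used in this branch.)
type Centroid struct {
	Mean   float64
	Weight float64
}

// estimateProb returns the estimated probability mass near value.
// Points outside the digest's range get half the probability of the
// adjacent extreme bucket, as described above. Interior points use the
// nearest centroid at or above value for simplicity; real t-digest code
// would interpolate between neighboring centroids.
func estimateProb(digest []Centroid, sumw, value float64) float64 {
	if value <= digest[0].Mean {
		return digest[0].Weight / (2 * sumw)
	}
	if value >= digest[len(digest)-1].Mean {
		return digest[len(digest)-1].Weight / (2 * sumw)
	}
	i := sort.Search(len(digest), func(i int) bool {
		return digest[i].Mean >= value
	})
	return digest[i].Weight / sumw
}

// inverseWeight is step (4): rare values receive large sampling weights,
// so varopt retains them with higher probability.
func inverseWeight(digest []Centroid, sumw, value float64) float64 {
	return 1 / estimateProb(digest, sumw, value)
}

func main() {
	digest := []Centroid{{Mean: 1, Weight: 10}, {Mean: 5, Weight: 80}, {Mean: 9, Weight: 10}}
	sumw := 100.0
	fmt.Printf("%.2f\n", inverseWeight(digest, sumw, 0.5)) // outside the range: rare, large weight
	fmt.Printf("%.2f\n", inverseWeight(digest, sumw, 5))   // common value: small weight
}
```

Under this sketch, a value below the lowest centroid gets probability 10/(2·100) = 0.05 and sampling weight 20, while a value at the heavy middle centroid gets probability 0.8 and weight 1.25, so varopt preferentially keeps the tails.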

jmacd and others added 30 commits November 3, 2019 09:22
Check for NaN values; return error instead of panicking
jmacd (Contributor, Author) commented Aug 12, 2021

@oertl
I ❤️ T-digest.

if value <= digest[0].Mean {
	return digest[0].Weight / (2 * sumw)
}
if value >= digest[len(digest)-1].Mean {
	// Body truncated in the diff view; completed here by symmetry
	// with the branch above.
	return digest[len(digest)-1].Weight / (2 * sumw)
}
jmacd (Contributor, Author) commented:

The challenge in this code is to estimate the density of buckets outside the range that was covered in the prior window. Here I make the extreme buckets have half the density of their neighbor, which is a bit arbitrary.

The idea is that in order to use inverse-frequency weighted sampling, you need an estimate for what you haven't seen before. For a numerical distribution, the approach here seems to work but isn't perfect.
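As a worked example of why some estimate for unseen values is needed (toy numbers, not from this change):

```go
package main

import "fmt"

func main() {
	// Toy digest: the extreme bucket holds weight 10 of a total 100,
	// so its own probability is 10/100 = 0.10.
	extremeWeight, sumw := 10.0, 100.0

	// Values beyond the covered range get half that probability.
	outside := extremeWeight / (2 * sumw)
	fmt.Printf("%.2f\n", outside) // 0.05

	// A zero estimate for unseen values would make the inverse
	// weight 1/0 diverge; any finite guess keeps the loop usable.
	fmt.Printf("%.0f\n", 1/outside) // 20
}
```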

For a categorical distribution, I've looked into using a non-parametric estimate based on the theory of species-diversity estimation (see here), which derives academically from Good-Turing frequency estimation. This is a curiosity of mine.

jmacd (Contributor, Author) commented Nov 27, 2023

The main branch has been rebased, so this branch can no longer be merged and is kept for reference only. Still useful.
