T-Digest + Varopt sampling proof of concept #18
Update the README
Clean up the godoc
Add circle config
Move simple into subpackage
Restore the simple test
Add simple/doc.go
Rephrase and fix typos
Add varopt benchmarks
Check for NaN values; return error instead of panicking
Inline the large-weight heap to avoid interface conversions
Memory optimization support
Pre-allocate main buffers
Remove a test-only method
Update circle Go version
Simplify circleci
Mod update
@oertl
if value <= digest[0].Mean {
	return digest[0].Weight / (2 * sumw)
}
if value >= digest[len(digest)-1].Mean {
The challenge in this code is estimating the density of values that fall outside the range covered by the prior window. Here I give such values half the density of the adjacent extreme bucket, which is somewhat arbitrary.
The idea is that in order to use inverse-frequency weighted sampling, you need an estimate for what you haven't seen before. For a numerical distribution, the approach here seems to work but isn't perfect.
For a categorical distribution, I've looked into using a non-parametric estimate based on the theory of species-diversity estimation (see here), which derives academically from Good-Turing frequency estimation. This is a curiosity of mine.
The main branch has been rebased, so this PR can only be used for reference. It is still useful.
T-digest can compute a digest from a set of weighted input points.
From the digest, we can estimate the weight of an unweighted input point.
Varopt produces a small set of weighted input points from a large set of weighted input points.
Take these properties together, and we have a potential feedback loop:
Using an inverse-weight function leaves a single parameter: how much weight to assign to observations outside the previous digest's range. This code assigns points outside the previous range a probability equal to half that of the adjacent extreme bucket.