Add docs for reproducing sample from BigQuery #700
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #700 +/- ##
=======================================
Coverage 71.24% 71.24%
=======================================
Files 40 40
Lines 1749 1749
Branches 259 255 -4
=======================================
Hits 1246 1246
Misses 503 503
Co-authored-by: RickardZwahlen <rickard.zwahlen@gmail.com>
@@ -85,6 +85,16 @@ Leveraging `--fields=<field1,field2,...>` BigSampler can produce a hash based on
are in the sample. For example, `--fields=user_id --sample=0.5` will always produce the same sample
of 50% of users. If multiple records contain the same `user_id` they will all be in or out of the
sample.
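
To illustrate the property described above (records sharing a field value are all in or all out of the sample), here is a minimal sketch of hash-based sampling. It is not BigSampler's implementation: the function name `in_sample`, the field separator, and the use of SHA-256 from the Python standard library in place of FarmHash are all assumptions made so the example runs without extra dependencies.

```python
import hashlib


def in_sample(field_values, sample_rate, seed=None):
    """Return True if a record with these field values falls in the sample.

    Hypothetical helper: SHA-256 stands in for FarmHash here, so the buckets
    will NOT match what BigSampler or BigQuery produce.
    """
    # Build the hash input from the optional seed and the chosen field values.
    parts = ([] if seed is None else [str(seed)]) + [str(v) for v in field_values]
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned 64-bit bucket and keep the
    # record when the bucket falls below the requested fraction of the range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


records = [
    {"user_id": "a", "event": "click"},
    {"user_id": "a", "event": "view"},
    {"user_id": "b", "event": "click"},
]
# Sampling on user_id: both records for user "a" are kept or dropped together.
sampled = [r for r in records if in_sample([r["user_id"]], 0.5)]
```

Because the decision depends only on the seed and the selected field values, rerunning the filter over the same data yields the same sample, which is what makes it reproducible in another system that applies the same hash.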

### Reproducing within BigQuery
Currently, BigSampler defaults to Farmhash, which is also used in BigQuery. When sampling with a seed and one or more fields,
Could you link to FarmHash? Also, it looks like the FarmHash repo has been archived. Since it's used in BigQuery, that shouldn't be much of an issue, but should we explore an alternative hashing algorithm in the future?
I think if we do, it would be for the distribution work in #699, or if FarmHash stops working. ATM it's still supported through BQ, so I'm not super worried.
added a link earlier
No description provided.