Add docs for reproducing sample from BigQuery #700

Merged: 5 commits, Feb 9, 2024

idreeskhan
Contributor

No description provided.

codecov bot commented Feb 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ae43f56) 71.24% compared to head (bae8a41) 71.24%.
Report is 11 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #700   +/-   ##
=======================================
  Coverage   71.24%   71.24%           
=======================================
  Files          40       40           
  Lines        1749     1749           
  Branches      259      255    -4     
=======================================
  Hits         1246     1246           
  Misses        503      503           
Flag Coverage Δ
ratatoolCli 2.99% <ø> (ø)
ratatoolCommon ∅ <ø> (∅)
ratatoolDiffy 31.42% <ø> (ø)
ratatoolExamples 15.91% <ø> (ø)
ratatoolSampling 62.26% <ø> (ø)
ratatoolScalacheck 81.71% <ø> (ø)
ratatoolShapeless 4.31% <ø> (ø)


Co-authored-by: RickardZwahlen <rickard.zwahlen@gmail.com>
@@ -85,6 +85,16 @@ Leveraging `--fields=<field1,field2,...>` BigSampler can produce a hash based on
are in the sample. For example, `--fields=user_id --sample=0.5` will always produce the same sample
of 50% of users. If multiple records contain the same `user_id` they will all be in or out of the
sample.
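
The field-based behaviour described above can be illustrated with a minimal sketch. It is not BigSampler's implementation: it uses SHA-256 from Python's standard library as a stand-in for FarmHash, and the function name `in_sample`, the `seed` parameter handling, and the bucketing scheme are assumptions made purely for illustration.

```python
import hashlib


def in_sample(field_values, fraction, seed=None):
    """Decide deterministically whether a record belongs to the sample.

    Hypothetical helper: hashes the chosen field values (plus an optional
    seed) and keeps the record if the hash falls below `fraction`.
    SHA-256 stands in for FarmHash here.
    """
    h = hashlib.sha256()
    if seed is not None:
        h.update(str(seed).encode("utf-8"))
        h.update(b"\x00")
    for value in field_values:
        h.update(str(value).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    # Use the top 53 bits of the digest so the result is an exact float in [0, 1).
    bucket = (int.from_bytes(h.digest()[:8], "big") >> 11) / 2**53
    return bucket < fraction


records = [
    {"user_id": "u1", "event": "play"},
    {"user_id": "u2", "event": "play"},
    {"user_id": "u1", "event": "pause"},
]

# Every record sharing a user_id gets the same decision, on every run.
sample = [r for r in records if in_sample([r["user_id"]], 0.5)]
```

Because the decision depends only on the hashed field values (and the seed, if given), rerunning the filter yields the same sample, and records with the same `user_id` are kept or dropped together.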

### Reproducing within BigQuery
Currently, BigSampler defaults to Farmhash, which is also used in BigQuery. When sampling with a seed and one or more fields,
Contributor

Could you link to FarmHash? Also, it looks like the FarmHash repo has been archived. Since it's used in BigQuery, that shouldn't be much of an issue, but should we explore an alternative hashing algorithm in the future?

Contributor Author

I think if we do, it would be for the distribution work in #699, or if FarmHash stops working. ATM it's still supported through BQ, so I'm not super worried.

Contributor Author

Added a link earlier.

@idreeskhan idreeskhan merged commit f975332 into master Feb 9, 2024
1 check passed
@idreeskhan idreeskhan deleted the idrees/bq-sample-docs branch February 9, 2024 18:09