Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add runtime dataset fetch and parse in-place #186

Merged
merged 14 commits into from
Feb 9, 2022

Conversation

saik0
Copy link
Contributor

@saik0 saik0 commented Feb 9, 2022

Closes #129
Closes #171
Closes #185

Here's my go at fetching the datasets at runtime

  • Datasets are lazily fetched the first time they're needed (or updated, if local HEAD != origin/master).
  • The zip files are parsed-in place on every benchmark run, to keep the on-disk size down.
  • The parsing is also lazy, and happens at most once.
  • This PR updates any benchmarks that were already using limited data from wikileaks-noquotes to use all the datasets.
  • A fast follow PR will update all the benchmarks.

@Kerollmops Third times the charm?

@saik0 saik0 force-pushed the bench-datasets-fetch-runtime branch from dffba38 to 7461720 Compare February 9, 2022 08:03
Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much, that is quite good! Even the little indicatif progress bar ❤️

benchmarks/.gitignore Outdated Show resolved Hide resolved
benchmarks/benches/datasets.rs Show resolved Hide resolved
benchmarks/benches/datasets.rs Show resolved Hide resolved
benchmarks/benches/datasets.rs Show resolved Hide resolved
benchmarks/benches/datasets.rs Show resolved Hide resolved
benchmarks/build.rs Show resolved Hide resolved
benchmarks/Cargo.toml Outdated Show resolved Hide resolved
benchmarks/benches/lib.rs Outdated Show resolved Hide resolved
@saik0
Copy link
Contributor Author

saik0 commented Feb 9, 2022

CI is failing in my branch from git. Investigating.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ready for review? Do you think I can't take a last look at your PR?

.github/workflows/test.yml Show resolved Hide resolved
@saik0
Copy link
Contributor Author

saik0 commented Feb 9, 2022

Is it ready for review? Do you think I can't take a last look at your PR?

If CI is ✅ then yes.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your PR looks good to me!

.github/workflows/test.yml Show resolved Hide resolved
.github/workflows/test.yml Outdated Show resolved Hide resolved
benchmarks/Cargo.toml Show resolved Hide resolved
benchmarks/benches/datasets.rs Outdated Show resolved Hide resolved
saik0 and others added 2 commits February 9, 2022 06:30
Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so ready to be merged, thank you again for your work and the time spent on this!
bors merge

@bors
Copy link
Contributor

bors bot commented Feb 9, 2022

Build succeeded:

@bors bors bot merged commit 31ed4ca into RoaringBitmap:master Feb 9, 2022
not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
186: add runtime dataset fetch and parse in-place r=Kerollmops a=saik0

Closes RoaringBitmap#129
Closes RoaringBitmap#171
Closes RoaringBitmap#185

Here's my go at fetching the datasets at runtime

 * Datasets are lazily fetched the first time they're needed (or updated, if local `HEAD != origin/master`).
 * The zip files are parsed-in place on every benchmark run, to keep the on-disk size down.
 * The parsing is also lazy, and happens at most once.
 * This PR updates any benchmarks that were already using limited data from `wikileaks-noquotes` to use all the datasets.
 * A fast follow PR will update all the benchmarks.

`@Kerollmops` Third times the charm?

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sample data for benchmarking
2 participants