Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark datasets #171

Closed
wants to merge 2 commits into from
Closed

Conversation

saik0
Copy link
Contributor

@saik0 saik0 commented Jan 31, 2022

Closes #129

Adds CRoaring benchmark datasets. File contents are zstd compressed serialized bitmaps using a shared dictionary. All together adds about 18 MiB.

Utilizing the datasets is out of scope for this PR.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't the initial plan to create a submodule of the real-roaring-dataset repository and to let the user submodule init/update by himself?

@saik0
Copy link
Contributor Author

saik0 commented Feb 1, 2022

Wasn't the initial plan to create a submodule of the real-roaring-dataset repository and to let the user submodule init/update by himself?

Correct. This avoid a manual step.

Is there some other concern regarding adding these? Would it bloat the size of the published crate?

@Kerollmops
Copy link
Member

Correct. This avoid a manual step.

Yup, but cloning the repository is a little issue now, 18Mo of dataset is a lot. Also, I'm not sure that we want to make git care about them. I would prefer that we depend on the official repo, even with a specific revision.

Is there some other concern regarding adding these? Would it bloat the size of the published crate?

Yup, the other concern is about the size of the repository itself, not the crate as you put the datasets in the benchmark subcrate. I hope that cargo doesn't push useless folders!

@saik0
Copy link
Contributor Author

saik0 commented Feb 1, 2022

Yup, the other concern is about the size of the repository itself

The difference between this PR and all the zips is rather large ~18 MB compared to ~95. zstd is ✨ magic
This size reduction was what led me to just include them in the benchmark crate.

I hope that cargo doesn't push useless folders!

I have verified, it does not.

My main concern is admittedly a subjective one. It just feels icky 🤢 to introduce a manual fetch step to run benchmarks.

@Kerollmops
Copy link
Member

Kerollmops commented Feb 1, 2022

My main concern is admittedly a subjective one. It just feels icky 🤢 to introduce a manual fetch step to run benchmarks.

Maybe we can git submodule init/update by ourselves in the build.rs. I just checked the size of the repository, not even compressed, it's 628K, adding 18M to that is a lot. The other downside I see is that the user needs to install zstd to be able to run the benchmarks.

I am sure that we can find a solution to this by automating the clone in the build.rs script or something. It could even just be a call to the git submodule init/update Command. What do you think?

It looks to be that easy:

https://github.com/rust-lang/git2-rs/blob/c55bd6dbdba52f90788150180ef124ef6c90daa6/libgit2-sys/build.rs#L27-L31

@saik0
Copy link
Contributor Author

saik0 commented Feb 1, 2022

The other downside I see is that the user needs to install zstd to be able to run the benchmarks.

This uses the zstd create under the hood, they are decompressed in the benchmark code.

I just checked the size of the repository, not even compressed, it's 628K, adding 18M to that is a lot.

Valid.

I'm going to close this and explore using git binary through Command and also check to see if there's some other magic cargo can do. It might make sense to have a crate of datasets.

@Kerollmops
Copy link
Member

Kerollmops commented Feb 8, 2022

I'm going to close this and explore using git binary through Command and also check to see if there's some other magic cargo can do. It might make sense to have a crate of datasets.

BTW, I don't think that creating a crate to store the benchmarks is a great idea: rust-lang/crates.io#195. The crate size limit is around 10 MB.

bors bot added a commit that referenced this pull request Feb 9, 2022
186: add runtime dataset fetch and parse in-place r=Kerollmops a=saik0

Closes #129
Closes #171
Closes #185

Here's my go at fetching the datasets at runtime

 * Datasets are lazily fetched the first time they're needed (or updated, if local `HEAD != origin/master`).
 * The zip files are parsed-in place on every benchmark run, to keep the on-disk size down.
 * The parsing is also lazy, and happens at most once.
 * This PR updates any benchmarks that were already using limited data from `wikileaks-noquotes` to use all the datasets.
 * A fast follow PR will update all the benchmarks.

`@Kerollmops` Third times the charm?

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
not-jan pushed a commit to not-jan/roaring-rs that referenced this pull request Aug 31, 2022
186: add runtime dataset fetch and parse in-place r=Kerollmops a=saik0

Closes RoaringBitmap#129
Closes RoaringBitmap#171
Closes RoaringBitmap#185

Here's my go at fetching the datasets at runtime

 * Datasets are lazily fetched the first time they're needed (or updated, if local `HEAD != origin/master`).
 * The zip files are parsed-in place on every benchmark run, to keep the on-disk size down.
 * The parsing is also lazy, and happens at most once.
 * This PR updates any benchmarks that were already using limited data from `wikileaks-noquotes` to use all the datasets.
 * A fast follow PR will update all the benchmarks.

`@Kerollmops` Third times the charm?

Co-authored-by: saik0 <github@saik0.net>
Co-authored-by: Joel Pedraza <github@saik0.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sample data for benchmarking
2 participants