Use ArrayBackedSet to replace std::set for index in segment #47
Conversation
… how ids_in_segment gets populated
…e added to the added_var_ids vector so as to improve testing coverage
…ve redundant comment
…and_preload flow to directly write str_value of entry into the hash map
…assigned in preload flow
Thanks for this. One potential issue I see is that this assumes we compress one file at a time; this is true in the current code, but is not enforced. For instance, we could call Archive::create_file twice, then encode two different files at the same time:
file-1's first few logtype IDs and variables will end up in […].

One potential way to fix this is to enforce that only one encoded file can be created at a time. Specifically, Archive::create_file could stop returning a pointer; instead, the last created file would be maintained as a private member. All the corresponding Archive methods which take a file would no longer take a file; they would use the member instead.
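A minimal sketch of the interface change being proposed, assuming hypothetical names (`write_msg`, `close_file`, the nested `File`) since the actual CLP signatures aren't shown in this thread:

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical sketch: Archive no longer hands out file pointers;
// it tracks the single file currently being encoded.
class Archive {
public:
    // Instead of returning a writer::File*, create_file records the
    // new file as the one-and-only mutable file.
    void create_file (const std::string& path) {
        if (nullptr != m_file) {
            throw std::runtime_error("An encoded file is already open");
        }
        m_file = std::make_unique<File>(path);
    }

    // Methods that previously took a File* now operate on the member.
    void write_msg (const std::string& msg) {
        m_file->write(msg);
    }

    void close_file () {
        m_file.reset();  // Finish the current file before creating another
    }

private:
    struct File {
        explicit File (const std::string& /* path */) { /* open the file */ }
        void write (const std::string& /* msg */) { /* encode the message */ }
    };
    std::unique_ptr<File> m_file;  // The last (and only) created file
};
```

Under this design, the interleaving described above can't happen: a second `create_file` while a file is open either fails (as sketched) or could implicitly close the current file first.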
@kirkrodrigues I see, this is a valid concern. I think the better question to ask is: do we want to assume that we will always compress only one file at a time? If yes, then I can go with the change you described. If we want to keep it flexible, then I think what I suggested above makes more sense.
This is a possibility.
So in my experience, it's better to add features only as necessary; otherwise you end up designing for potential use cases (e.g., one day we will compress multiple files at the same time), which can hinder you from designing for the use cases that are here today. You may ask why we even have this feature if we're not using it. It was used in the early stages of CLP, and we haven't had a reason to remove it until now. So overall, my recommendation would be to enforce that a single file is compressed at a time. In the future, it's easy enough for us to remove this restriction if necessary. However, I'm fine with both alternatives.
Not ready yet. I've only fixed a bug; it still needs cleaning up.
Still reviewing, but a few things come to mind that might be worth fixing before I finish the review:
- How about calling it `ArrayBackedPosIntSet` instead of `IDOccurrenceArray`? I think that name more clearly matches the data structure's functionality.
- Can you fill out the PR template? It serves as good documentation for anyone who reviews the PR later.
- I see a few repeated errors:
  - When you add includes, make sure to alphabetize them within the group you're adding them to.
  - When you change a function's signature, make sure to check that the corresponding header comment still matches.
- `streaming_archive::writer::Archive::mutable_files` is no longer necessary since only one file can be mutable at a time.
- Ensure all your changes match the spacing rules.
…ders and reformat some lines to match coding standard
… deprecated function definition and slightly reformatted the code
May have missed some comments, so we might have to do one more round after this. Thanks for your work!
References
N/A
Description
The ID index in each segment is now an `ArrayBackedPosIntSet` backed by a `vector<bool>`, which uses 1 bit for each ID. Compared to `std::unordered_set`, the `ArrayBackedPosIntSet` consumes significantly less memory and achieves similar performance.
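A minimal sketch of the idea, assuming a simplified interface (`insert`/`count`/`size` are placeholders; the class in the PR may expose different methods):

```cpp
#include <cstddef>
#include <vector>

// Sketch: a set of non-negative integer IDs backed by a bit vector.
// Memory is ~1 bit per ID up to the largest ID seen, versus tens of
// bytes per entry for a node-based std::unordered_set.
class ArrayBackedPosIntSet {
public:
    void insert (std::size_t id) {
        if (id >= m_bits.size()) {
            m_bits.resize(id + 1, false);  // Grow to cover the new ID
        }
        if (!m_bits[id]) {
            m_bits[id] = true;
            ++m_num_ids;
        }
    }

    bool count (std::size_t id) const {
        return id < m_bits.size() && m_bits[id];
    }

    std::size_t size () const { return m_num_ids; }

private:
    std::vector<bool> m_bits;     // 1 bit per possible ID
    std::size_t m_num_ids{0};     // Number of distinct IDs inserted
};
```

Insertions and lookups are O(1), and the layout suits the segment-index use case, where IDs are small, dense positive integers.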
Validation performed
Ran compression locally on var-logs, openstack-24hrs, hadoop-24hrs, and spark-hibench. Confirmed that the output is correct and that the RSS usage and performance match expectations.
The following is the change in RSS (bytes):

| Dataset | Before | After | Saving |
| --- | --- | --- | --- |
| openstack-24hrs | 1,163,071,488 | 902,053,888 | ~22% |
| spark-hibench | 1,019,035,648 | 974,655,488 | ~4% |
| hadoop-24hrs | 80,744,448 | 76,058,624 | ~5.8% |
| var-log | 436,318,208 | 365,416,448 | ~16% |
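As a rough sanity check (my own back-of-envelope, not figures from the PR): a `vector<bool>` covering N possible IDs costs about N/8 bytes, so even 100 million IDs fit in roughly 12.5 MB, while a `std::unordered_set` of 64-bit IDs typically spends tens of bytes per element on nodes, buckets, and pointers. The savings therefore grow with the number of distinct IDs each segment references, which is consistent with the ~4-22% spread observed above.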