
Use ArrayBackedSet to replace std::set for index in segment #47

Merged
merged 41 commits into from
Jan 23, 2022

Conversation

@haiqi96 (Contributor) commented Jan 11, 2022

References

N/A

Description

  1. Introduced a new data structure, ArrayBackedPosIntSet, to replace std::unordered_set for tracking which IDs have occurred in a segment. The new data structure wraps a vector<bool>, using one bit per possible ID. Compared to std::unordered_set, ArrayBackedPosIntSet consumes significantly less memory while achieving similar performance.
  2. Removed the variable ID set from the encoded file object. Instead, IDs are added to the segment index as each message is encoded. For a file that doesn't begin with a timestamp, we don't yet know whether it will end up in the segment for files with timestamps or the segment for files without, so this change adds a temporary ID holder in the archive to handle that case.
  3. Embedded the file object in the archive object to enforce that only one file can be compressed at a time.
  4. Updated make-dictionaries-readable to also dump the segment index.
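To illustrate item 1, here is a minimal sketch of what a bit-vector-backed set of non-negative integer IDs could look like. Names and method signatures are assumptions based on the description above, not the actual CLP implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a set of non-negative integer IDs backed by a bit vector:
// one bit per possible ID instead of a hash-table node per element,
// trading a dense array for std::unordered_set's per-element overhead.
class ArrayBackedPosIntSet {
public:
    // Inserts an ID, growing the underlying bit vector as needed
    void insert (size_t id) {
        if (id >= m_bits.size()) {
            m_bits.resize(id + 1, false);
        }
        if (false == m_bits[id]) {
            m_bits[id] = true;
            ++m_num_elements;
        }
    }

    bool contains (size_t id) const {
        return id < m_bits.size() && m_bits[id];
    }

    size_t size () const { return m_num_elements; }

private:
    std::vector<bool> m_bits;    // 1 bit per possible ID
    size_t m_num_elements{0};
};
```

Since IDs in a segment are dense (they are assigned sequentially by the dictionaries), the bit vector stays compact, which is why it beats a hash set on memory here.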

Validation performed

Ran compression locally on var-logs, openstack-24hrs, hadoop-24hrs. Confirmed that the output is correct and the RSS usage & performance match expectations.

The following is the change in RSS (bytes):
  • openstack-24hrs: 1,163,071,488 -> 902,053,888 (~22% saving)
  • spark-hibench: 1,019,035,648 -> 974,655,488 (~4% saving)
  • hadoop-24hrs: 80,744,448 -> 76,058,624 (~5.8% saving)
  • var-log: 436,318,208 -> 365,416,448 (~16% saving)

haiqi96 and others added 23 commits November 26, 2021 15:47
…e added to the added_var_ids vector so as to improve testing coverage
…and_preload flow to directly write str_value of entry into the hash map
@kirkrodrigues (Member)

Thanks for this. One potential issue I see is that this assumes we compress one file at a time; this is true in the current code, but is not enforced. For instance, we could call Archive::create_file twice, then encode two different files at the same time:

  1. one which starts without timestamps, but has timestamps, and
  2. another which doesn't have timestamps.

file-1's first few logtype IDs and variables will end up in m_var_ids_without_timestamps_temp_holder and file-2's logtype IDs will also end up in m_var_ids_without_timestamps_temp_holder. If file-2 is appended to a segment first, it will be added to m_segment_for_files_without_timestamps and the corresponding segment index will erroneously include file-1's first few logtype IDs and variables.

One potential way to fix this is to enforce that only one encoded file can be created at a time. Specifically, Archive::create_file could stop returning a pointer; instead, the last created file would be maintained as a private member. All the Archive methods that currently take a file would no longer take one; they would use the member instead.
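The suggestion above could be sketched as follows. Class and method names are illustrative placeholders, not the actual CLP code:

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Placeholder for the encoded file object
class File {
public:
    explicit File (std::string path) : m_path(std::move(path)) {}
    const std::string& get_path () const { return m_path; }
private:
    std::string m_path;
};

// Sketch of the suggested Archive change: create_file no longer
// returns a File*; the archive keeps the single open file as a
// private member, so callers cannot encode two files concurrently.
class Archive {
public:
    // Opens a new file for encoding; throws if one is already open
    void create_file (const std::string& path) {
        if (m_file) {
            throw std::runtime_error("A file is already being encoded");
        }
        m_file = std::make_unique<File>(path);
    }

    // Methods that previously took a File* now operate on the member
    void write_msg (const std::string& msg) {
        if (nullptr == m_file) {
            throw std::runtime_error("No file is open");
        }
        // ... encode msg into the member file ...
        (void)msg;
    }

    void close_file () { m_file.reset(); }

    bool has_open_file () const { return nullptr != m_file; }

private:
    std::unique_ptr<File> m_file;  // at most one file open at a time
};
```

With this shape, the race described above cannot occur: a second create_file before the first file is closed fails loudly instead of silently mixing two files' IDs into the shared temporary holder.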

@haiqi96 (Contributor, Author) commented Jan 12, 2022

@kirkrodrigues I see, this is a valid concern.
What if we simply let each file have its own "temp holder" for messages without timestamps? That way, even with multiple files open at the same time, there won't be this issue.

I think the better question to ask is: do we want to assume that we will always compress only one file at a time? If yes, then I can go with the change you described. If we want to keep it flexible, then I think what I suggested above makes more sense.

@kirkrodrigues (Member)

> @kirkrodrigues I see, this is a valid concern. What if we simply let each file have its own "temp holder" for messages without timestamps? That way, even with multiple files open at the same time, there won't be this issue.

This is a possibility.

> I think the better question to ask is: do we want to assume that we will always compress only one file at a time? If yes, then I can go with the change you described. If we want to keep it flexible, then I think what I suggested above makes more sense.

In my experience, it's better to add features only as necessary; otherwise, you end up designing for potential use cases (e.g., one day we will compress multiple files at the same time), which can hinder you from designing for the use cases that exist today. You may ask: why do we even have this feature if we're not using it? It was used in the early stages of CLP, and we haven't had a reason to remove it until now.

So overall, my recommendation would be to enforce that a single file is compressed at a time. In the future, it's easy enough for us to remove this restriction if necessary. However, I'm fine with either alternative.

@haiqi96 (Contributor, Author) commented Jan 14, 2022

Not ready yet. I've only fixed a bug; it still needs cleanup.

@kirkrodrigues (Member) left a comment

Still reviewing, but a few things come to mind that might be worth fixing before I finish the review:

  • How about calling it ArrayBackedPosIntSet instead of IDOccurrenceArray? I think the name more clearly matches the data structure's functionality.
  • Can you fill out the PR template? It serves as good documentation for anyone who reviews the PR later.
  • I see a few repeated errors:
    • When you add includes, make sure to alphabetize them in the group you're adding them to.
    • When you change a function's signature, make sure to check that the corresponding header comment matches.
    • streaming_archive::writer::Archive::mutable_files is no longer necessary since only one file can be mutable at a time.
    • Ensure all your changes match the spacing rules.

Review comments (resolved) on: components/core/src/IDOccurrenceArray.hpp, components/core/src/streaming_archive/writer/Archive.cpp
…ders and reformat some lines to match coding standard
@kirkrodrigues (Member) left a comment

I may have missed some comments, so we might have to do one more round after this. Thanks for your work!

@haiqi96 haiqi96 changed the title Bit array Use ArrayBackedSet to replace std::set for index in segment Jan 20, 2022
Review comments (resolved) on: components/core/src/ArrayBackedPosIntSet.hpp, components/core/src/streaming_archive/writer/Archive.hpp, components/core/src/clp/utils.hpp, components/core/src/clp/utils.cpp
@kirkrodrigues kirkrodrigues merged commit 59f525f into y-scope:main Jan 23, 2022
@haiqi96 haiqi96 deleted the BitArray branch April 29, 2023 23:20