
Use ArrayBackedSet to replace std::set for index in segment #47

Merged
merged 41 commits into from
Jan 23, 2022

Conversation

@haiqi96 (Contributor) commented Jan 11, 2022

References

N/A

Description

  1. Introduced a new data structure, ArrayBackedPosIntSet, to replace std::unordered_set for tracking which IDs have occurred in a segment. The new data structure wraps a vector<bool>, using one bit per possible ID. Compared to std::unordered_set, ArrayBackedPosIntSet consumes significantly less memory while achieving similar performance.
  2. Removed the variable ID set from the encoded file object. Instead, IDs are added to the segment index as each message is encoded. For a file that doesn't begin with a timestamp, we don't yet know whether it will end up in the segment for files with timestamps or the segment for files without, so this change adds a temporary ID holder in the archive to handle that case.
  3. Embedded the file object in the archive object to enforce that only one file can be compressed at a time.
  4. Updated make-dictionaries-readable to also dump the segment index.
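To illustrate item 1, here is a minimal sketch of what a bit-vector-backed set of non-negative integer IDs could look like. Names and method signatures are assumptions based on the description above, not the actual CLP implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a set of non-negative integer IDs backed by a bit vector:
// one bit per possible ID instead of a hash-table node per element,
// trading a dense array for std::unordered_set's per-element overhead.
class ArrayBackedPosIntSet {
public:
    // Inserts an ID, growing the underlying bit vector as needed
    void insert (size_t id) {
        if (id >= m_bits.size()) {
            m_bits.resize(id + 1, false);
        }
        if (false == m_bits[id]) {
            m_bits[id] = true;
            ++m_num_elements;
        }
    }

    bool contains (size_t id) const {
        return id < m_bits.size() && m_bits[id];
    }

    size_t size () const { return m_num_elements; }

private:
    std::vector<bool> m_bits;    // 1 bit per possible ID
    size_t m_num_elements{0};
};
```

Since IDs in a segment are dense (they are assigned sequentially by the dictionaries), the bit vector stays compact, which is why it beats a hash set on memory here.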

Validation performed

Ran compression locally on var-logs, openstack-24hrs, hadoop-24hrs. Confirmed that the output is correct and the RSS usage & performance match expectations.

The following is the change in RSS (bytes):
  • openstack-24hrs: 1,163,071,488 -> 902,053,888 (~22% saving)
  • spark-hibench: 1,019,035,648 -> 974,655,488 (~4% saving)
  • hadoop-24hrs: 80,744,448 -> 76,058,624 (~5.8% saving)
  • var-log: 436,318,208 -> 365,416,448 (~16% saving)

haiqi96 and others added 23 commits November 26, 2021 15:47
…e added to the added_var_ids vector so as to improve testing coverage
…and_preload flow to directly write str_value of entry into the hash map
@kirkrodrigues (Member)

Thanks for this. One potential issue I see is that this assumes we compress one file at a time; this is true in the current code, but is not enforced. For instance, we could call Archive::create_file twice, then encode two different files at the same time:

  1. one which starts without timestamps, but has timestamps, and
  2. another which doesn't have timestamps.

file-1's first few logtype IDs and variables will end up in m_var_ids_without_timestamps_temp_holder and file-2's logtype IDs will also end up in m_var_ids_without_timestamps_temp_holder. If file-2 is appended to a segment first, it will be added to m_segment_for_files_without_timestamps and the corresponding segment index will erroneously include file-1's first few logtype IDs and variables.

One potential way to fix this is to enforce that only one encoded file can be created at a time. Specifically, Archive::create_file could stop returning a pointer; instead, the last created file would be maintained as a private member. All the Archive methods that currently take a file would no longer take one; they would use the member instead.
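The suggestion above could be sketched as follows. Class and method names are illustrative placeholders, not the actual CLP code:

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Placeholder for the encoded file object
class File {
public:
    explicit File (std::string path) : m_path(std::move(path)) {}
    const std::string& get_path () const { return m_path; }
private:
    std::string m_path;
};

// Sketch of the suggested Archive change: create_file no longer
// returns a File*; the archive keeps the single open file as a
// private member, so callers cannot encode two files concurrently.
class Archive {
public:
    // Opens a new file for encoding; throws if one is already open
    void create_file (const std::string& path) {
        if (m_file) {
            throw std::runtime_error("A file is already being encoded");
        }
        m_file = std::make_unique<File>(path);
    }

    // Methods that previously took a File* now operate on the member
    void write_msg (const std::string& msg) {
        if (nullptr == m_file) {
            throw std::runtime_error("No file is open");
        }
        // ... encode msg into the member file ...
        (void)msg;
    }

    void close_file () { m_file.reset(); }

    bool has_open_file () const { return nullptr != m_file; }

private:
    std::unique_ptr<File> m_file;  // at most one file open at a time
};
```

With this shape, the race described above cannot occur: a second create_file before the first file is closed fails loudly instead of silently mixing two files' IDs into the shared temporary holder.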

@haiqi96 (Contributor, Author) commented Jan 12, 2022

@kirkrodrigues I see, this is a valid concern.
What if we simply let each file have its own "temp holder" for messages without timestamps? That way, even with multiple files open at the same time, there won't be this issue.

I think the better question to ask is: do we want to assume that we will always compress only one file at a time? If yes, then I can go with the change you described. If we want to keep it flexible, then I think what I suggested above makes more sense.

@kirkrodrigues (Member)

> @kirkrodrigues I see, this is a valid concern. What if we simply let each file have its own "temp holder" for messages without timestamps? That way, even with multiple files open at the same time, there won't be this issue.

This is a possibility.

> I think the better question to ask is: do we want to assume that we will always compress only one file at a time? If yes, then I can go with the change you described. If we want to keep it flexible, then I think what I suggested above makes more sense.

In my experience, it's better to add features only as necessary; otherwise, you end up designing for potential use cases (e.g., one day we will compress multiple files at the same time), which can hinder you from designing for the use cases that exist today. You may ask: why do we even have this feature if we're not using it? It was used in the early stages of CLP, and we haven't had a reason to remove it until now.

So overall, my recommendation would be to enforce that a single file is compressed at a time. In the future, it's easy enough for us to remove this restriction if necessary. However, I'm fine with either alternative.

@haiqi96 (Contributor, Author) commented Jan 14, 2022

Not ready yet. I've only fixed a bug; it still needs cleanup.

@kirkrodrigues (Member) left a comment

Still reviewing, but a few things come to mind that might be worth fixing before I finish the review:

  • How about calling it ArrayBackedPosIntSet instead of IDOccurrenceArray? I think the name more clearly matches the data structure's functionality.
  • Can you fill out the PR template? It serves as good documentation for anyone who reviews the PR later.
  • I see a few repeated errors:
    • When you add includes, make sure to alphabetize them in the group you're adding them to.
    • When you change a function's signature, make sure to check that the corresponding header comment matches.
    • streaming_archive::writer::Archive::mutable_files is no longer necessary since only one file can be mutable at a time.
    • Ensure all your changes match the spacing rules.

Review comments (resolved) on: components/core/src/IDOccurrenceArray.hpp, components/core/src/streaming_archive/writer/Archive.cpp
…ders and reformat some lines to match coding standard
@kirkrodrigues (Member) left a comment

I may have missed some comments, so we might have to do one more round after this. Thanks for your work!

@haiqi96 haiqi96 changed the title Bit array Use ArrayBackedSet to replace std::set for index in segment Jan 20, 2022
Review comments (resolved) on: components/core/src/ArrayBackedPosIntSet.hpp, components/core/src/streaming_archive/writer/Archive.hpp, components/core/src/clp/utils.hpp, components/core/src/clp/utils.cpp
@kirkrodrigues kirkrodrigues merged commit 59f525f into y-scope:main Jan 23, 2022
@haiqi96 haiqi96 deleted the BitArray branch April 29, 2023 23:20