clp-s: Update core functionality to prepare for generic parser support #355

gibber9809 · 2024-04-15T20:10:48Z

Description

This PR bundles a number of changes that together pave the way for supporting generic parsers in clp-s. They include refactoring and new interfaces for special schema entries that denote which columns a given parser is responsible for, as well as changes that allow MPT IDs to repeat within a schema. The bulk of the change, though, is refactoring and optimizing clp-s internals.

Early experience with adding support for array structurization showed that some parts of clp-s were written in a way that made them hard to extend. For this reason, column resolution has been rewritten and the handling of wildcard columns in Output.cpp has been simplified. The array-structurization changes also revealed several performance bugs, which this PR addresses.

Note that these optimizations, while aimed at making generic parsers viable, also help baseline clp-s (~1.5x search speedup on large archives in some internal benchmarks).

To summarize the changes:

- Add an interface for custom parsers to easily create the special <parser type, length> tags in the unordered region of the schema
- Allow repeated MPT node IDs in the unordered region of the schema
- Rewrite the column resolution code
- Partially rewrite the handling of pure wildcards in Output.cpp to reduce the number of custom structures it uses
- Change the interface between SchemaReader and Output to handle repeated MPT node IDs
- Optimizations to improve speed when decompressing schemas with many columns
- Optimizations to the in-memory representation of the schema tree
- Optimizations to the in-memory representation of ColumnReader
- Optimization to initialize JSON marshalling data structures only after at least one result has been matched for some table
- Several optimizations to reuse buffers/memory between loading different tables
- Change to the table metadata section to record the decompressed in-memory size of each table
- New UnalignedSpan and ManagedBufferViewReader classes that allow ColumnReaders to safely share a view into a buffer containing all data for a table
- Remove unused/unnecessary data structures from Output
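
The idea of letting several column readers share views into one table buffer can be illustrated with a small sketch. This is not the PR's UnalignedSpan class: the name `UnalignedView`, its interface, and the helper `make_example_buffer` are hypothetical and only show why such a view reads values with `std::memcpy` rather than through a typed pointer, since column data inside a shared buffer may start at an arbitrary byte offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical read-only view over `count` values of type T that start at an
// arbitrary (possibly misaligned) byte position inside a shared buffer.
template <typename T>
class UnalignedView {
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");

public:
    UnalignedView(char const* begin, std::size_t count) : m_begin{begin}, m_count{count} {}

    [[nodiscard]] std::size_t size() const { return m_count; }

    // Copies the i-th value out with memcpy so that no misaligned T* is ever
    // dereferenced.
    [[nodiscard]] T operator[](std::size_t i) const {
        assert(i < m_count);
        T value;
        std::memcpy(&value, m_begin + i * sizeof(T), sizeof(T));
        return value;
    }

private:
    char const* m_begin;
    std::size_t m_count;
};

// Builds an example buffer: one tag byte followed by three int64_t values, so
// the int64_t column begins at byte offset 1 relative to the buffer start.
inline std::vector<char> make_example_buffer() {
    std::vector<char> buffer(1 + 3 * sizeof(int64_t));
    buffer[0] = 'x';
    int64_t const values[3] = {10, -20, 30};
    std::memcpy(buffer.data() + 1, values, sizeof(values));
    return buffer;
}
```

In this sketch the buffer owns the memory and every view is just a pointer plus a count, so many readers can be created over one decompressed table without further allocations.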

Validation performed

- Thorough benchmarking on core datasets after applying array structurization on top of this PR
- Testing to validate that edge cases for column resolution are handled in a way that is consistent with the previous implementation

Notes from commit 86d7a8d (Add nearly correct implementation of serialization for structurized json arrays):

- significant rewrite from previous marshalling approach
- some edge cases this change introduces aren't handled in JsonSerializer such that the document description is correct, but some brackets and commas can be wrong (e.g. [[]] decompresses to []])
- this change doesn't yet handle documents that are arrays at the root level

wraymo

Nice work! Left some comments, mostly regarding the style. I will look into the search logic later.

wraymo · 2024-04-17T15:52:32Z

components/core/src/clp_s/SchemaReader.hpp

+     * @param num_messages
+     * @param should_marshal_records
+     */
+    void reset(


Since we'll reset SchemaReader in this method, do we need to have a non-default constructor?

Good point. I had some vague idea that we might want a non-default constructor for a future situation where we want multiple schema readers alive at the same time, but I think it makes sense to have only a default constructor and use this reset method until we need another interface.
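
The pattern discussed in this thread, a default-constructed reader that is re-initialized through `reset` for each table, can be sketched as follows. The class `ReusableReader`, its members, and the `schema_id` parameter are hypothetical; only `num_messages` and `should_marshal_records` appear in the hunk above. The sketch relies on the fact that `std::vector::clear` keeps the allocated capacity, which is what makes reuse cheaper than constructing a new object per table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reader that is constructed once and reset for every table.
class ReusableReader {
public:
    ReusableReader() = default;

    // Re-initializes per-table state. clear() drops the elements but keeps the
    // vector's capacity, so later tables reuse the same allocation.
    void reset(int32_t schema_id, std::size_t num_messages, bool should_marshal_records) {
        m_schema_id = schema_id;
        m_num_messages = num_messages;
        m_should_marshal_records = should_marshal_records;
        m_column_ids.clear();
    }

    void append_column(int32_t column_id) { m_column_ids.push_back(column_id); }

    [[nodiscard]] int32_t schema_id() const { return m_schema_id; }
    [[nodiscard]] std::size_t num_columns() const { return m_column_ids.size(); }
    [[nodiscard]] std::size_t column_capacity() const { return m_column_ids.capacity(); }

private:
    int32_t m_schema_id{-1};
    std::size_t m_num_messages{0};
    bool m_should_marshal_records{false};
    std::vector<int32_t> m_column_ids;
};
```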

components/core/src/clp_s/SchemaReader.hpp

wraymo · 2024-04-17T15:59:20Z

components/core/src/clp_s/Schema.hpp

+     * @param schema_entry
+     * @return Whether the schema_entry is the delimeter for an unordered object or not
+     */
+    static int32_t schema_entry_is_unordered_object(int32_t schema_entry) {


Why not return bool?

You're right, this should be bool.

components/core/src/clp_s/Schema.hpp

wraymo · 2024-04-18T18:50:25Z

components/core/src/clp_s/Schema.hpp

+    /**
+     * Extracts the unordered object length from an unordered object delimeter.
+     * @param schema_entry
+     * @return The extracted NodeType


Is it the return value?

Right, this should be "The extracted object length".
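
To make the two threads above concrete, here is a sketch of packing a node type and a length into a single int32_t schema entry. The bit layout (top bit as delimiter flag, 7 bits of type, 24 bits of length), the constants, and the helper names other than `schema_entry_is_unordered_object` are assumptions made for illustration; the actual encoding in Schema.hpp may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical subset of node types, for illustration only.
enum class NodeType : uint8_t { Object = 1, StructuredArray = 2 };

// Assumed layout: bit 31 = delimiter flag, bits 24-30 = type, bits 0-23 = length.
constexpr uint32_t cDelimiterFlag = 0x8000'0000U;
constexpr uint32_t cTypeShift = 24;
constexpr uint32_t cTypeMask = 0x7FU;
constexpr uint32_t cLengthMask = 0x00FF'FFFFU;

// Packs <type, length> into one schema entry. The unsigned-to-signed cast is
// well defined in C++20 (modular conversion).
inline int32_t encode_unordered_object(NodeType type, uint32_t length) {
    assert(length <= cLengthMask);
    uint32_t const entry
            = cDelimiterFlag | (static_cast<uint32_t>(type) << cTypeShift) | length;
    return static_cast<int32_t>(entry);
}

// Ordinary node IDs are non-negative, so under this layout the flag bit is set
// only for delimiters.
inline bool schema_entry_is_unordered_object(int32_t schema_entry) {
    return 0 != (static_cast<uint32_t>(schema_entry) & cDelimiterFlag);
}

inline NodeType get_unordered_object_type(int32_t schema_entry) {
    return static_cast<NodeType>((static_cast<uint32_t>(schema_entry) >> cTypeShift) & cTypeMask);
}

inline uint32_t get_unordered_object_length(int32_t schema_entry) {
    return static_cast<uint32_t>(schema_entry) & cLengthMask;
}
```

Returning `bool` from the predicate, as agreed in the review, lets callers write `if (schema_entry_is_unordered_object(entry))` without relying on an integer-to-bool conversion.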

wraymo · 2024-04-18T19:03:01Z

components/core/src/clp_s/SchemaReader.hpp

@@ -26,36 +26,42 @@ class FilterClass {
    virtual void init(
            SchemaReader* reader,
            int32_t schema_id,
-            std::unordered_map<int32_t, BaseColumnReader*>& columns
+            std::vector<BaseColumnReader*> const& column_readers


Change the parameter name in the description?

components/core/src/clp_s/Utils.hpp

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>

wraymo

Sorry for the late review. Great work that covers a lot of aspects!

wraymo · 2024-05-05T21:31:37Z

components/core/src/clp_s/ParsedMessage.hpp

     * @param value
     */
-    inline void add_value(int32_t node_id, std::string const& value) {
+    template <typename T>
+    inline void add_value(int32_t node_id, T const& value) {


Also add @param node_id?

wraymo · 2024-05-05T21:34:52Z

components/core/src/clp_s/ParsedMessage.hpp

        m_message.emplace(node_id, value);
    }

    /**
-     * Adds a boolean value to the message for a given MST node ID.
+     * Adds a timestamp value and its encoding to the message for a given MST node ID.


And @param encoding_id?

wraymo · 2024-05-06T20:27:46Z

components/core/src/clp_s/SchemaReader.hpp


-    std::map<int32_t, std::variant<int64_t, double, std::string, uint8_t>> m_extracted_values;
+    std::map<int32_t, std::pair<size_t, Span<int32_t>>> m_global_id_to_unordered_object;


Will we use second of it in the future?

Yes, it is passed to the marshalling code for the relevant generic parser.

wraymo · 2024-05-06T20:37:57Z

components/core/src/clp_s/ArchiveReader.cpp

+    }
+
+    if (should_marshal_records) {
+        reader.mark_unordered_object(object_readers_begin, mst_subtree_root_node_id, schema_ids);


Do you want to change object_readers_begin to something like ...begin_pos or ...begin_offset?

wraymo · 2024-05-06T21:04:27Z

components/core/src/clp_s/SchemaTree.hpp

+     * Finds the root node for a subtree matching a given type given the root node for some subtree
+     * in which the subtree we are looking for can be found, and some descendent node of the subtree


I would write it in this way. But anything that can make it clear is fine.

    /**
     * Finds an ancestor node within a subtree that matches the given type. When multiple matching
     * nodes exist, returns the one closest to the root node of the subtree.
     * @param subtree_root_node The root node of the subtree
     * @param node The node to start searching from
     * @param subtree_type The type of the ancestor node to find
     * @return The ID of the ancestor node if it exists, otherwise -1
     */
    [[nodiscard]] int32_t find_matching_subtree_root_in_subtree(
            int32_t const subtree_root_node,
            int32_t node,
            NodeType type
    ) const;
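
One way to read the suggested docstring is as a walk up the parent links from `node` towards the subtree root, remembering the last match seen. The sketch below is a free function over a hypothetical `Node` struct rather than the SchemaTree member; whether the subtree root itself is eligible as a match is an assumption (here it is not), as is returning -1 when `node` does not lie inside the subtree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node types and node layout, for illustration only.
enum class NodeType : uint8_t { Object, StructuredArray, Integer };

struct Node {
    int32_t parent_id;  // -1 for the root of the whole tree
    NodeType type;
};

// Walks from `node` up to (but not including) `subtree_root_node` and returns
// the matching node closest to the subtree root, or -1 if there is none.
inline int32_t find_matching_subtree_root_in_subtree(
        std::vector<Node> const& nodes,
        int32_t subtree_root_node,
        int32_t node,
        NodeType type
) {
    int32_t closest_to_root = -1;
    while (node != subtree_root_node && -1 != node) {
        auto const& cur = nodes[static_cast<std::size_t>(node)];
        if (cur.type == type) {
            // Matches found later in the walk are closer to the subtree root
            closest_to_root = node;
        }
        node = cur.parent_id;
    }
    // Running off the top of the tree means `node` was not inside the subtree
    return (node == subtree_root_node) ? closest_to_root : -1;
}
```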

wraymo · 2024-05-07T15:58:33Z

components/core/src/clp_s/search/Output.hpp

+    std::unordered_map<int32_t, std::vector<ClpStringColumnReader*>> m_clp_string_readers;
+    std::unordered_map<int32_t, std::vector<VariableStringColumnReader*>> m_var_string_readers;


Are the vectors used to support duplicate node ids in an unordered region? And I think we can guarantee that columns with the same node id are within the same object and at the same level?

Yes, this is to support duplicate node IDs. They can be in different unordered objects depending on how you look at it, e.g. an unordered object can appear twice inside of an unordered array.
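
A minimal sketch of why the map values are vectors: when the same node ID can occur more than once in a schema, each occurrence has its own reader, and all of them need to be reachable from that ID. `ColumnReaderStub` and `group_readers_by_node_id` are hypothetical stand-ins, not clp-s types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a column reader; only the node ID matters here.
struct ColumnReaderStub {
    int32_t node_id;
    std::string label;
};

using ReaderMap = std::unordered_map<int32_t, std::vector<ColumnReaderStub const*>>;

// Groups readers by node ID. A plain map<int32_t, Reader*> would keep only one
// reader per ID and silently drop the repeated occurrences.
inline ReaderMap group_readers_by_node_id(std::vector<ColumnReaderStub> const& readers) {
    ReaderMap grouped;
    for (auto const& reader : readers) {
        grouped[reader.node_id].push_back(&reader);
    }
    return grouped;
}
```

The stored pointers refer to elements of the input vector, so in this sketch the vector must outlive the returned map.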

components/core/src/clp_s/search/SchemaMatch.cpp

wraymo · 2024-05-07T21:00:15Z

components/core/src/clp_s/search/SchemaMatch.cpp

+
+        // Check if the current node is accepted
+        auto const& cur_node = m_tree->get_node(cur_node_id);
+        bool empty_key = cur_node.get_key_name().empty();


What about is_key_name_empty?

components/core/src/clp_s/search/SchemaMatch.cpp

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>

wraymo · 2024-05-09T17:57:59Z

components/core/src/clp_s/search/Output.cpp

            || m_match.schema_searches_against_column(schema_id, column_id))
        {
            ClpStringColumnReader* clp_reader = dynamic_cast<ClpStringColumnReader*>(column_reader);
            VariableStringColumnReader* var_reader
                    = dynamic_cast<VariableStringColumnReader*>(column_reader);
+            DateStringColumnReader* date_reader
+                    = dynamic_cast<DateStringColumnReader*>(column_reader);
            if (clp_reader != nullptr && clp_reader->get_type() == NodeType::ClpString) {
                m_clp_string_readers[column_id].push_back(clp_reader);
            } else if (var_reader != nullptr && var_reader->get_type() == NodeType::VarString) {


Do you want to change clp_reader != nullptr and var_reader != nullptr to nullptr != clp_reader and nullptr != var_reader?

Sure, might as well.
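
For reference, the constant-on-the-left comparison style agreed on here looks like this in a reduced form of the dispatch above. The reader classes are empty stubs that only reuse the names from the hunk, and `classify_reader` is a hypothetical helper; the real code also checks `get_type()`.

```cpp
#include <cassert>
#include <string>

// Empty stubs that reuse the class names from the hunk above.
struct BaseColumnReader {
    virtual ~BaseColumnReader() = default;
};
struct ClpStringColumnReader : BaseColumnReader {};
struct VariableStringColumnReader : BaseColumnReader {};
struct Int64ColumnReader : BaseColumnReader {};

// Hypothetical helper: dynamic_cast yields nullptr when the dynamic type does
// not match, and each comparison puts the constant on the left.
inline std::string classify_reader(BaseColumnReader* column_reader) {
    auto* clp_reader = dynamic_cast<ClpStringColumnReader*>(column_reader);
    auto* var_reader = dynamic_cast<VariableStringColumnReader*>(column_reader);
    if (nullptr != clp_reader) {
        return "clp_string";
    } else if (nullptr != var_reader) {
        return "var_string";
    }
    return "other";
}
```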

wraymo

The commit message looks good to me. @kirkrodrigues Do you want to take a look at this PR?

Needs some discussion on the span classes

gibber9809 added 30 commits April 12, 2024 02:25

Refactor clp-s NodeType enum and add StructuredArray NodeType

3e0435c

Implement compression path for naive array structurization

5b8c97c

Fix bugs causing AST to be displayed incorrectly using print() debug function

d5440c1

Use bithacks to encode unstructured object delimiters using a single schema entry

1f3b1f3

Add docstrings to new methods in Schema.hpp

959f7a8

Record the number ordered and unordered entries in each schema

7ea6eb4

Get most of the way towards decompression for structurized arrays

8bc1356

Fix trivial bug related to marshalling empty structured arrays

87502f3

Fix bug where keys for string columns for objects inside structured arrays were not recorded

f6b8aa5

Add nearly correct implementation of serialization for structurized json arrays

86d7a8d

Fix trivial bug

8cf5f0c

Track node depth in schema tree

624f4c4

Make ColumnDescriptor clean up sequences of multiple wildcard tokens in a row

994946f

Rewrite column resolution to increase understandability, and make SchemaMatch work for structured arrays

c2ea3b6

Fix edge case where pure wildcard flag is sometimes not set on ColumnDescriptors

cd812e5

Fix bug in Column Resolution

be178be

Add comment documenting known broken case for column resolution

456ab68

Support structured arrays in last stage of search

8ed2f83

Rename first object in hierarchy from 'root' to the empty string to be consistent

f686756

Simplify unordered object marshalling code and slightly improve performance

7007c8a

Remove unused include

7ee691e

Implement optimization to reuse same schema reader instead of allocating new memory each time

5c7e6f9

Lazily initialize data-structures related to serialization to reduce overhead when a searched schema returns no results

36496b5

Fix performance bug causing slow loading of large SchemaMap

a0ac519

Use vector<SchemaNode> instead of vector<std::shared_ptr<SchemaNode>> in SchemaTree

c67b85b

Introduce hint for ZstdDecompressor that the compressor will be reused to avoid re-allocating internal buffer

acf8b92

Rework column readers to use unaligned views into shared buffer to avoid large number of allocations and calls into zstd

edd32b2

Remove unused struct

8c1f015

Get rid of hack related to values inside of arrays during json serialization

7d1e236

Remove code specific to array-structurization support

262ac9e

gibber9809 requested a review from wraymo April 15, 2024 20:10

wraymo reviewed Apr 18, 2024

gibber9809 and others added 2 commits April 19, 2024 10:56

Apply suggestions from code review

ba7b844

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>

Address more review comments

be045e2

gibber9809 requested a review from wraymo April 19, 2024 15:55

gibber9809 added 5 commits April 22, 2024 11:04

Add missing entry in CMakeLists.txt

96596e6

Merge remote-tracking branch 'upstream/main' into clp-core-generic-parser-changes

e673487

Fix bug with uninitialized local schema tree during marshalling

d5218e1

Fix performance bug with wildcard search on large number of schemas

33fe38f

Merge remote-tracking branch 'upstream/main' into clp-core-generic-parser-changes

870da2f

wraymo reviewed May 7, 2024

gibber9809 and others added 2 commits May 8, 2024 10:49

Apply suggestions from code review

f37b57c

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>

Address review comments

dd66174

gibber9809 requested a review from wraymo May 8, 2024 18:29

wraymo reviewed May 9, 2024

Minor style change

ac1d5b1

gibber9809 requested a review from wraymo May 9, 2024 18:39

wraymo previously approved these changes May 9, 2024

Improve docstrings for Span and UnalignedSpan classes

2374dfa

gibber9809 requested a review from wraymo May 10, 2024 16:03

gibber9809 added 5 commits May 10, 2024 16:54

Rename UnalignedSpan UnalignedMemSpan

ab35d8c

Upgrade build to use c++20

0800800

Replace uses of our custom Span class with std::span

36c80a5

Fix build issue on MacOS

8dba752

Merge remote-tracking branch 'upstream/main' into clp-core-generic-parser-changes

21c5611

wraymo approved these changes May 13, 2024

gibber9809 merged commit 3e95aaf into y-scope:main May 13, 2024
11 checks passed

gibber9809 mentioned this pull request May 24, 2024

clp-s: Add support for serializing structured arrays. #413

Merged
