Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clp-s: Update core functionality to prepare for generic parser support #355

Merged
merged 51 commits into from
May 13, 2024

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Apr 15, 2024

Description

This PR contains a laundry list of changes which in combination pave the way for supporting generic parsers in clp-s. This includes some refactoring and new interfaces to add support for special entries in a schema to denote which columns a given parser is responsible for, as well as changes to allow repeating MPT IDs inside of a schema. The bulk of this change though focuses on refactoring and optimizing clp-s internals.

Early experience with adding support for array-structurization has shown that some parts of clp-s had been written in such an obtuse way that they were hard to extend; for this reason column resolution has been rewritten, and handling of wildcard columns in Output.cpp has been simplified. As well, the changes for array-structurization revealed several performance bugs which this PR addresses.

It should be noted that these optimizations, while targeted at making generic parsers viable, also help baseline clp-s (~1.5x search speedup for large archives for some internal benchmarks).

To summarize the changes:

  • Add interface for custom parsers to easily create the special <parser type, length> tags in the unordered region of the schema
  • Allow repeated MPT node IDs in the unordered region of the schema
  • Rewrite Column Resolution code
  • Partial rewrite for handling pure wildcards in Output.cpp to reduce the number of custom structures they use
  • Change to interface between SchemaReader and Output to handle repeated MPT node IDs
  • Optimizations to improve speed for decompressing schemas with many columns
  • Optimizations on in memory representation of schema tree
  • Optimizations on in memory representation of ColumnReader
  • Optimization to initialize json marshalling data structures only after at least one result has been matched for some table
  • Several optimisations to re-use buffers/memory between loading different tables
  • Change to table metadata section to record decompressed in-memory size of each table
  • New UnalignedSpan and ManagedBufferViewReader classes that allow ColumnReaders to safely share view into a buffer containing all data for a table
  • Remove unused/unnecessary data structures from Output

Validation performed

  • Thorough benchmarking on core datasets after applying array structurization on top of this PR
  • Testing to validate that edge cases for column resolution are handled in a way that is consistent with the previous implementation

	- significant rewrite from previous marshalling approach
	- some edge cases this change introduces aren't handled in
JsonSerializer such that the document description is correct, but some
brackets and commas can be wrong (e.g. [[]] decompresses to []])
	- this change doesn't yet handle documents that are arrays at
the root level
…overhead when a searched schema returns no results
…oid large number of allocations and calls into zstd
@gibber9809 gibber9809 requested a review from wraymo April 15, 2024 20:10
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Left some comments, mostly regarding the style. I will look into the search logic later.

* @param num_messages
* @param should_marshal_records
*/
void reset(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we'll reset SchemaReader in this method, do we need to have a non-default constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I had some vague idea that we might want a non-default constructor for a future situation where we want multiple schema readers alive at the same time, but I think it makes sense to have only a default constructor and use this reset method until we need another interface.

components/core/src/clp_s/SchemaReader.hpp Outdated Show resolved Hide resolved
* @param schema_entry
* @return Whether the schema_entry is the delimeter for an unordered object or not
*/
static int32_t schema_entry_is_unordered_object(int32_t schema_entry) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this should be bool.

components/core/src/clp_s/Schema.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/Schema.hpp Outdated Show resolved Hide resolved
/**
* Extracts the unordered object length from an unordered object delimeter.
* @param schema_entry
* @return The extracted NodeType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the return value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right this should be "The extracted object length"

@@ -26,36 +26,42 @@ class FilterClass {
virtual void init(
SchemaReader* reader,
int32_t schema_id,
std::unordered_map<int32_t, BaseColumnReader*>& columns
std::vector<BaseColumnReader*> const& column_readers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the parameter name in the description?

components/core/src/clp_s/Utils.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/Utils.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/Utils.hpp Outdated Show resolved Hide resolved
gibber9809 and others added 2 commits April 19, 2024 10:56
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
@gibber9809 gibber9809 requested a review from wraymo April 19, 2024 15:55
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. Great work that covers a lot of aspects!

* @param value
*/
inline void add_value(int32_t node_id, std::string const& value) {
template <typename T>
inline void add_value(int32_t node_id, T const& value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add @param node_id?

m_message.emplace(node_id, value);
}

/**
* Adds a boolean value to the message for a given MST node ID.
* Adds a timestamp value and its encoding to the message for a given MST node ID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And @param encoding_id?


std::map<int32_t, std::variant<int64_t, double, std::string, uint8_t>> m_extracted_values;
std::map<int32_t, std::pair<size_t, Span<int32_t>>> m_global_id_to_unordered_object;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we use second of it in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is passed to the marshalling code for the relevant generic parser.

}

if (should_marshal_records) {
reader.mark_unordered_object(object_readers_begin, mst_subtree_root_node_id, schema_ids);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to change object_readers_begin to something like ...begin_pos or ...begin_offset?

Comment on lines 116 to 117
* Finds the root node for a subtree matching a given type given the root node for some subtree
* in which the subtree we are looking for can be found, and some descendent node of the subtree
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write it in this way. But anything that can make it clear is fine.

    /**
     * Finds an ancestor node within a subtree that matches the given type. When multiple matching
     * nodes exist, returns the one closest to the root node of the subtree.
     * @param subtree_root_node The root node of the subtree
     * @param node The node to start searching from
     * @param subtree_type The type of the ancestor node to find
     * @return The ID of the ancestor node if it exists, otherwise -1
     */
    [[nodiscard]] int32_t find_matching_subtree_root_in_subtree(
            int32_t const subtree_root_node,
            int32_t node,
            NodeType type
    ) const;

Comment on lines +81 to +82
std::unordered_map<int32_t, std::vector<ClpStringColumnReader*>> m_clp_string_readers;
std::unordered_map<int32_t, std::vector<VariableStringColumnReader*>> m_var_string_readers;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the vectors used to support duplicate node ids in an unordered region? And I think we can guarantee that columns with the same node id are within the same object and at the same level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is to support duplicate node IDs. They can be in different unordered objects depending on how you look at it, e.g. an unordered object can appear twice inside of an unordered array.

components/core/src/clp_s/search/SchemaMatch.cpp Outdated Show resolved Hide resolved

// Check if the current node is accepted
auto const& cur_node = m_tree->get_node(cur_node_id);
bool empty_key = cur_node.get_key_name().empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about is_key_name_empty?

components/core/src/clp_s/search/SchemaMatch.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/search/SchemaMatch.cpp Outdated Show resolved Hide resolved
gibber9809 and others added 2 commits May 8, 2024 10:49
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
@gibber9809 gibber9809 requested a review from wraymo May 8, 2024 18:29
|| m_match.schema_searches_against_column(schema_id, column_id))
{
ClpStringColumnReader* clp_reader = dynamic_cast<ClpStringColumnReader*>(column_reader);
VariableStringColumnReader* var_reader
= dynamic_cast<VariableStringColumnReader*>(column_reader);
DateStringColumnReader* date_reader
= dynamic_cast<DateStringColumnReader*>(column_reader);
if (clp_reader != nullptr && clp_reader->get_type() == NodeType::ClpString) {
m_clp_string_readers[column_id].push_back(clp_reader);
} else if (var_reader != nullptr && var_reader->get_type() == NodeType::VarString) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to change clp_reader != nullptr and var_reader != nullptr to nullptr != clpreader and nullptr != var_reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, might as well

@gibber9809 gibber9809 requested a review from wraymo May 9, 2024 18:39
wraymo
wraymo previously approved these changes May 9, 2024
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message looks good to me. @kirkrodrigues Do you want to take a look at this PR?

@wraymo wraymo dismissed their stale review May 9, 2024 20:36

Needs some discussion on the span classes

@gibber9809 gibber9809 requested a review from wraymo May 10, 2024 16:03
@gibber9809 gibber9809 merged commit 3e95aaf into y-scope:main May 13, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants