-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clp-s: Update core functionality to prepare for generic parser support #355
clp-s: Update core functionality to prepare for generic parser support #355
Conversation
…rrays were not recorded
- significant rewrite from previous marshalling approach - some edge cases this change introduces aren't handled in JsonSerializer such that the document description is correct, but some brackets and commas can be wrong (e.g. [[]] decompresses to []]) - this change doesn't yet handle documents that are arrays at the root level
…emaMatch work for structured arrays
…ing new memory each time
…overhead when a searched schema returns no results
…d to avoid re-allocating internal buffer
…oid large number of allocations and calls into zstd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Left some comments, mostly regarding the style. I will look into the search logic later.
* @param num_messages | ||
* @param should_marshal_records | ||
*/ | ||
void reset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we'll reset SchemaReader in this method, do we need to have a non-default constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I had some vague idea that we might want a non-default constructor for a future situation where we want multiple schema readers alive at the same time, but I think it makes sense to have only a default constructor and use this reset method until we need another interface.
components/core/src/clp_s/Schema.hpp
Outdated
* @param schema_entry | ||
* @return Whether the schema_entry is the delimeter for an unordered object or not | ||
*/ | ||
static int32_t schema_entry_is_unordered_object(int32_t schema_entry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this should be bool.
components/core/src/clp_s/Schema.hpp
Outdated
/** | ||
* Extracts the unordered object length from an unordered object delimeter. | ||
* @param schema_entry | ||
* @return The extracted NodeType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the return value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right this should be "The extracted object length"
@@ -26,36 +26,42 @@ class FilterClass { | |||
virtual void init( | |||
SchemaReader* reader, | |||
int32_t schema_id, | |||
std::unordered_map<int32_t, BaseColumnReader*>& columns | |||
std::vector<BaseColumnReader*> const& column_readers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the parameter name in the description?
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review. Great work that covers a lot of aspects!
* @param value | ||
*/ | ||
inline void add_value(int32_t node_id, std::string const& value) { | ||
template <typename T> | ||
inline void add_value(int32_t node_id, T const& value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add @param node_id
?
m_message.emplace(node_id, value); | ||
} | ||
|
||
/** | ||
* Adds a boolean value to the message for a given MST node ID. | ||
* Adds a timestamp value and its encoding to the message for a given MST node ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And @param encoding_id
?
|
||
std::map<int32_t, std::variant<int64_t, double, std::string, uint8_t>> m_extracted_values; | ||
std::map<int32_t, std::pair<size_t, Span<int32_t>>> m_global_id_to_unordered_object; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we use second
of it in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is passed to the marshalling code for the relevant generic parser.
} | ||
|
||
if (should_marshal_records) { | ||
reader.mark_unordered_object(object_readers_begin, mst_subtree_root_node_id, schema_ids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to change object_readers_begin
to something like ...begin_pos
or ...begin_offset
?
* Finds the root node for a subtree matching a given type given the root node for some subtree | ||
* in which the subtree we are looking for can be found, and some descendent node of the subtree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write it in this way. But anything that can make it clear is fine.
/**
* Finds an ancestor node within a subtree that matches the given type. When multiple matching
* nodes exist, returns the one closest to the root node of the subtree.
* @param subtree_root_node The root node of the subtree
* @param node The node to start searching from
* @param subtree_type The type of the ancestor node to find
* @return The ID of the ancestor node if it exists, otherwise -1
*/
[[nodiscard]] int32_t find_matching_subtree_root_in_subtree(
int32_t const subtree_root_node,
int32_t node,
NodeType type
) const;
std::unordered_map<int32_t, std::vector<ClpStringColumnReader*>> m_clp_string_readers; | ||
std::unordered_map<int32_t, std::vector<VariableStringColumnReader*>> m_var_string_readers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the vectors used to support duplicate node ids in an unordered region? And I think we can guarantee that columns with the same node id are within the same object and at the same level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is to support duplicate node IDs. They can be in different unordered objects depending on how you look at it, e.g. an unordered object can appear twice inside of an unordered array.
|
||
// Check if the current node is accepted | ||
auto const& cur_node = m_tree->get_node(cur_node_id); | ||
bool empty_key = cur_node.get_key_name().empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about is_key_name_empty
?
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
|| m_match.schema_searches_against_column(schema_id, column_id)) | ||
{ | ||
ClpStringColumnReader* clp_reader = dynamic_cast<ClpStringColumnReader*>(column_reader); | ||
VariableStringColumnReader* var_reader | ||
= dynamic_cast<VariableStringColumnReader*>(column_reader); | ||
DateStringColumnReader* date_reader | ||
= dynamic_cast<DateStringColumnReader*>(column_reader); | ||
if (clp_reader != nullptr && clp_reader->get_type() == NodeType::ClpString) { | ||
m_clp_string_readers[column_id].push_back(clp_reader); | ||
} else if (var_reader != nullptr && var_reader->get_type() == NodeType::VarString) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to change clp_reader != nullptr
and var_reader != nullptr
to nullptr != clpreader
and nullptr != var_reader
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, might as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message looks good to me. @kirkrodrigues Do you want to take a look at this PR?
Description
This PR contains a laundry list of changes which in combination pave the way for supporting generic parsers in clp-s. This includes some refactoring and new interfaces to add support for special entries in a schema to denote which columns a given parser is responsible for, as well as changes to allow repeating MPT IDs inside of a schema. The bulk of this change though focuses on refactoring and optimizing clp-s internals.
Early experience with adding support for array-structurization has shown that some parts of clp-s had been written in such an obtuse way that they were hard to extend; for this reason column resolution has been rewritten, and handling of wildcard columns in Output.cpp has been simplified. As well, the changes for array-structurization revealed several performance bugs which this PR addresses.
It should be noted that these optimizations, while targeted at making generic parsers viable, also help baseline clp-s (~1.5x search speedup for large archives for some internal benchmarks).
To summarize the changes:
Validation performed