-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback #14603
Changes from 12 commits
301c51f
2efce4d
5ad0306
dcc3103
74b25ed
c8608a5
d62cd2e
eb6d046
11a529a
a937528
fe5d844
cc683d6
a436a28
8304db0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -24,6 +24,7 @@ | |||||
|
||||||
#include "parquet/exception.h" | ||||||
#include "parquet/level_conversion.h" | ||||||
#include "parquet/metadata.h" | ||||||
#include "parquet/platform.h" | ||||||
#include "parquet/properties.h" | ||||||
#include "parquet/schema.h" | ||||||
|
@@ -55,6 +56,29 @@ static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; | |||||
// 16 KB is the default expected page header size | ||||||
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024; | ||||||
|
||||||
// \brief DataPageStats stores encoded statistics and number of values/rows for | ||||||
// a page. | ||||||
struct PARQUET_EXPORT DataPageStats { | ||||||
DataPageStats(EncodedStatistics* encoded_statistics, int32_t num_values, | ||||||
std::optional<int32_t> num_rows) | ||||||
: encoded_statistics(encoded_statistics), | ||||||
is_stats_set(encoded_statistics->is_set()), | ||||||
num_values(num_values), | ||||||
num_rows(num_rows) {} | ||||||
|
||||||
// Encoded statistics extracted from the page header. | ||||||
EncodedStatistics* encoded_statistics; | ||||||
// False if there were no encoded statistics in the page header. | ||||||
bool is_stats_set; | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. would encoded_stats just be null here? are both needed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. I removed is_stats_set, and added a comment that encoded_statistics would be nullptr in that case. Also, added a test for it. |
||||||
// Number of values stored in the page. Filled for both V1 and V2 data pages. | ||||||
// For repeated fields, this can be greater than number of rows. For | ||||||
// non-repeated fields, this will be the same as the number of rows. | ||||||
int32_t num_values; | ||||||
// Number of rows stored in the page. std::nullopt for V1 data pages since | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe rephrase, is that this might not be available for v1 data pages. It still seems like it is possible to set if page indexes are written for v1? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||||||
// num_rows is not stored in format::DataPageHeader. | ||||||
std::optional<int32_t> num_rows; | ||||||
}; | ||||||
|
||||||
class PARQUET_EXPORT LevelDecoder { | ||||||
public: | ||||||
LevelDecoder(); | ||||||
|
@@ -100,6 +124,8 @@ struct CryptoContext { | |||||
// Abstract page iterator interface. This way, we can feed column pages to the | ||||||
// ColumnReader through whatever mechanism we choose | ||||||
class PARQUET_EXPORT PageReader { | ||||||
using DataPageFilter = std::function<bool(const DataPageStats&)>; | ||||||
|
||||||
public: | ||||||
virtual ~PageReader() = default; | ||||||
|
||||||
|
@@ -115,11 +141,27 @@ class PARQUET_EXPORT PageReader { | |||||
bool always_compressed = false, | ||||||
const CryptoContext* ctx = NULLPTR); | ||||||
|
||||||
// If data_page_filter_ is present (not null), NextPage() will call the | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||||
// callback function exactly once per page in the order the pages appear in | ||||||
// the column. If the callback function returns true the page will be | ||||||
// skipped. The callback will be called only if the page type is DATA_PAGE or | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this would read better if it was advertised as a page filter (returning true to include a page and false to exclude it). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, but this should then skip the page if the returned value is false, not true. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I feel like since it is a filter, it should return true if it is filtering? But I am not opposed to the other way around, especially if there is a convention for how it should be. |
||||||
// DATA_PAGE_V2. Dictionary pages will not be skipped. | ||||||
// Caller must check DataPageStats.is_stats_set = true before using the filled | ||||||
// encoded_statistics. | ||||||
// \note API EXPERIMENTAL | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add some description of what happens when the file does not contain page statistics (e.g. because the writer did not write them)? Will the filter simply not be called? Or will it be called with some kind of "empty" statistics? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. I added a is_stats_set field that should be checked before using the filled encoded_stats. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I removed the is_stats_set, it will return nullptr for encoded_statistics if there are no statistics in the header. I added that to the comments. |
||||||
void set_data_page_filter(DataPageFilter data_page_filter) { | ||||||
data_page_filter_ = std::move(data_page_filter); | ||||||
} | ||||||
|
||||||
// @returns: shared_ptr<Page>(nullptr) on EOS, std::shared_ptr<Page> | ||||||
// containing new Page otherwise | ||||||
virtual std::shared_ptr<Page> NextPage() = 0; | ||||||
|
||||||
virtual void set_max_page_header_size(uint32_t size) = 0; | ||||||
|
||||||
protected: | ||||||
// Callback that decides if we should skip a page or not. | ||||||
DataPageFilter data_page_filter_; | ||||||
}; | ||||||
|
||||||
class PARQUET_EXPORT ColumnReader { | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets clarify when this might be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry if I missed it but if we aren't applying business logic to determine when statistics might not be valid, we should clearly document this and how users are expected to verify if they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should perhaps be
const
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it a const.
@emkornfield, I made it clear in the comments for setting the callback function that the caller is responsible to check that statistics are correct.