-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback #14603
Conversation
header metadata using a callback.
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format?
or
See also: |
@emkornfield, @pitrou could you take a look at this? Thanks! |
I think to avoid exposing thrift headers we need an alternate representation of stats. Please take a look at how RowGroup stats are exposed and re-use or model the necessary structures off of that. |
|
|
cpp/src/parquet/column_reader.cc
Outdated
bool set_skip_page_callback( | ||
std::function<bool(const format::PageHeader&)> skip_page_callback) override { | ||
skip_page_callback_ = skip_page_callback; | ||
has_skip_page_callback_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can avoid the has_skip_page_callback_ flag by simply checking if (skip_page_callback_)
https://en.cppreference.com/w/cpp/utility/functional/function/operator_bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.h
Outdated
@@ -28,6 +28,7 @@ | |||
#include "parquet/properties.h" | |||
#include "parquet/schema.h" | |||
#include "parquet/types.h" | |||
#include "parquet/thrift_internal.h" // IWYU pragma: keep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might fall the CI check due to exposing thrift symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
// Once we have the header, we will call the skip_page_call_back_ to | ||
// determine if we should be skipping this page. If yes, we will advance the | ||
// stream to the next page. | ||
if(has_skip_page_callback_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we have done is to let the PageReader be aware of Offset Index belong to the pages of the RowGroup. In this way, it can be smart enough to move among pages and handle the skip logic. The prerequisite is https://issues.apache.org/jira/browse/ARROW-10158. If that sounds good, I can pick up ARROW-10158 to contribute our implementation. WDYT? @fatemehp @emkornfield @pitrou
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we have done is to let the PageReader be aware of Offset Index belong to the pages of the RowGroup
Do you have example code here for what you mean?
I can pick up ARROW-10158 to contribute our implementation.
Is the implementation compatible with the callback approach? If you are willing to contribute it, it seems like it would be valuable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wgtmac. With this pull request, we would like to specifically support the case where the Page Index is NOT available. We still want to be able to skip pages based on the metadata in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we have done is to let the PageReader be aware of Offset Index belong to the pages of the RowGroup
Do you have example code here for what you mean?
I can pick up ARROW-10158 to contribute our implementation.
Is the implementation compatible with the callback approach? If you are willing to contribute it, it seems like it would be valuable.
I agree with what @fatemehp has said. My proposal is orthogonal (and compatible) to the current patch. I can contribute the Page Index implementation and then utilize the Offset Index to skip when available.
cpp/src/parquet/column_reader.h
Outdated
@@ -115,6 +116,17 @@ class PARQUET_EXPORT PageReader { | |||
bool always_compressed = false, | |||
const CryptoContext* ctx = NULLPTR); | |||
|
|||
// @returns: true if the skip callback was successfully set. Returns false | |||
// if the callback was not set or not supported. | |||
// If supported, NextPage() will use this callback to determine if it should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there contexts where this wouldn't be supported? Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have a concrete example. Since this is an abstract class, I tried to make it a more general case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't have a failure case here in mind, then lets either make the method void (YAGNI) if we think there is a possible error, then Status should be returned. Similar question is does this need to be virtual?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This pull request intends to support skipping pages using the metadata. This is especially useful when page indexes are NOT present. Even so, We still need to think through the case that the page index is present. Here is what I think. When we have a page index, we will probably have a separate API to skip to a page at a particular offset. If a callback is defined, it would take effect via the NextPage calls. So we skip to a page at an offset, and when we call NextPage, it will return the next eligible page. |
format::DataPageHeader and format::DataPageHeaderV2.
@pitrou, @emkornfield I have updated this pull request and added more tests. Please take a look. |
cpp/src/parquet/metadata.h
Outdated
|
||
bool Equals(const DataPageStats& other) const; | ||
|
||
int32_t num_values() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a notion of statistics (see statistics.h), is there a reason to not use that as a member here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The statistics has more functionality/dependency than we need. It is meant for accumulating values and computing the min/max. Also, we need a column descriptor to make it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EncodedStatistics looks almost identical to this though? For normal statistics It looks the like the descriptor is only needed to construct a comparator, I wonder if there is some refactoring that could separate the two? It seems like having a 3rd type of statitistics would add confusion, so I want to make sure there is a strong reason to add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could use EncodedStatistics, it is missing num_rows and num_values though. Should we refactor EncodedStatistics to meet our needs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a slightly preference for composing encoded stats into this new object, rather than refactoring. Do you have a preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this part to use encoded stats. Please take a look.
cpp/src/parquet/metadata.h
Outdated
bool Equals(const DataPageStats& other) const; | ||
|
||
int32_t num_values() const; | ||
int32_t num_rows() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is the consumer supposed to know if this is exact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this throws an exception if isn't data table V2, the user has no way of knowing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should expose is_v2_ as well. I will do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_rows returns an optional now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, and valuable information that might not be obvious to a parquet novice. Can you add this fact to the comments for this addition somewhere?
I added this to the comments in the DataPageStats class. please take a look.
cpp/src/parquet/column_reader.cc
Outdated
DataPageStats::Make(&data_page_header); | ||
if (skip_page_callback_(data_page_stats.get())) { | ||
PARQUET_THROW_NOT_OK(stream_->Advance(compressed_len)); | ||
return NextPage(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it pay to refactor this a little bit so we can use iteration instead of recursion? (I think it would be good to avoid stack overflow for poorly constructed files that we filter out most pages for)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.cc
Outdated
} else { | ||
data_page_header = current_page_header_.data_page_header_v2; | ||
} | ||
std::unique_ptr<DataPageStats> data_page_stats = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one other thing to consider is whther statistics are correct:
arrow/cpp/src/parquet/metadata.cc
Line 1325 in 147b5c9
bool ApplicationVersion::HasCorrectStatistics(Type::type col_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the caller should check that the stats are valid before setting a callback function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made that clear in the comments for setting the callback function.
cpp/src/parquet/column_reader.cc
Outdated
seen_num_values_ += header.num_values; | ||
if (skip_page_callback_) { | ||
DataPageStats data_page_stats(page_statistics, header.num_values, | ||
header.num_rows); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should num_rows be validated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.cc
Outdated
} | ||
} else if (page_type == PageType::DATA_PAGE_V2) { | ||
const format::DataPageHeaderV2& header = current_page_header_.data_page_header_v2; | ||
if (header.num_values < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does num_rows have to be validated as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.cc
Outdated
} else if (page_type == PageType::DICTIONARY_PAGE) { | ||
const format::DictionaryPageHeader& dict_header = | ||
current_page_header_.dictionary_page_header; | ||
if (dict_header.num_values < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are refactoring consider making this check a utility method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.cc
Outdated
@@ -386,6 +386,57 @@ std::shared_ptr<Page> SerializedPageReader::NextPage() { | |||
throw ParquetException("Invalid page header"); | |||
} | |||
|
|||
// Do some checks before trying to decrypt and/or decompress the page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are refactoring, would it make sense to move some of the logic to a helper method to shorten this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.h
Outdated
@@ -115,11 +116,26 @@ class PARQUET_EXPORT PageReader { | |||
bool always_compressed = false, | |||
const CryptoContext* ctx = NULLPTR); | |||
|
|||
// If skip_page_callback_ is set, NextPage() must use this callback to determine if it | |||
// should return or skip and move to the next page. If the callback function returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might want to clarify the contract here further:
- Pages will be called in the order they appear in the column?
- The callback will be called at most once per page?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- set_skip_page_callback is expected to be called at most once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.h
Outdated
@@ -115,11 +116,26 @@ class PARQUET_EXPORT PageReader { | |||
bool always_compressed = false, | |||
const CryptoContext* ctx = NULLPTR); | |||
|
|||
// If skip_page_callback_ is set, NextPage() must use this callback to determine if it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If skip_page_callback_ is set, NextPage() must use this callback to determine if it | |
// If skip_page_callback is present (not null), NextPage() must use this callback to determine if it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cpp/src/parquet/column_reader.h
Outdated
// \note API EXPERIMENTAL | ||
void set_skip_page_callback( | ||
std::function<bool(const DataPageStats&)> skip_page_callback) { | ||
skip_page_callback_ = skip_page_callback; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: std::move ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Actually, nevermind. As long as we have the number of rows (which I think requires v2?) we can just have our callback record the skipped ranges (e.g. skipped from 50-60). That should be sufficient. |
cpp/src/parquet/column_reader.h
Outdated
@@ -55,6 +56,21 @@ static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; | |||
// 16 KB is the default expected page header size | |||
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024; | |||
|
|||
// \brief DataPageStats is a proxy around format::DataPageHeader and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment seems inaccurate since it does not have access to all fields in the thrift page header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.h
Outdated
@@ -55,6 +56,21 @@ static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024; | |||
// 16 KB is the default expected page header size | |||
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024; | |||
|
|||
// \brief DataPageStats is a proxy around format::DataPageHeader and | |||
// format::DataPageHeaderV2. | |||
class PARQUET_EXPORT DataPageStats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simply use a struct and remove the ctor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it a struct, though kept the ctor for convenience of making an instance.
|
||
auto skip_all_pages = [](const DataPageStats& stats) -> bool { return true; }; | ||
|
||
page_reader_->set_data_page_filter(skip_all_pages); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to add an e2e case to implement a callback based on the stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate?
I considered adding a test in column_reader_test, however, the test works with a MockPageReader so I am not sure how much the added test using the mock reader would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this could be done by writing a golden pseudo-random file, and then specifiying a filter that should filter some of the pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it would probably OK to do this as a follow up though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACKed.
Yes, we can filter based on this callback for v1 non-repeated fields, for which the number of values equals the number of rows, and v2 fields. |
This is great, and valuable information that might not be obvious to a parquet novice. Can you add this fact to the comments for this addition somewhere? |
After looking into it more, Encoded statistics is not directly convertible to statistics because it is missing the number of not-null values. We could add a HasNumValues() function to tell when we have that or not, but I find that it makes the Statistics class confusing. I prefer holding off on this until future pull requests in which we will actually be using such conversion. |
That doesn't prevent us from exposing a convenience API to get a |
when there is no encoded stats in the page header.
@pitrou, I added a helper function, can you take a look? Is this what you were thinking? |
Done. Added to the DataPageStats comments. |
@emkornfield, @pitrou, @wgtmac,@westonpace I attempted to address your comments, please take a look. |
cpp/src/parquet/column_reader.h
Outdated
num_rows(num_rows) {} | ||
|
||
// Encoded statistics extracted from the page header. | ||
EncodedStatistics* encoded_statistics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets clarify when this might be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry if I missed it but if we aren't applying business logic to determine when statistics might not be valid, we should clearly document this and how users are expected to verify if they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should perhaps be const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it a const.
@emkornfield, I made it clear in the comments for setting the callback function that the caller is responsible to check that statistics are correct.
cpp/src/parquet/column_reader.h
Outdated
// Encoded statistics extracted from the page header. | ||
EncodedStatistics* encoded_statistics; | ||
// False if there were no encoded statistics in the page header. | ||
bool is_stats_set; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would encoded_stats just be null here? are both needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I removed is_stats_set, and added a comment that encoded_statistics would be nullptr in that case. Also, added a test for it.
cpp/src/parquet/column_reader.h
Outdated
@@ -115,11 +141,27 @@ class PARQUET_EXPORT PageReader { | |||
bool always_compressed = false, | |||
const CryptoContext* ctx = NULLPTR); | |||
|
|||
// If data_page_filter_ is present (not null), NextPage() will call the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If data_page_filter_ is present (not null), NextPage() will call the | |
// If data_page_filter is present (not null), NextPage() will call the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems page stats might be redundant with if set, and I think we need to clarify in the contract to let uses know some pages might not be valid.
Otherwise functionally this seems right to me. Thank you for incorporating all the feedback.
cpp/src/parquet/statistics.h
Outdated
static std::shared_ptr<Statistics> Make( | ||
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics, | ||
::arrow::MemoryPool* pool = ::arrow::default_memory_pool()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's allow passing num_values
optionally?
static std::shared_ptr<Statistics> Make( | |
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics, | |
::arrow::MemoryPool* pool = ::arrow::default_memory_pool()); | |
static std::shared_ptr<Statistics> Make( | |
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics, | |
int64_t num_values = -1, ::arrow::MemoryPool* pool = ::arrow::default_memory_pool()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/metadata.h
Outdated
#include <string> | ||
#include <utility> | ||
#include <vector> | ||
|
||
#include "parquet/platform.h" | ||
#include "parquet/properties.h" | ||
#include "parquet/schema.h" | ||
#include "parquet/statistics.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You said "done" but didn't address this?
cpp/src/parquet/metadata.cc
Outdated
@@ -19,6 +19,8 @@ | |||
|
|||
#include <algorithm> | |||
#include <cinttypes> | |||
#include <csignal> | |||
#include <iostream> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are csignal
and iostream
needed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
||
// Skip pages with even number of values. | ||
auto skip_even_pages = [](const DataPageStats& stats) -> bool { | ||
if (stats.num_values % 2 == 0) return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test something about the page statistics in stats
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I added checks for stats.
this->data_page_header_.statistics.__set_min_value("A"); | ||
this->data_page_header_.statistics.__set_max_value("Z"); | ||
this->data_page_header_.statistics.__set_null_count(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to vary the statistics a bit from page to page?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cpp/src/parquet/column_reader.h
Outdated
num_rows(num_rows) {} | ||
|
||
// Encoded statistics extracted from the page header. | ||
EncodedStatistics* encoded_statistics; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should perhaps be const
?
// If skip_page_callback_ is present (not null), NextPage() will call the | ||
// callback function exactly once per page in the order the pages appear in | ||
// the column. If the callback function returns true the page will be | ||
// skipped. The callback will be called only if the page type is DATA_PAGE or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but this should then skip the page if the returned value is false, not true.
@pitrou could you take another look? |
@emkornfield could you take another look? |
cpp/src/parquet/column_reader.h
Outdated
// For repeated fields, this can be greater than number of rows. For | ||
// non-repeated fields, this will be the same as the number of rows. | ||
int32_t num_values; | ||
// Number of rows stored in the page. std::nullopt for V1 data pages since |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rephrase, is that this might not be available for v1 data pages. It still seems like it is possible to set if page indexes are written for v1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
std::vector<int64_t> read_num_values; | ||
std::vector<std::optional<int32_t>> read_num_rows; | ||
auto read_all_pages = [&](const DataPageStats& stats) -> bool { | ||
DCHECK_NE(stats.encoded_statistics, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ASSERT_NE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot use ASSERT inside the lambda function definition.
// passed using the num_values parameter. | ||
static std::shared_ptr<Statistics> Make( | ||
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics, | ||
int64_t num_values = -1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand why this is optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to add a convenient way to convert an EncodedStatistics to Statistics. However, the encoded statistics misses the number of non-null values that is required for making Statistics. So, I made an optional argument for it.
Please see here: #14603 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from me. @pitrou @westonpace @wgtmac any more concerns?
Apparently you requested some changes? Feel free to merge when you feel everything was addressed. |
Same here, I am content |
Travis CI failures look infrastructure related. going to merge. |
Benchmark runs are scheduled for baseline = 37a7965 and contender = 97998d8. 97998d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Currently, we do not expose the page header metadata and they cannot be used for skipping pages. I propose exposing the metadata through a callback that would allow the caller to decide if they want to read or skip the page based on the metadata.