Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback #14603

Merged
merged 14 commits into from
Jan 12, 2023

Conversation

fatemehp
Copy link
Contributor

@fatemehp fatemehp commented Nov 7, 2022

Currently, we do not expose the page header metadata and they cannot be used for skipping pages. I propose exposing the metadata through a callback that would allow the caller to decide if they want to read or skip the page based on the metadata.

@github-actions
Copy link

github-actions bot commented Nov 7, 2022

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@fatemehp
Copy link
Contributor Author

fatemehp commented Nov 7, 2022

@emkornfield, @pitrou could you take a look at this? Thanks!

@emkornfield
Copy link
Contributor

I think to avoid exposing thrift headers we need an alternate representation of stats. Please take a look at how RowGroup stats are exposed and re-use or model the necessary structures off of that.

@kou kou changed the title [PARQUET-2210]:[cpp] Skip pages based on header metadata using a callback PARQUET-2210: [C++][Parquet] Skip pages based on header metadata using a callback Nov 8, 2022
@github-actions
Copy link

github-actions bot commented Nov 8, 2022

@github-actions
Copy link

github-actions bot commented Nov 8, 2022

⚠️ Ticket has no components in JIRA, make sure you assign one.

@github-actions
Copy link

github-actions bot commented Nov 8, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

bool set_skip_page_callback(
std::function<bool(const format::PageHeader&)> skip_page_callback) override {
skip_page_callback_ = skip_page_callback;
has_skip_page_callback_ = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid the has_skip_page_callback_ flag by simply checking if (skip_page_callback_)

https://en.cppreference.com/w/cpp/utility/functional/function/operator_bool

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -28,6 +28,7 @@
#include "parquet/properties.h"
#include "parquet/schema.h"
#include "parquet/types.h"
#include "parquet/thrift_internal.h" // IWYU pragma: keep
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might fall the CI check due to exposing thrift symbols.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Once we have the header, we will call the skip_page_call_back_ to
// determine if we should be skipping this page. If yes, we will advance the
// stream to the next page.
if(has_skip_page_callback_) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we have done is to let the PageReader be aware of Offset Index belong to the pages of the RowGroup. In this way, it can be smart enough to move among pages and handle the skip logic. The prerequisite is https://issues.apache.org/jira/browse/ARROW-10158. If that sounds good, I can pick up ARROW-10158 to contribute our implementation. WDYT? @fatemehp @emkornfield @pitrou

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we have done is to let the PageReader be aware of Offset Index belong to the pages of the RowGroup

Do you have example code here for what you mean?

I can pick up ARROW-10158 to contribute our implementation.

Is the implementation compatible with the callback approach? If you are willing to contribute it, it seems like it would be valuable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wgtmac. With this pull request, we would like to specifically support the case where the Page Index is NOT available. We still want to be able to skip pages based on the metadata in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we have done is to let the PageReader be aware of Offset Index belong to the pages of the RowGroup

Do you have example code here for what you mean?

I can pick up ARROW-10158 to contribute our implementation.

Is the implementation compatible with the callback approach? If you are willing to contribute it, it seems like it would be valuable.

I agree with what @fatemehp has said. My proposal is orthogonal (and compatible) to the current patch. I can contribute the Page Index implementation and then utilize the Offset Index to skip when available.

@@ -115,6 +116,17 @@ class PARQUET_EXPORT PageReader {
bool always_compressed = false,
const CryptoContext* ctx = NULLPTR);

// @returns: true if the skip callback was successfully set. Returns false
// if the callback was not set or not supported.
// If supported, NextPage() will use this callback to determine if it should
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there contexts where this wouldn't be supported? Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have a concrete example. Since this is an abstract class, I tried to make it a more general case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't have a failure case here in mind, then lets either make the method void (YAGNI) if we think there is a possible error, then Status should be returned. Similar question is does this need to be virtual?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@fatemehp
Copy link
Contributor Author

fatemehp commented Nov 9, 2022

This pull request intends to support skipping pages using the metadata. This is especially useful when page indexes are NOT present. Even so, We still need to think through the case that the page index is present.

Here is what I think. When we have a page index, we will probably have a separate API to skip to a page at a particular offset. If a callback is defined, it would take effect via the NextPage calls. So we skip to a page at an offset, and when we call NextPage, it will return the next eligible page.

format::DataPageHeader and format::DataPageHeaderV2.
@fatemehp
Copy link
Contributor Author

@pitrou, @emkornfield I have updated this pull request and added more tests. Please take a look.


bool Equals(const DataPageStats& other) const;

int32_t num_values() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a notion of statistics (see statistics.h), is there a reason to not use that as a member here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statistics has more functionality/dependency than we need. It is meant for accumulating values and computing the min/max. Also, we need a column descriptor to make it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EncodedStatistics looks almost identical to this though? For normal statistics It looks the like the descriptor is only needed to construct a comparator, I wonder if there is some refactoring that could separate the two? It seems like having a 3rd type of statitistics would add confusion, so I want to make sure there is a strong reason to add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could use EncodedStatistics, it is missing num_rows and num_values though. Should we refactor EncodedStatistics to meet our needs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slightly preference for composing encoded stats into this new object, rather than refactoring. Do you have a preference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this part to use encoded stats. Please take a look.

bool Equals(const DataPageStats& other) const;

int32_t num_values() const;
int32_t num_rows() const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is the consumer supposed to know if this is exact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this throws an exception if isn't data table V2, the user has no way of knowing that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should expose is_v2_ as well. I will do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_rows returns an optional now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, and valuable information that might not be obvious to a parquet novice. Can you add this fact to the comments for this addition somewhere?

I added this to the comments in the DataPageStats class. please take a look.

DataPageStats::Make(&data_page_header);
if (skip_page_callback_(data_page_stats.get())) {
PARQUET_THROW_NOT_OK(stream_->Advance(compressed_len));
return NextPage();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it pay to refactor this a little bit so we can use iteration instead of recursion? (I think it would be good to avoid stack overflow for poorly constructed files that we filter out most pages for)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} else {
data_page_header = current_page_header_.data_page_header_v2;
}
std::unique_ptr<DataPageStats> data_page_stats =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one other thing to consider is whther statistics are correct:

bool ApplicationVersion::HasCorrectStatistics(Type::type col_type,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the caller should check that the stats are valid before setting a callback function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made that clear in the comments for setting the callback function.

seen_num_values_ += header.num_values;
if (skip_page_callback_) {
DataPageStats data_page_stats(page_statistics, header.num_values,
header.num_rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should num_rows be validated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
} else if (page_type == PageType::DATA_PAGE_V2) {
const format::DataPageHeaderV2& header = current_page_header_.data_page_header_v2;
if (header.num_values < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does num_rows have to be validated as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

} else if (page_type == PageType::DICTIONARY_PAGE) {
const format::DictionaryPageHeader& dict_header =
current_page_header_.dictionary_page_header;
if (dict_header.num_values < 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are refactoring consider making this check a utility method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -386,6 +386,57 @@ std::shared_ptr<Page> SerializedPageReader::NextPage() {
throw ParquetException("Invalid page header");
}

// Do some checks before trying to decrypt and/or decompress the page.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are refactoring, would it make sense to move some of the logic to a helper method to shorten this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -115,11 +116,26 @@ class PARQUET_EXPORT PageReader {
bool always_compressed = false,
const CryptoContext* ctx = NULLPTR);

// If skip_page_callback_ is set, NextPage() must use this callback to determine if it
// should return or skip and move to the next page. If the callback function returns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to clarify the contract here further:

  1. Pages will be called in the order they appear in the column?
  2. The callback will be called at most once per page?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. set_skip_page_callback is expected to be called at most once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -115,11 +116,26 @@ class PARQUET_EXPORT PageReader {
bool always_compressed = false,
const CryptoContext* ctx = NULLPTR);

// If skip_page_callback_ is set, NextPage() must use this callback to determine if it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If skip_page_callback_ is set, NextPage() must use this callback to determine if it
// If skip_page_callback is present (not null), NextPage() must use this callback to determine if it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// \note API EXPERIMENTAL
void set_skip_page_callback(
std::function<bool(const DataPageStats&)> skip_page_callback) {
skip_page_callback_ = skip_page_callback;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: std::move ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@westonpace
Copy link
Member

We could end up with something like 70 rows for column0 (because we skipped 3 pages) and 100 rows for column1 and column2. However, we would not know which rows to drop from column1 and column2 (the first 30? the last 30? somewhere in the middle?)

Actually, nevermind. As long as we have the number of rows (which I think requires v2?) we can just have our callback record the skipped ranges (e.g. skipped from 50-60). That should be sufficient.

@@ -55,6 +56,21 @@ static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
// 16 KB is the default expected page header size
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;

// \brief DataPageStats is a proxy around format::DataPageHeader and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment seems inaccurate since it does not have access to all fields in the thrift page header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -55,6 +56,21 @@ static constexpr uint32_t kDefaultMaxPageHeaderSize = 16 * 1024 * 1024;
// 16 KB is the default expected page header size
static constexpr uint32_t kDefaultPageHeaderSize = 16 * 1024;

// \brief DataPageStats is a proxy around format::DataPageHeader and
// format::DataPageHeaderV2.
class PARQUET_EXPORT DataPageStats {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply use a struct and remove the ctor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it a struct, though kept the ctor for convenience of making an instance.


auto skip_all_pages = [](const DataPageStats& stats) -> bool { return true; };

page_reader_->set_data_page_filter(skip_all_pages);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to add an e2e case to implement a callback based on the stats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate?

I considered adding a test in column_reader_test, however, the test works with a MockPageReader so I am not sure how much the added test using the mock reader would be helpful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this could be done by writing a golden pseudo-random file, and then specifiying a filter that should filter some of the pages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would probably OK to do this as a follow up though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACKed.

@fatemehp
Copy link
Contributor Author

fatemehp commented Dec 9, 2022

Actually, nevermind. As long as we have the number of rows (which I think requires v2?) we can just have our callback record the skipped ranges (e.g. skipped from 50-60). That should be sufficient.

Yes, we can filter based on this callback for v1 non-repeated fields, for which the number of values equals the number of rows, and v2 fields.

@westonpace
Copy link
Member

for which the number of values equals the number of rows

This is great, and valuable information that might not be obvious to a parquet novice. Can you add this fact to the comments for this addition somewhere?

@fatemehp
Copy link
Contributor Author

After looking into it more, Encoded statistics is not directly convertible to statistics because it is missing the number of not-null values. We could add a HasNumValues() function to tell when we have that or not, but I find that it makes the Statistics class confusing.

I prefer holding off on this until future pull requests in which we will actually be using such conversion.

@pitrou
Copy link
Member

pitrou commented Dec 12, 2022

After looking into it more, Encoded statistics is not directly convertible to statistics because it is missing the number of not-null values.

That doesn't prevent us from exposing a convenience API to get a Statistics from EncodedStatistics.

@fatemehp
Copy link
Contributor Author

That doesn't prevent us from exposing a convenience API to get a Statistics from EncodedStatistics

@pitrou, I added a helper function, can you take a look? Is this what you were thinking?

@fatemehp
Copy link
Contributor Author

This is great, and valuable information that might not be obvious to a parquet novice. Can you add this fact to the comments for this addition somewhere?

Done. Added to the DataPageStats comments.

@fatemehp
Copy link
Contributor Author

@emkornfield, @pitrou, @wgtmac,@westonpace I attempted to address your comments, please take a look.

num_rows(num_rows) {}

// Encoded statistics extracted from the page header.
EncodedStatistics* encoded_statistics;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets clarify when this might be null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry if I missed it but if we aren't applying business logic to determine when statistics might not be valid, we should clearly document this and how users are expected to verify if they are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should perhaps be const?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it a const.

@emkornfield, I made it clear in the comments for setting the callback function that the caller is responsible to check that statistics are correct.

// Encoded statistics extracted from the page header.
EncodedStatistics* encoded_statistics;
// False if there were no encoded statistics in the page header.
bool is_stats_set;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would encoded_stats just be null here? are both needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed is_stats_set, and added a comment that encoded_statistics would be nullptr in that case. Also, added a test for it.

@@ -115,11 +141,27 @@ class PARQUET_EXPORT PageReader {
bool always_compressed = false,
const CryptoContext* ctx = NULLPTR);

// If data_page_filter_ is present (not null), NextPage() will call the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If data_page_filter_ is present (not null), NextPage() will call the
// If data_page_filter is present (not null), NextPage() will call the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems page stats might be redundant with if set, and I think we need to clarify in the contract to let uses know some pages might not be valid.

Otherwise functionally this seems right to me. Thank you for incorporating all the feedback.

Comment on lines 222 to 224
static std::shared_ptr<Statistics> Make(
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's allow passing num_values optionally?

Suggested change
static std::shared_ptr<Statistics> Make(
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics,
::arrow::MemoryPool* pool = ::arrow::default_memory_pool());
static std::shared_ptr<Statistics> Make(
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics,
int64_t num_values = -1, ::arrow::MemoryPool* pool = ::arrow::default_memory_pool());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include <string>
#include <utility>
#include <vector>

#include "parquet/platform.h"
#include "parquet/properties.h"
#include "parquet/schema.h"
#include "parquet/statistics.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said "done" but didn't address this?

@@ -19,6 +19,8 @@

#include <algorithm>
#include <cinttypes>
#include <csignal>
#include <iostream>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are csignal and iostream needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


// Skip pages with even number of values.
auto skip_even_pages = [](const DataPageStats& stats) -> bool {
if (stats.num_values % 2 == 0) return true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test something about the page statistics in stats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I added checks for stats.

Comment on lines 225 to 227
this->data_page_header_.statistics.__set_min_value("A");
this->data_page_header_.statistics.__set_max_value("Z");
this->data_page_header_.statistics.__set_null_count(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to vary the statistics a bit from page to page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

num_rows(num_rows) {}

// Encoded statistics extracted from the page header.
EncodedStatistics* encoded_statistics;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should perhaps be const?

// If skip_page_callback_ is present (not null), NextPage() will call the
// callback function exactly once per page in the order the pages appear in
// the column. If the callback function returns true the page will be
// skipped. The callback will be called only if the page type is DATA_PAGE or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but this should then skip the page if the returned value is false, not true.

@fatemehp
Copy link
Contributor Author

@pitrou could you take another look?

@fatemehp
Copy link
Contributor Author

@emkornfield could you take another look?

// For repeated fields, this can be greater than number of rows. For
// non-repeated fields, this will be the same as the number of rows.
int32_t num_values;
// Number of rows stored in the page. std::nullopt for V1 data pages since
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rephrase, is that this might not be available for v1 data pages. It still seems like it is possible to set if page indexes are written for v1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

std::vector<int64_t> read_num_values;
std::vector<std::optional<int32_t>> read_num_rows;
auto read_all_pages = [&](const DataPageStats& stats) -> bool {
DCHECK_NE(stats.encoded_statistics, nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASSERT_NE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot use ASSERT inside the lambda function definition.

// passed using the num_values parameter.
static std::shared_ptr<Statistics> Make(
const ColumnDescriptor* descr, const EncodedStatistics* encoded_statistics,
int64_t num_values = -1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why this is optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to add a convenient way to convert an EncodedStatistics to Statistics. However, the encoded statistics misses the number of non-null values that is required for making Statistics. So, I made an optional argument for it.

Please see here: #14603 (comment)

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from me. @pitrou @westonpace @wgtmac any more concerns?

@pitrou
Copy link
Member

pitrou commented Jan 9, 2023

LGTM from me. @pitrou @westonpace @wgtmac any more concerns?

Apparently you requested some changes? Feel free to merge when you feel everything was addressed.

@westonpace
Copy link
Member

Apparently you requested some changes? Feel free to merge when you feel everything was addressed.

Same here, I am content

@emkornfield
Copy link
Contributor

Travis CI failures look infrastructure related. going to merge.

@emkornfield emkornfield merged commit 97998d8 into apache:master Jan 12, 2023
@ursabot
Copy link

ursabot commented Jan 14, 2023

Benchmark runs are scheduled for baseline = 37a7965 and contender = 97998d8. 97998d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.33% ⬆️0.21%] test-mac-arm
[Finished ⬇️0.77% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.06% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 97998d83 ec2-t3-xlarge-us-east-2
[Finished] 97998d83 test-mac-arm
[Finished] 97998d83 ursa-i9-9960x
[Finished] 97998d83 ursa-thinkcentre-m75q
[Finished] 37a79659 ec2-t3-xlarge-us-east-2
[Finished] 37a79659 test-mac-arm
[Finished] 37a79659 ursa-i9-9960x
[Finished] 37a79659 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants