Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1482: [C++] Add branch to TypedRecordReader::ReadNewPage for … #3312

Closed
wants to merge 3 commits into from

Conversation

rdmello
Copy link
Contributor

@rdmello rdmello commented Jan 5, 2019

…PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.

Tests

This commit doesn't include tests right now; I am working on adding tests and was hoping for some initial feedback on the code changes. I may need to use an actual file generated by parquetjs to test this issue, so I wonder if adding feeds1kMicros.parquet from the JIRA task to the parquet-testing repository is an option for this.

Description

parquetjs seems to be writing Parquet V2 files with DataPageV2 pages, while parquet-cpp writes Parquet V2 files with DataPage pages.

Since TypedRecordReader::ReadNewPage() only had a branch for PageType::DATA_PAGE, the reader would return without reading any data for records that have DATA_PAGE_V2 pages. This explains the behavior observed in PARQUET-1482.

This commit adds a new if-else branch for the DataPageV2 case in TypedRecordReader::ReadNewPage(). Since the DataPageV2 branch needed to reuse the code from the DataPage case, I refactored the repetition/definition level decoder initialization and the data decoder initialization to two new methods in the TypedRecordReader class. These new methods are now called by the DataPage and DataPageV2 initialization branches in TypedRecordReader::ReadNewPage().

There is an alternate implementation possible (with a smaller diff) by sharing the same else-if branch between DataPage and DataPageV2 using a pointer-to-derived shared_ptr<Page>. However, since the Page superclass doesn't have the necessary encoding() or num_values() methods, I would need to add a common superclass to both DataPage and DataPageV2 that defined these methods. I didn't do this because I was hesitant to modify the Page class hierarchy for this commit.

@wesm
Copy link
Member

wesm commented Jan 5, 2019

The parquetjs project should probably be warned about cross-compatibility -- V2 data pages are likely unreadable in a number of places.

@rdmello
Copy link
Contributor Author

rdmello commented Jan 7, 2019

Hi Wes, thanks for responding on this! I just opened an issue with the parquetjs project to make them aware of this too: ironSource/parquetjs#78 .

I've noticed the same issue (from PARQUET-1482) occurring with files written using the ParquetFileWriter class in parquet-mr too (through the MATLAB interface for Apache Parquet).

I can't reproduce this using the newest release of parquet-mr (1.11.0), but I think a previous version of it (1.9.0) would also default to writing Parquet V2 files with DataPageV2 pages. Therefore this may be more widespread than only files written using the parquetjs library.

I see that there are Travis CI build failures, but these seem to be unrelated to the changes in this commit. I was thinking of pulling in changes from apache:master after the build is passing, rebasing the parquet_1482 branch, and then re-running the CI, but let me know if there's a better way to do this.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in general happy with the changes but we definitely should have a test for this code. You could add probably in the most simple form to https://github.com/apache/arrow/blob/master/cpp/src/parquet/file-deserialize-test.cc

@wesm
Copy link
Member

wesm commented Jan 29, 2019

I agree that a unit test would be a good idea. When we were building this library initially I tried to make the file deserialization internals a bit more accessible to unit testing, otherwise triggering and testing various code paths would be a lot more difficult

@rdmello
Copy link
Contributor Author

rdmello commented Jan 29, 2019

Hi Uwe and Wes, thank you for the recommendations. I added a test in file-deserialize-test.cc that exercises the DataPageV2 serialization and deserialization code path in the latest version of the pull request. Please let me know if you have any feedback on this.

My understanding is that the new test will not directly test the changes to record_reader.cc that I've made. To do a round-trip test of DataPageV2 workflows, the Arrow interface to Parquet will probably need a way to write Parquet files with DataPageV2 page headers. I couldn't find a way to do this directly using the Arrow-based interface for Parquet, but let me know if I'm missing something here.

I was thinking of creating a new JIRA story for writing DataPageV2 headers using the Arrow interface for Parquet, but I found PARQUET-458 that seems to be similar. Is it alright if I assign this JIRA story to myself?

@xhochy
Copy link
Member

xhochy commented Jan 30, 2019

@rdmello Feel free to assign PARQUET-458 to yourself.

To enable writing V2 pages in the Arrow, you should add an option https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L142-L433 to switch between V1 and V2 pages. This option needs then be passed through the layers where the page is actually created.

@hatemhelal
Copy link
Contributor

I've gone through these changes and they look good to me.

Would it be ok to accept this PR and follow up with a new PR to respond to the feedback? @rdmello could use PARQUET-458 to complete the write implementation and add the suggested tests.

@xhochy and @wesm, thoughts?

@wesm
Copy link
Member

wesm commented Feb 28, 2019

Sorry for the delay, will review this so it gets into 0.13.0

rdmello and others added 2 commits March 6, 2019 10:48
…PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.

Tests

This commit doesn't include tests; I am working on them now. I may need to use an actual file generated by parquetjs to test this issue, so I wonder if adding feeds1kMicros.parquet from the JIRA task to the parquet-testing repository is an option.

Description

parquetjs seems to be writing Parquet V2 files with DataPageV2 pages, while parquet-cpp writes Parquet V2 files with DataPage pages.

Since TypedRecordReader::ReadNewPage() only had a branch for PageType::DATA_PAGE, the reader would return without reading any data for records that have DATA_PAGE_V2 pages. This explains the behavior observed in PARQUET-1482.

This commit adds a new if-else branch for the DataPageV2 case in TypedRecordReader::ReadNewPage(). Since the DataPageV2 branch needed to reuse the code from the DataPage case, I refactored the repetition/definition level decoder initialization and the data decoder initialization to two new methods in the TypedRecordReader class. These new methods are now called by the DataPage and DataPageV2 initialization branches in TypedRecordReader::ReadNewPage().

There is an alternate implementation possible (with a smaller diff) by sharing the same else-if branch between DataPage and DataPageV2 using a pointer-to-derived shared_ptr<Page>. However, since the Page superclass doesn't have the necessary encoding() or num_values() methods, I would need to add a common superclass to both DataPage and DataPageV2 that defined these methods. I didn't do this because I was hesitant to modify the Page class hierarchy for this commit.
@wesm
Copy link
Member

wesm commented Mar 6, 2019

Rebased. Have you reported an issue into upstream parquetjs?

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. The fact that DataPage and DataPageV2 do not have a common base seems like an eyesore to me. I'm going to quickly fix that

template <typename PageType>
int64_t InitializeLevelDecoders(const std::shared_ptr<PageType> page,
const Encoding::type repetition_level_encoding,
const Encoding::type definition_level_encoding);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const is not needed with Encoding::type


template <typename PageType>
void InitializeDataDecoder(const std::shared_ptr<PageType> page,
const int64_t levels_bytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const not needed with int64

// Have not decoded any values from the data page yet
num_decoded_values_ = 0;

const uint8_t* buffer = page->data();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only data() and num_values() are used here. We should have a common base class for DataPageV1 and DataPageV2 instead of having this template

void TypedRecordReader<DType>::InitializeDataDecoder(const std::shared_ptr<PageType> page,
const int64_t levels_byte_size) {
const uint8_t* buffer = page->data() + levels_byte_size;
const int64_t data_size = page->size() - levels_byte_size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

@rdmello
Copy link
Contributor Author

rdmello commented Mar 6, 2019

Hi Wes, thanks for rebasing the code and providing feedback. Yes, I did report this to parquetjs earlier: ironSource/parquetjs#78

I've looked through some of the parquetjs code base, and my understanding is that there is a way to provide an option to the ParquetWriter class in parquetjs that would write DataPageV1 pages. However the default behavior for parquetjs is still to write DataPageV2 pages.

@wesm
Copy link
Member

wesm commented Mar 6, 2019

+1. Will merge once the build looks good

@wesm
Copy link
Member

wesm commented Mar 6, 2019

Appveyor is about half passed. Merging

@wesm wesm closed this in c16eccb Mar 6, 2019
wesm pushed a commit that referenced this pull request Mar 6, 2019
…PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.

**Tests**

This commit doesn't include tests right now; I am working on adding tests and was hoping for some initial feedback on the code changes. I may need to use an actual file generated by `parquetjs` to test this issue, so I wonder if adding `feeds1kMicros.parquet` from the JIRA task to the parquet-testing repository is an option for this.

**Description**

`parquetjs` seems to be writing Parquet V2 files with [`DataPageV2`](https://github.com/apache/parquet-format/blob/e93dd628d90aa076745558998f0bf5d9c262bf22/src/main/thrift/parquet.thrift#L529) pages, while `parquet-cpp` writes Parquet V2 files with [`DataPage`](https://github.com/apache/parquet-format/blob/e93dd628d90aa076745558998f0bf5d9c262bf22/src/main/thrift/parquet.thrift#L492) pages.

Since `TypedRecordReader::ReadNewPage()` only had a branch for `PageType::DATA_PAGE`, the reader would return without reading any data for records that have `DATA_PAGE_V2` pages. This explains the behavior observed in [PARQUET-1482](https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1482?filter=allopenissues).

This commit adds a new if-else branch for the `DataPageV2` case in `TypedRecordReader::ReadNewPage()`. Since the `DataPageV2` branch needed to reuse the code from the `DataPage` case, I refactored the repetition/definition level decoder initialization and the data decoder initialization to two new methods in the `TypedRecordReader` class. These new methods are now called by the `DataPage` and `DataPageV2` initialization branches in `TypedRecordReader::ReadNewPage()`.

There is an alternate implementation possible (with a smaller diff) by sharing the same else-if branch between `DataPage` and `DataPageV2` using a pointer-to-derived `shared_ptr<Page>`. However, since the Page superclass doesn't have the necessary `encoding()` or `num_values()` methods, I would need to add a common superclass to both `DataPage` and `DataPageV2` that defined these methods. I didn't do this because I was hesitant to modify the `Page` class hierarchy for this commit.

Author: Wes McKinney <wesm+git@apache.org>
Author: rdmello <rylan.dmello@mathworks.com>
Author: Rylan Dmello <rdmello@users.noreply.github.com>

Closes #3312 from rdmello/parquet_1482 and squashes the following commits:

c5cb0f3 <Wes McKinney> Add DataPage base class for DataPageV1 and DataPageV2
8df8328 <rdmello> PARQUET-1482:  Adding basic unit test for DataPageV2 serialization and deserialization.
9df3222 <Rylan Dmello> PARQUET-1482:  Add branch to TypedRecordReader::ReadNewPage for PageType::DATA_PAGE_V2 to address incompatibility with parquetjs.
@kevingurney kevingurney deleted the parquet_1482 branch February 7, 2022 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants