[C++][Parquet] Unable to read data from parquet file generated with parquetjs #42868

Closed
asfimport opened this issue Dec 21, 2018 · 10 comments

@asfimport
Collaborator

asfimport commented Dec 21, 2018

See the attached file. When I debug:

% ./parquet-reader feed1kMicros.parquet

I see that the scanner->HasNext() always returns false.

Reporter: Hatem Helal / @hatemhelal
Assignee: Rylan Dmello / @rdmello

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as PARQUET-1482. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Hatem Helal / @hatemhelal:
I think this is a problem in parquet-cpp since I've confirmed that parquet-tools can read this file.

@asfimport
Collaborator Author

Hatem Helal / @hatemhelal:
@wesm, my colleague @rdmello is working on a fix for this. Could you help us out by adding him as a contributor on this project? Thanks!

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Done

@asfimport
Collaborator Author

Wes McKinney / @wesm:
Issue resolved by pull request 3312
#3312

@asfimport
Collaborator Author

Tera G:
Hi Everyone,

I see that this fix was made in Arrow's record reader (record_reader.cc). I am using Parquet's low-level API to pull data from Parquet files in my application.

I am facing the exact problem fixed by this Jira issue while using Parquet's low-level API (column_reader.cc).

Since the current fix has not been ported to the low-level Parquet API, I wanted to know whether there are any plans to ship these changes to the low-level API.

Also, @rdmello, can I simply port the fixes you made to the Parquet low-level API? Will this work?

We are using the low-level API because it gives us more control over predicate pushdown, filtering, and skipping of data.

Finally, does the open source community advise developers to use Arrow's Parquet API or the low-level Parquet API to access Parquet data?

Thank you in advance for your response. 

@asfimport
Collaborator Author

Rylan Dmello / @rdmello:
Hi [~terag], I haven't looked at implementing these changes with the low-level API yet. I see that "column_reader.cc" has a similar TypedRecordReader method as "record_reader.cc", and that there's a similar conditional statement there that excludes DATA_PAGE_V2 pages.

I'm not super familiar with the low-level API, but I think a similar set of changes might work for fixing this issue with the low-level API too. If you already have code that fixes this, I'd recommend sending in a pull request for this. Otherwise I can take a closer look at porting this fix to the low-level API tomorrow.

@asfimport
Collaborator Author

Tera G:
Hi @rdmello

Thank you so much for your quick response. 

No, we have not started making those changes yet. I would really appreciate it if you could make them.

Thanks again.

 

@asfimport
Collaborator Author

Tera G:
Hi @rdmello

Did you get time to look into the problem?

Thanks.

@asfimport
Collaborator Author

Rylan Dmello / @rdmello:
Hi [~terag], sorry, I did take a look at this, but didn't really have the time to resolve this over the last few weeks.

I just opened a new Jira issue to add basic DataPageV2 support to the low-level API: https://issues.apache.org/jira/browse/PARQUET-1560 . I can add updates to that issue instead of this one, since this is already resolved.

I couldn't easily reproduce the issue when using the low-level API to read the 'feeds1kMicros.parquet' file generated by parquetjs. Either this has already been fixed in arrow/master, or I might need to dig deeper to understand the problem. Do you possibly have an example parquet file which isn't readable with the low-level API? If so, feel free to attach it to the new Jira issue I linked.

@asfimport
Collaborator Author

Tera G:
Hi @rdmello

Sorry for the delayed response. I was on vacation for the last 2 weeks.

I have attached the v2 file to the PARQUET-1560 Jira issue.
