[C++] Reading and Writing RLE data from/to Parquet #32339
Comments
RLE is now called REE.
Should we first implement writing REE as Plain? (RLE is probably the best target encoding in Parquet, but it might need some complex logic.)
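To make the "write REE as Plain" idea concrete, here is a minimal sketch of expanding a run-end encoded column back into plain values before handing it to a writer. It uses plain `std::vector`s rather than Arrow arrays, and `ExpandRunEndEncoded` is a hypothetical helper name, not an Arrow or Parquet API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): expand a run-end encoded
// column into plain values. run_ends[i] is the exclusive logical end index
// of run i, and values[i] is the value repeated throughout that run.
std::vector<int64_t> ExpandRunEndEncoded(const std::vector<int32_t>& run_ends,
                                         const std::vector<int64_t>& values) {
  assert(run_ends.size() == values.size());
  std::vector<int64_t> plain;
  if (run_ends.empty()) return plain;
  // The last run end equals the logical length of the column.
  plain.reserve(static_cast<std::size_t>(run_ends.back()));
  int32_t start = 0;
  for (std::size_t i = 0; i < run_ends.size(); ++i) {
    assert(run_ends[i] > start);  // run ends must be strictly increasing
    plain.insert(plain.end(), static_cast<std::size_t>(run_ends[i] - start),
                 values[i]);
    start = run_ends[i];
  }
  return plain;
}
```

For example, run ends `{3, 5, 9}` with values `{7, 8, 9}` expand to nine plain values. This materializes the whole column, which is why it is only a stop-gap compared with writing runs directly.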
That would be great and a stop-gap solution would be to use If you want a more memory-efficient solution, you can look at the
Yes, ideally we should write REE Arrow arrays as RLE Parquet arrays. It's on the level of complexity of
I'm not familiar with the Parquet format, but AFAIU, it is able to store run-LENGTH encoded data, so reading that directly into a run-END encoded array would require less copying and less memory.
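The difference between the two representations is small: run ends are the running sum of run lengths. A minimal sketch of that conversion, assuming the run lengths have already been extracted from the encoded data (`RunLengthsToRunEnds` is a hypothetical name, not an Arrow or Parquet function):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): convert run lengths into
// run ends by accumulating a prefix sum. The values of the runs are shared
// between both representations and need no conversion.
std::vector<int32_t> RunLengthsToRunEnds(const std::vector<int32_t>& run_lengths) {
  std::vector<int32_t> run_ends;
  run_ends.reserve(run_lengths.size());
  int32_t end = 0;
  for (int32_t length : run_lengths) {
    assert(length > 0);  // empty runs are not representable as run ends
    end += length;
    run_ends.push_back(end);
  }
  return run_ends;
}
```

Run lengths `{3, 2, 4}` become run ends `{3, 5, 9}`. Only one integer per run is touched, rather than one value per logical row, which is where the saving in copying and memory would come from.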
I think it will take some time to implement.
Maybe we could first implement (1), then treat (2) as an optimization, @felipecrv?
Yes. Sounds like a good plan. This is the list of types supported by arrow/cpp/src/arrow/compute/kernels/vector_run_end_encode.cc Lines 591 to 611 in 99dd998
@jsjtxietian would you like to work on this?
OK, I will take a look this week.
@jsjtxietian You can take a look at
Parquet already supports RLE data. So this would involve adding a new (optional) code path that this data can be directly read into the new Arrow RLE arrays, without decoding it.
Reporter: Tobias Zagorni / @zagto
Note: This issue was originally created as ARROW-17028. Please see the migration documentation for further details.