[C++] Reading and Writing RLE data from/to Parquet #32339

Open
Tracked by #32104
asfimport opened this issue Jul 9, 2022 · 8 comments

Comments

@asfimport
Collaborator

Parquet already supports RLE data, so this would involve adding a new (optional) code path through which this data can be read directly into the new Arrow RLE arrays without decoding it.

Reporter: Tobias Zagorni / @zagto

Note: This issue was originally created as ARROW-17028. Please see the migration documentation for further details.

@felipecrv
Contributor

RLE is now called REE.

@mapleFU
Member

mapleFU commented Jun 15, 2023

Should we first implement writing REE to Plain?
And when reading from Parquet, what can be used to build the REE Array? And how can we benefit from it?

(Maybe the best encoding is RLE in Parquet, but it might need some complex logic.)

@felipecrv
Contributor

felipecrv commented Jun 15, 2023

> Should we first implement REE write to Plain?

That would be great. A stop-gap solution would be to use RunEndDecode: expand the array in memory first, then write it to Parquet.

If you want a more memory-efficient solution, you can look at the RunEndDecode code and write an adaptation that streams the data into a plain Parquet array, in case the Parquet schema requires it to be plain for some reason.
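A minimal sketch of the streaming idea, assuming the run ends and values are already available as plain vectors. `StreamRunEndDecode` and its `sink` callback are hypothetical names for illustration, not Arrow or Parquet APIs; in a real writer the sink would feed a column encoder.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper (not an Arrow or Parquet API): expands run-end encoded
// data in bounded chunks and hands each chunk to `sink`, so at most
// `chunk_size` decoded values are held in memory at any time.
void StreamRunEndDecode(
    const std::vector<int32_t>& run_ends, const std::vector<int64_t>& values,
    std::size_t chunk_size,
    const std::function<void(const std::vector<int64_t>&)>& sink) {
  if (chunk_size == 0) return;  // nothing sensible to do with empty chunks
  std::vector<int64_t> chunk;
  chunk.reserve(chunk_size);
  int32_t position = 0;  // logical index of the next value to emit
  for (std::size_t run = 0; run < run_ends.size(); ++run) {
    // run_ends[run] is the exclusive logical end of this run.
    while (position < run_ends[run]) {
      const std::size_t room = chunk_size - chunk.size();
      const std::size_t remaining =
          static_cast<std::size_t>(run_ends[run] - position);
      const std::size_t n = std::min(room, remaining);
      chunk.insert(chunk.end(), n, values[run]);
      position += static_cast<int32_t>(n);
      if (chunk.size() == chunk_size) {
        sink(chunk);  // hand off a full chunk, then reuse the buffer
        chunk.clear();
      }
    }
  }
  if (!chunk.empty()) sink(chunk);  // flush the final partial chunk
}
```

The chunk size bounds peak memory regardless of how long individual runs are, which is the main difference from decoding the whole array up front.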

> (Maybe the best encoding is RLE in parquet, but it might need some complex logic)

Yes, ideally we should write REE Arrow arrays as RLE Parquet arrays. The complexity is comparable to that of RunEndDecode, and I can help you extract common code from there if we find opportunities for reuse.
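As a rough illustration of what that mapping could look like, here is a sketch that turns run ends into RLE runs of Parquet's RLE/bit-packing hybrid encoding. It relies on my reading of the Parquet format specification: each RLE run is a varint header holding `run_length << 1`, followed by the repeated value in `ceil(bit_width / 8)` little-endian bytes. `EncodeRleRuns` and `AppendVarint` are hypothetical helpers, and the sketch ignores bit-packed runs, any length prefix, and null handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends `v` as an unsigned LEB128 varint.
void AppendVarint(uint64_t v, std::vector<uint8_t>* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<uint8_t>(v));
}

// Hypothetical helper: emits one RLE run per REE run. Assumes run_ends is
// strictly increasing and that every value fits in `bit_width` bits.
std::vector<uint8_t> EncodeRleRuns(const std::vector<int32_t>& run_ends,
                                   const std::vector<uint32_t>& values,
                                   int bit_width) {
  const int value_bytes = (bit_width + 7) / 8;  // round up to whole bytes
  std::vector<uint8_t> out;
  int32_t previous_end = 0;
  for (std::size_t i = 0; i < run_ends.size(); ++i) {
    // A run length is the difference between consecutive run ends.
    const uint64_t run_length =
        static_cast<uint64_t>(run_ends[i] - previous_end);
    previous_end = run_ends[i];
    AppendVarint(run_length << 1, &out);  // low bit 0 marks an RLE run
    for (int b = 0; b < value_bytes; ++b) {
      out.push_back(static_cast<uint8_t>(values[i] >> (8 * b)));
    }
  }
  return out;
}
```

A production encoder would likely switch to bit-packed runs for short runs, which is part of what makes the direct path harder than this sketch suggests.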

> And when reading from Parquet, what can be used to build the REE Array? And how can we benefit from it?

I'm not familiar with the Parquet format, but AFAIU, it is able to store run-LENGTH encoded data, so reading that directly into a run-END encoded array would require less copying and less memory.
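To make the run-LENGTH versus run-END distinction concrete, here is a small sketch: run ends are the running sum of run lengths, so the conversion is a single pass over the runs rather than over the expanded values. `RunEndEncoded` and `FromRunLengths` are hypothetical names, and the sketch assumes both inputs have the same size and that the total length fits in `int32_t`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two children of a run-end encoded array.
struct RunEndEncoded {
  std::vector<int32_t> run_ends;  // exclusive, cumulative end of each run
  std::vector<int64_t> values;    // one value per run
};

// Hypothetical helper: converts (value, run length) pairs, as an RLE decoder
// might yield them, into run ends. Zero-length runs are dropped because run
// ends must be strictly increasing.
RunEndEncoded FromRunLengths(const std::vector<int64_t>& run_values,
                             const std::vector<int32_t>& run_lengths) {
  RunEndEncoded out;
  int32_t end = 0;
  for (std::size_t i = 0; i < run_lengths.size(); ++i) {
    if (run_lengths[i] <= 0) continue;  // skip empty runs
    end += run_lengths[i];              // running sum gives the run end
    out.run_ends.push_back(end);
    out.values.push_back(run_values[i]);
  }
  return out;
}
```

For example, values `{7, 9}` with lengths `{2, 3}` become run ends `{2, 5}`: no per-element copying is needed, only one entry per run.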

@mapleFU
Member

mapleFU commented Jun 15, 2023

I think it will take some time to implement.

  1. Naive implementation: suitable for all kinds of encoders. The writer decodes the runs the way RunEndDecode does and Puts the values into the encoder buffer.
  2. Special implementation for dict: for non-boolean types (I don't know whether string or binary is supported), REE can be optimized for the dictionary encoder.
  3. Special implementation for RLE (RLE / bit-pack): I guess it can be implemented, but currently I think it's a bit tricky.

Maybe we could first implement (1), then treat (2) as an optimization, @felipecrv?
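One way option (2) could pay off, sketched under assumptions: with REE input, the dictionary lookup can be done once per run instead of once per value, and the resulting indices already come grouped into runs. `DictEncodedRuns` and `DictEncodeRuns` are hypothetical names for illustration and do not reflect the actual Parquet dictionary encoder interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type: a dictionary plus one index and length per run.
struct DictEncodedRuns {
  std::vector<std::string> dictionary;  // distinct values, first-seen order
  std::vector<int32_t> indices;         // dictionary index of each run
  std::vector<int32_t> run_lengths;     // number of repeats of each run
};

// Hypothetical helper: dictionary-encodes REE data with one hash lookup per
// run. Assumes run_ends is strictly increasing and matches values in size.
DictEncodedRuns DictEncodeRuns(const std::vector<int32_t>& run_ends,
                               const std::vector<std::string>& values) {
  DictEncodedRuns out;
  std::unordered_map<std::string, int32_t> index_of;
  int32_t previous_end = 0;
  for (std::size_t i = 0; i < run_ends.size(); ++i) {
    auto it = index_of.find(values[i]);
    if (it == index_of.end()) {
      // First time this value appears: assign the next dictionary slot.
      const int32_t next = static_cast<int32_t>(out.dictionary.size());
      it = index_of.emplace(values[i], next).first;
      out.dictionary.push_back(values[i]);
    }
    out.indices.push_back(it->second);
    out.run_lengths.push_back(run_ends[i] - previous_end);
    previous_end = run_ends[i];
  }
  return out;
}
```

The naive path in (1) would instead hash every expanded value, so the saving grows with the average run length.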

@felipecrv
Contributor

felipecrv commented Jun 15, 2023

> Maybe we could first implement (1), then treat (2) as an optimization, @felipecrv?

Yes. Sounds like a good plan.

This is the list of types supported by RunEndEncode and RunEndDecode so far. It doesn't include all nested types at the moment as I figured those would be more likely to be dict-encoded when compressed.

add_kernel(Type::NA);
add_kernel(Type::BOOL);
for (const auto& ty : NumericTypes()) {
  add_kernel(ty->id());
}
add_kernel(Type::DATE32);
add_kernel(Type::DATE64);
add_kernel(Type::TIME32);
add_kernel(Type::TIME64);
add_kernel(Type::TIMESTAMP);
add_kernel(Type::DURATION);
for (const auto& ty : IntervalTypes()) {
  add_kernel(ty->id());
}
add_kernel(Type::DECIMAL128);
add_kernel(Type::DECIMAL256);
add_kernel(Type::FIXED_SIZE_BINARY);
add_kernel(Type::STRING);
add_kernel(Type::BINARY);
add_kernel(Type::LARGE_STRING);
add_kernel(Type::LARGE_BINARY);

@mapleFU
Member

mapleFU commented Jul 4, 2023

@jsjtxietian would you like to work on this?

@jsjtxietian
Contributor

> @jsjtxietian would you like to work on this?

OK, I will take a look this week.

@mapleFU
Member

mapleFU commented Jul 10, 2023

@jsjtxietian You can take a look at TypedColumnWriterImpl<DType>::WriteArrowDictionary and do something similar. If you run into any problems, feel free to ask me.
