Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Turn all Binary/Utf8 into BinaryView/Utf8View in Parquet #18331

Merged

Conversation

coastalwhite
Copy link
Collaborator

No description provided.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Aug 23, 2024
@ritchie46
Copy link
Member

Can we eventually completely go to binview you think?

Copy link

codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 84.92063% with 57 lines in your changes missing coverage. Please review.

Project coverage is 79.88%. Comparing base (c579721) to head (9dbf0ba).
Report is 60 commits behind head on main.

Files with missing lines Patch % Lines
...lars-parquet/src/arrow/read/deserialize/binview.rs 86.97% 34 Missing ⚠️
...et/src/arrow/read/deserialize/fixed_size_binary.rs 16.66% 5 Missing ⚠️
...rs-parquet/src/arrow/read/deserialize/utils/mod.rs 75.00% 5 Missing ⚠️
crates/polars-io/src/parquet/read/read_impl.rs 50.00% 4 Missing ⚠️
...olars-parquet/src/arrow/read/deserialize/simple.rs 0.00% 3 Missing ⚠️
crates/polars-arrow/src/pushable.rs 0.00% 2 Missing ⚠️
...olars-parquet/src/arrow/read/deserialize/nested.rs 0.00% 2 Missing ⚠️
.../polars-parquet/src/arrow/read/deserialize/null.rs 0.00% 1 Missing ⚠️
...quet/src/arrow/read/deserialize/primitive/basic.rs 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18331      +/-   ##
==========================================
+ Coverage   79.78%   79.88%   +0.09%     
==========================================
  Files        1497     1495       -2     
  Lines      200424   200199     -225     
  Branches     2844     2867      +23     
==========================================
+ Hits       159916   159925       +9     
+ Misses      39983    39728     -255     
- Partials      525      546      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coastalwhite
Copy link
Collaborator Author

Can we eventually completely go to binview you think?

I don't necessarily see a reason why not

@ritchie46
Copy link
Member

That would save unneeded casts and remove a lot of code.

@coastalwhite
Copy link
Collaborator Author

coastalwhite commented Aug 26, 2024

I converted everything now to use the BinView. This led me down a bit of a rabbit hole, because the BinView::total_bytes_len / total_buffer_len are not always correctly set (also by my previous code). There are still some places outside my code where this is the case.

@coastalwhite coastalwhite changed the title refactor: Always cast Utf8/Binary to Large in Parquet refactor: Turn all Binary/Utf8 to BinaryView/Utf8View in Parquet Aug 26, 2024
@coastalwhite coastalwhite changed the title refactor: Turn all Binary/Utf8 to BinaryView/Utf8View in Parquet refactor: Turn all Binary/Utf8 into BinaryView/Utf8View in Parquet Aug 26, 2024
// Verify the invariants
#[cfg(debug_assertions)]
{
// @TODO: Enable this. This is currently bugged with concatenate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is our information incorrect? Where does this happen? Should I fix this is in a seperate PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is happening in the legacy::concatenate kernel.

)?
.collect_n(filter)?
},
// These are all converted to View variants before.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -618,6 +651,9 @@ impl MutableBinaryViewArray<[u8]> {
let buffer_idx = self.completed_buffers().len() as u32;
let in_progress_buffer_offset = self.in_progress_buffer.len();

self.total_bytes_len += sum_length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should total_bytes_len not be added outside of the branch?

@@ -639,6 +675,8 @@ impl MutableBinaryViewArray<[u8]> {
view
}));
} else if max_length <= View::MAX_INLINE_SIZE as usize {
self.total_bytes_len += sum_length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see you do that here. Maybe we can do it only once before the branches.

@coastalwhite coastalwhite force-pushed the refactor-always-convert-to-large branch from 7127577 to 9dbf0ba Compare August 27, 2024 08:18
@coastalwhite coastalwhite merged commit 884b2ac into pola-rs:main Aug 27, 2024
20 checks passed
@coastalwhite coastalwhite deleted the refactor-always-convert-to-large branch August 27, 2024 08:42
@c-peters c-peters added the accepted Ready for implementation label Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants