
feat: update writers to include compression method in file name #1431

Merged
merged 6 commits on Jun 7, 2023

Conversation

Blajda
Collaborator

@Blajda Blajda commented Jun 3, 2023

Description

The compression name in the file name was hard-coded to snappy; however, users can now specify their own compression methods, which would cause a disconnect between the name and the method actually used.

The naming convention is used as a hint by Spark and Hive to create the correct reader without having to read the header of the file.
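
For illustration, here is a minimal sketch (in the spirit of the PR, not its exact code) of how a codec could map to the suffix embedded in the file name. The Compression enum below is a hypothetical stand-in for the parquet crate's codec type, and the suffix strings are assumptions modeled on Spark's convention (e.g. part-00000-<uuid>-c000.snappy.parquet):

// Minimal sketch: map a compression codec to the suffix embedded in the
// Parquet file name. This enum is a hypothetical stand-in for the parquet
// crate's Compression type; the suffixes mirror Spark's codec short names.
enum Compression {
    Uncompressed,
    Snappy,
    Gzip,
    Brotli,
    Lz4,
    Zstd,
}

fn compression_to_str(compression: &Compression) -> &'static str {
    match compression {
        // Uncompressed files carry no codec suffix: the name is plain `.parquet`.
        Compression::Uncompressed => "",
        Compression::Snappy => ".snappy",
        Compression::Gzip => ".gz",
        Compression::Brotli => ".br",
        Compression::Lz4 => ".lz4",
        Compression::Zstd => ".zstd",
    }
}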

@github-actions github-actions bot added the binding/rust and rust labels Jun 3, 2023
@wjones127
Collaborator

Thanks for fixing this. 😄

The naming convention is used as a hint by Spark and Hive to create the correct reader without having to read the header of the file.

I'd like to know more about how they use this. Each column in a Parquet file is allowed to have a different compression, so the Java readers need to be ready to use any compression decoder anyway. If there's not much benefit, it seems like it would be better to just write out a plain .parquet file, right? Is there any code or docs we can point to where Spark and Hive use this?

@Blajda Blajda marked this pull request as ready for review June 4, 2023 00:26
@Blajda
Collaborator Author

Blajda commented Jun 4, 2023

@houqp, since you provided the original insight in the other PR, can you help direct us to the source of that claim?

rust/src/writer/utils.rs (outdated)
Comment on lines +111 to +117
// TODO: what does c000 mean?
let file_name = format!(
    "part-{}-{}-c000{}.parquet",
    part,
    writer_id,
    compression_to_str(&compression)
);
Collaborator

We seem to be trying to copy Spark. This information is mostly added to make debugging easier. For example, they have a writer id because they might want to see all the files written by a particular node in a Spark cluster. But we don't have nodes or retries like Spark does. So I think we can have our own convention.

At a minimum, I think all we need is {uuid}.parquet. This is just to make sure we don't have collisions between files. Adding the compression codec can be nice for debugging. If we find more things that are useful to delta-rs, we can consider adding them too.
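
As a hedged sketch of that minimal convention, assuming the uuid crate (minimal_file_name is a hypothetical helper, not delta-rs code):

use uuid::Uuid;

// Hypothetical helper for the suggested minimal convention: a random UUID
// prevents collisions between files, and the codec suffix is optional
// debugging information, e.g. minimal_file_name(".snappy").
fn minimal_file_name(codec_suffix: &str) -> String {
    format!("{}{}.parquet", Uuid::new_v4(), codec_suffix)
}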

Collaborator Author

Thanks for locating the Spark source. I tried searching myself but only found the mapping for ORC files.
I left some of the comments / TODOs that were scattered in the implementation in case we want to completely copy the Spark approach. Since all writers now use this function, it should be easier in the future to adopt our own convention, as you mentioned.

@wjones127 wjones127 enabled auto-merge (squash) June 7, 2023 01:48
@wjones127 wjones127 merged commit 197db59 into delta-io:main Jun 7, 2023