Support Dictionary encoding for more types. #531

Open
EamonHetherton opened this issue Jul 9, 2024 · 0 comments
@EamonHetherton
Contributor

Issue description

Currently, only string columns are considered for dictionary encoding. A lot of the data I work with has very high repetition in columns of other data types (mostly int, decimal and datetime). I did a small spike to investigate the benefit of dictionary encoding these, and the results were very encouraging: typically around a 50x reduction in size when not using any compression. Compression narrows the gap somewhat, but even then the Snappy-compressed version ended up 5x smaller with the additional types dictionary encoded.
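To make the idea concrete, here is a minimal sketch of dictionary encoding applied to a non-string column. It is not this library's implementation; the function names `dictionary_encode` and `dictionary_decode` and the sample data are hypothetical, chosen only for illustration. Each distinct value is stored once, and the column itself becomes a list of small integer indices into that dictionary.

```python
from decimal import Decimal


def dictionary_encode(values):
    """Return (dictionary, indices) for any column of hashable values.

    Hypothetical helper: distinct values are stored once, in first-seen
    order, and each row is replaced by its index into that list.
    """
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index in `dictionary`
    indices = []      # one small integer per row
    for value in values:
        index = positions.get(value)
        if index is None:
            index = len(dictionary)
            positions[value] = index
            dictionary.append(value)
        indices.append(index)
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Illustrative data only: a decimal column with heavy repetition.
prices = [Decimal("9.99"), Decimal("9.99"), Decimal("19.50"), Decimal("9.99")]
price_dict, price_idx = dictionary_encode(prices)
```

Nothing in the sketch depends on the values being strings; the same code works for int or datetime columns, because it only needs values that can be hashed and compared for equality.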

Of particular interest to me is the decimal data type, which takes 16 bytes per value in PLAIN encoding. In a lot of my cases there are fewer than 30,000 distinct values, so even in the degenerate case where every run has length 1, this would be only 3 bytes per value.
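The 3-byte figure can be checked with a little arithmetic, assuming the dictionary indices are written as RLE runs in Parquet's RLE/bit-packed hybrid encoding: each run is a varint header followed by the repeated value padded to whole bytes. The helper `rle_run_bytes` below is a hypothetical name for this sketch, not part of any library, and it ignores the one-off cost of the dictionary page itself.

```python
PLAIN_DECIMAL_BYTES = 16  # per-value size of the decimal column quoted above


def rle_run_bytes(distinct_values: int, run_length: int = 1) -> int:
    """Bytes needed for one RLE run of dictionary indices.

    Assumes the RLE/bit-packed hybrid layout: a varint header holding
    (run_length << 1), then the index padded up to whole bytes.
    """
    # Bits needed to address every dictionary entry (simplified to >= 1).
    bit_width = max(1, (distinct_values - 1).bit_length())
    value_bytes = (bit_width + 7) // 8

    # Varint length of the header: 7 payload bits per byte.
    header = run_length << 1
    header_bytes = 1
    while header >= 0x80:
        header >>= 7
        header_bytes += 1

    return header_bytes + value_bytes


# Worst case from the issue: 30,000 distinct values, every run of length 1.
worst_case = rle_run_bytes(30_000)
```

With 30,000 distinct values the index needs 15 bits, so a run of length 1 costs a 1-byte header plus a 2-byte index, against 16 bytes in PLAIN. A writer that bit-packs non-repeating indices instead would need only 15 bits per value, so the estimate above is a conservative upper bound.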

I'm happy to do the work and make the pull request (I believe it's a pretty small change overall). I just wanted to understand whether there is any other reason this has not been implemented to date.
