Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Order chunks for compression by range_start #7148

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

RobAtticus
Copy link
Member

@RobAtticus RobAtticus commented Jul 23, 2024

Previously, when compressing chunks the order was non-deterministic; it was usaully ordered by chunk id, but not guaranteed. And while typically chunk id corresponds with range_start, it is not always so: backfilling chunks can lead to larger chunk ids for older chunks.

Generally, it seems preferable to compress chunks oldest to newest. When using the experimental compress_chunk_time_interval option, it is even more so preferable because it allows for the most efficient roll ups. If not, the chunks compressed out of order can "cut off" a rolled up chunk before it was completely full.

@RobAtticus RobAtticus force-pushed the rrk/order-compress-chunks branch from c7a5c90 to 13b9416 Compare July 23, 2024 18:27
Previously, when compressing chunks the order was non-deterministic;
it was usaully ordered by chunk id, but not guaranteed. And while
typically chunk id corresponds with range_start, it is not always
so: backfilling chunks can lead to larger chunk ids for older chunks.

Generally, it seems preferable to compress chunks oldest to newest.
When using the experimental compress_chunk_time_interval option, it
is even more so preferable because it allows for the most efficient
roll ups. If not, the chunks compressed out of order can "cut off"
a rolled up chunk before it was completely full.
@RobAtticus RobAtticus force-pushed the rrk/order-compress-chunks branch from 13b9416 to cea1cb5 Compare July 23, 2024 18:32
@RobAtticus RobAtticus marked this pull request as ready for review July 23, 2024 18:32
@svenklemm
Copy link
Member

Did you test this with hypertables with multiple dimensions?

@RobAtticus
Copy link
Member Author

I did not, but I'm realizing I want to probably just do it by the primary dimension

@antekresic
Copy link
Contributor

Is this going to cause even more problems for backilling? Meaning, ordering by range_start will probably make the backfilled chunks be compressed sooner, potentially even while they are being backfilled. This could happen even in current setup but now its guaranteed to be picked up sooner rather than later.

Thoughts on doing this ordering only if you are using compress_created_before to avoid said issues?

Other than that, I'm OK with this change.

Copy link
Contributor

@erimatnor erimatnor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good change, certainly better than no particular order. Wondering if it would be good to also add an option to do other ordering as a configuration option.

You might also want to consider ordering by all dimensions, to have more determinism with multi-dimensional partitioning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement]: Compression job should process chunks in order of range_start
4 participants