Issue 370: Add rollupSize as QbeastOption for writing and optimizing #375
Fixes #370
This feature allows you to use a value other than the `desiredCubeSize` for rollup when writing and optimizing.

## Background
Rollup is part of the write and optimization process in which small blocks are iteratively grouped with their parent blocks to create larger files. This is done mainly to address the small file problem: if blocks were written to output files independently, the number of small files in the table would soon become overwhelming, especially in streaming scenarios where the input batches contain relatively few rows and each block ends up with only a small number of records.
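The sketch below illustrates the grouping idea only; it is not the actual qbeast-spark implementation, and `CubeBlock`, `rollup`, and the depth-based ordering are made-up names for illustration. Starting from the deepest cubes, any block whose accumulated size is still below the rollup size is merged into its parent:

```scala
// Illustrative sketch only: group each cube's block with its parent until the
// accumulated size reaches the rollup size, so small blocks share an output file.
case class CubeBlock(id: String, parent: Option[String], recordCount: Long, depth: Int)

def rollup(blocks: Seq[CubeBlock], rollupSize: Long): Map[String, String] = {
  // Accumulated record counts, updated as children are rolled into their parents.
  val sizes = scala.collection.mutable.Map(blocks.map(b => b.id -> b.recordCount): _*)
  // Where each cube's records end up (initially, its own file).
  val destination = scala.collection.mutable.Map(blocks.map(b => b.id -> b.id): _*)

  // Visit cubes bottom-up (deepest first).
  for (block <- blocks.sortBy(-_.depth)) {
    block.parent.foreach { parent =>
      if (sizes(block.id) < rollupSize && sizes.contains(parent)) {
        // Too small to stand alone: roll this cube's records into the parent.
        sizes(parent) += sizes(block.id)
        destination(block.id) = parent
      }
    }
  }

  // Follow redirections to the final cube whose file will hold the records.
  def resolve(id: String): String =
    if (destination(id) == id) id else resolve(destination(id))

  blocks.map(b => b.id -> resolve(b.id)).toMap
}

// Example: with rollupSize = 3000, both small child blocks end up in root's file.
val grouped = rollup(
  Seq(
    CubeBlock("root", None, 8000, 0),
    CubeBlock("root/A", Some("root"), 2500, 1),
    CubeBlock("root/A/x", Some("root/A"), 400, 2)
  ),
  rollupSize = 3000
)
```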
If the `desiredCubeSize` happens to be similar to the number of records in the input batches, all blocks will be grouped, creating a single output file containing all of them. This harms the index's sampling efficiency, as the files created will have a large weight range.

Increasing the input batch size can alleviate the issue, but this is not always an option. A larger batch size means the ingestion process must wait for more records before writing to the table, which leads to higher ingestion latency and requires more computing resources to process the batches.
Another option is to reduce the `desiredCubeSize`, which can create too many cubes, entailing a larger metadata storage overhead. Moreover, changing the `desiredCubeSize` will create a separate `Revision`, which the user may or may not want, depending on their use case.

Using a custom `rollupSize` is handy in a scenario like this. Changing its value does not affect the topology of the index; rather, it only changes the way blocks are grouped.

The following figure illustrates such a case. When a larger `rollupSize` is used, all blocks from the same branch are written to the same file. On the other hand, when a smaller `rollupSize` is used, separate block sections can be written to separate files. If we were to sample with `f = 0.5`, "File_3.parquet" could be ignored.
## Improvements

The following are four tables created using 500 input batches, each with around 15,000 records. The different configurations of `rollupSize` yielded different results.

The first two cases use a `cubeSize` similar to the input batch size. A large `rollupSize` groups all blocks into the same file, and they all contribute to the sampling overhead with their redundant blocks.

Reducing the `rollupSize` to `3000` improves the sampling efficiency as it groups fewer blocks. More files are created per input batch, each with a smaller weight range.

A `desiredCubeSize` of `3000` creates more blocks than the first two cases.

## Usage
WARNING: The `rollupSize` is not stored in the `_delta_log/`, so it must be provided as an input option for each execution. The value defaults to the `desiredCubeSize` when not provided.

TODO:
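A minimal write sketch, assuming the option key is `rollupSize` as in the PR title; the paths, column names, and sizes are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val batch = spark.read.parquet("/tmp/input-batch") // hypothetical input batch

batch.write
  .mode("append")
  .format("qbeast")
  .option("columnsToIndex", "user_id,price") // illustrative column names
  .option("cubeSize", "15000")               // desiredCubeSize, similar to the batch size
  .option("rollupSize", "3000")              // not persisted in _delta_log/: pass it on every write
  .save("/tmp/qbeast-table")
```

Since the value is not recovered from the table metadata, it would presumably have to be supplied again when running an optimization as well.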