Issue 370: Add rollupSize as QbeastOption for writing and optimizing #375
Fixes #370
This feature allows you to use a value other than the `desiredCubeSize` for rollup when writing and optimizing.

## Background
Rollup is part of the write and optimization process in which small blocks are iteratively grouped with their parent blocks to create larger files. This is done mainly to address the small file problem: if blocks were written to output files independently, the number of small files in the table would soon become overwhelming, especially in streaming scenarios where the input batches contain relatively few rows and each block ends up with only a small number of records.
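The sketch below illustrates the grouping idea only; it is not the actual qbeast-spark implementation, and `CubeBlock`, `rollup`, and the depth-based ordering are made-up names for illustration. Starting from the deepest cubes, any block whose accumulated size is still below the rollup size is merged into its parent:

```scala
// Illustrative sketch only: group each cube's block with its parent until the
// accumulated size reaches the rollup size, so small blocks share an output file.
case class CubeBlock(id: String, parent: Option[String], recordCount: Long, depth: Int)

def rollup(blocks: Seq[CubeBlock], rollupSize: Long): Map[String, String] = {
  // Accumulated record counts, updated as children are rolled into their parents.
  val sizes = scala.collection.mutable.Map(blocks.map(b => b.id -> b.recordCount): _*)
  // Where each cube's records end up (initially, its own file).
  val destination = scala.collection.mutable.Map(blocks.map(b => b.id -> b.id): _*)

  // Visit cubes bottom-up (deepest first).
  for (block <- blocks.sortBy(-_.depth)) {
    block.parent.foreach { parent =>
      if (sizes(block.id) < rollupSize && sizes.contains(parent)) {
        // Too small to stand alone: roll this cube's records into the parent.
        sizes(parent) += sizes(block.id)
        destination(block.id) = parent
      }
    }
  }

  // Follow redirections to the final cube whose file will hold the records.
  def resolve(id: String): String =
    if (destination(id) == id) id else resolve(destination(id))

  blocks.map(b => b.id -> resolve(b.id)).toMap
}

// Example: with rollupSize = 3000, both small child blocks end up in root's file.
val grouped = rollup(
  Seq(
    CubeBlock("root", None, 8000, 0),
    CubeBlock("root/A", Some("root"), 2500, 1),
    CubeBlock("root/A/x", Some("root/A"), 400, 2)
  ),
  rollupSize = 3000
)
```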
If the `desiredCubeSize` happens to be similar to the number of records in the input batches, all blocks will be grouped, creating a single output file containing all of them. This harms the index's sampling efficiency, as the files created will have a large weight range.

Increasing the input batch size can alleviate the issue, but this is not always an option. A larger batch size means the ingestion process must wait for more records before writing to the table, which leads to higher ingestion latency and requires more computing resources to process the batches.
Another option is to reduce the `desiredCubeSize`, which can create too many cubes, entailing a larger metadata storage overhead. Moreover, changing the `desiredCubeSize` will create a separate `Revision`, which the user may or may not want, depending on their use case.

Using a custom `rollupSize` is handy in a scenario like this. Changing its value does not affect the topology of the index; rather, it only changes the way blocks are grouped.

The following figure illustrates such a case. When a larger `rollupSize` is used, all blocks from the same branch are written to the same file. On the other hand, when a smaller `rollupSize` is used, separate block sections can be written to separate files. If we were to sample with `f = 0.5`, "File_3.parquet" could be ignored.
## Improvements

The following are four tables created using 500 input batches, each with around 15,000 records. The different configurations of `rollupSize` yielded different results.

The first two cases use a `cubeSize` similar to the input batch size. A large `rollupSize` groups all blocks into the same file, and they all contribute to the sampling overhead with their redundant blocks.

Reducing the `rollupSize` to `3000` improves the sampling efficiency as it groups fewer blocks. More files are created per input batch, each with a smaller weight range.

A `desiredCubeSize` of `3000` creates more blocks than the first two cases.

## Usage
WARNING: The `rollupSize` is not stored in the `_delta_log/`, so it must be provided as an input option for each execution. The value defaults to the `desiredCubeSize` when not provided.

TODO:
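A minimal write sketch, assuming the option key is `rollupSize` as in the PR title; the paths, column names, and sizes are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val batch = spark.read.parquet("/tmp/input-batch") // hypothetical input batch

batch.write
  .mode("append")
  .format("qbeast")
  .option("columnsToIndex", "user_id,price") // illustrative column names
  .option("cubeSize", "15000")               // desiredCubeSize, similar to the batch size
  .option("rollupSize", "3000")              // not persisted in _delta_log/: pass it on every write
  .save("/tmp/qbeast-table")
```

Since the value is not recovered from the table metadata, it would presumably have to be supplied again when running an optimization as well.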