Issue 370: Add rollupSize as QbeastOption for writing and optimizing #375

Closed

@Jiaweihu08 Jiaweihu08 commented Aug 1, 2024

Fixes #370

This feature allows you to use a value other than the desiredCubeSize for rollup when writing and optimizing.

Background

Rollup is the part of the write and optimization process in which small blocks are iteratively grouped with their parent blocks to create larger files, mainly to address the small file problem. If blocks were written to output files independently, the number of small files in the table would quickly become unmanageable, especially in a streaming scenario where the input batches contain relatively few rows and each block holds only a small number of records.

If the desiredCubeSize happens to be similar to the number of records in the input batches, all blocks will be grouped, creating a single output file with all the blocks. This harms the index's sampling efficiency, as the files created will have a large weight range.
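
The grouping behavior described above can be illustrated with a toy model. This is a minimal sketch and not the Qbeast implementation: cubes are represented as path strings (`""` for the root, `"0"` and `"1"` for its children, and so on), and `RollupSketch`, `parent`, and `rollup` are hypothetical names. The threshold parameter is called `rollupSize` here; by default it would equal the desiredCubeSize.

```scala
import scala.collection.mutable

// Toy model of rollup. NOT the Qbeast implementation; all names are hypothetical.
object RollupSketch {

  // A cube is identified by its path from the root: "" is the root, "01" is a child of "0".
  def parent(cube: String): Option[String] =
    if (cube.isEmpty) None else Some(cube.dropRight(1))

  /** Maps every input cube to the cube whose group (output file) it ends up in. */
  def rollup(blockSizes: Map[String, Long], rollupSize: Long): Map[String, String] = {
    // Group root -> (member cubes, total number of records)
    val groups = mutable.Map.empty[String, (Set[String], Long)]
    blockSizes.foreach { case (cube, size) => groups(cube) = (Set(cube), size) }

    // Max-heap on path length, so the deepest cubes are processed first
    val queue =
      mutable.PriorityQueue(blockSizes.keys.toSeq: _*)(Ordering.by[String, Int](_.length))
    val result = mutable.Map.empty[String, String]

    while (queue.nonEmpty) {
      val cube = queue.dequeue()
      val (members, size) = groups(cube)
      parent(cube) match {
        // The group is still too small: merge it into the parent's group
        case Some(p) if size < rollupSize =>
          groups.get(p) match {
            case Some((pMembers, pSize)) =>
              groups(p) = (pMembers ++ members, pSize + size)
            case None =>
              groups(p) = (members, size)
              queue.enqueue(p)
          }
          groups.remove(cube)
        // The group is large enough, or the root was reached: finalize it
        case _ =>
          members.foreach(m => result(m) = cube)
      }
    }
    result.toMap
  }
}
```

In this model, block sizes of 5000 (root), 4000 (`"0"`), 3000 (`"1"`), 2000 (`"00"`), and 1000 (`"01"`) with a `rollupSize` of 15000 all end up in a single group at the root, which mirrors the single-file outcome described above. The same input with a `rollupSize` of 3000 produces three groups, rooted at `""`, `"0"`, and `"1"`.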

Increasing the input batch size can alleviate the issue, but this is not always an option. A larger input size means the ingestion process must wait for more records before writing to the table. This leads to a higher ingestion latency and will require more computing resources to process the batches.

Another option is to reduce the desiredCubeSize, which can create too many cubes, entailing a larger metadata storage overhead. Moreover, changing the desiredCubeSize will create a separate Revision, which the user may or may not want, depending on their use case.

Using a custom rollupSize is handy for a scenario like this. Changing its value does not affect the topology of the index; it only changes the way blocks are grouped.

The following figure illustrates such a case. When a larger rollupSize is used, all blocks from the same branch are written to the same file. On the other hand, when a smaller rollupSize is used, separate block sections can be written to separate files. If we were to sample with f = 0.5, "File_3.parquet" could be ignored.

[Figure: cubeSize]
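
The file-skipping effect can be sketched in a few lines. This is a simplified model under stated assumptions, not Qbeast's query code: weights are assumed to be normalized to [0, 1], a sample with fraction f is assumed to need only files whose minimum weight does not exceed f, and `FileRange`, `filesToRead`, and the weight values in the usage note are hypothetical.

```scala
// Simplified model of file skipping during sampling. NOT Qbeast's query code.
object SamplingSketch {

  // Hypothetical summary of a file: the normalized weight range [minWeight, maxWeight]
  // covered by the blocks it contains.
  final case class FileRange(name: String, minWeight: Double, maxWeight: Double)

  /** Names of the files a sample with fraction f would have to read in this model. */
  def filesToRead(files: Seq[FileRange], f: Double): Seq[String] =
    files.filter(_.minWeight <= f).map(_.name)
}
```

For example, given hypothetical ranges of [0.0, 0.2] for "File_1.parquet", [0.2, 0.45] for "File_2.parquet", and [0.6, 1.0] for "File_3.parquet", calling `filesToRead` with f = 0.5 returns only the first two files. A single file covering [0.0, 1.0], as produced by a large rollupSize, could never be skipped.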

Improvements

The following are four tables created using 500 input batches, each with around 15,000 records. The different configurations of rollupSize yielded different results.

The first two cases use a cubeSize similar to the input batch size. A large rollupSize groups all blocks into the same file, and they all contribute to the sampling overhead with their redundant blocks.

Reducing the rollupSize to 3000 improves the sampling efficiency as it groups fewer blocks. More files are created per input batch, each with a smaller weight range.

A desiredCubeSize of 3000 creates more blocks than the first two cases.

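| cubeSize | rollupSize | 0.001% sampleFraction | 1% sampleFraction | 10% sampleFraction |
|----------|------------|-----------------------|-------------------|--------------------|
| 15000    | 15000      | 16.84%                | 100.00%           | 100.00%            |
| 15000    | 3000       | 1.65%                 | 34.30%            | 82.77%             |
| 3000     | 3000       | 1.77%                 | 78.43%            | 95.13%             |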

Usage

WARNING: The rollupSize is not stored in the _delta_log/, so it must be provided as an input option for each execution. When it is not provided, the value defaults to the desiredCubeSize.
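
The fallback rule in the warning can be expressed as a small helper. This is a minimal sketch, not the code added by this PR: `RollupOptionSketch` and `resolveRollupSize` are hypothetical names, and the options are assumed to arrive as a `Map[String, String]`, as in the optimization example.

```scala
// Sketch of the defaulting rule only. NOT the code added by this PR.
object RollupOptionSketch {

  /** Returns the "rollupSize" option if present, otherwise the table's desiredCubeSize. */
  def resolveRollupSize(options: Map[String, String], desiredCubeSize: Int): Int =
    options.get("rollupSize").map(_.trim.toInt).getOrElse(desiredCubeSize)
}
```

Because nothing is persisted, calling `resolveRollupSize(Map.empty, 15000)` falls back to 15000 even if an earlier write to the same table used 3000.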

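  1. Writing

```scala
df
  .write
  .format("qbeast")
  .option("columnsToIndex", "col_1,col_2")
  .option("cubeSize", 15000)
  // Group blocks into files using 3000 instead of the cubeSize
  .option("rollupSize", 3000)
  .save(tablePath)
```

  2. Optimization

```scala
val qt = QbeastTable.forPath(spark, tablePath)
// rollupSize is not persisted, so it is passed again here
val optimizationOptions = Map("rollupSize" -> "3000")
// filePaths: the files to optimize, defined by the caller
qt.optimize(filePaths, optimizationOptions)
```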

TODO:

  • Update the documentation

@Jiaweihu08 Jiaweihu08 closed this Aug 2, 2024
@Jiaweihu08 Jiaweihu08 deleted the 370-rollupCubeSize-option branch August 2, 2024 16:25
@Jiaweihu08 Jiaweihu08 restored the 370-rollupCubeSize-option branch August 2, 2024 16:26
@Jiaweihu08 Jiaweihu08 reopened this Aug 2, 2024
@Jiaweihu08 Jiaweihu08 changed the title Issue 370: Add rollupCubeSize as QbeastOption for writing and optimizing Issue 370: Add rollupSize as QbeastOption for writing and optimizing Aug 5, 2024
@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review August 5, 2024 13:32
@Jiaweihu08 Jiaweihu08 requested review from fpj and cugni August 5, 2024 13:32
@Jiaweihu08 Jiaweihu08 self-assigned this Aug 5, 2024
@Jiaweihu08 Jiaweihu08 closed this Aug 9, 2024
Development

Successfully merging this pull request may close these issues.

Add Option for rollupCubeSize