Issue #416: Add CDFQuantile Transformers and Transformations #413

Merged
merged 54 commits into from
Oct 1, 2024

Conversation

osopardo1

@osopardo1 osopardo1 commented Sep 13, 2024

Description

Adds #416 (prior #338)

  • Adds interfaces for CDFQuantileTransformer and CDFQuantileTransformation.
  • Replaces StringHistogramTransformer with CDFStringQuantileTransformer.
  • Replaces StringHistogramTransformation with CDFStringQuantileTransformation.
  • Adds CDFNumericQuantileTransformer and CDFNumericQuantileTransformation.
  • Adds ManualPlaceholderTransformation and ManualColumnStats to control how the transformations are initialized.
  • The quantiles SHOULD be added through columnStats.
  • The number of quantiles in the QbeastUtils method defaults to 50.
  • The relative error for the quantile calculation in the QbeastUtils method defaults to 0.1.
  • To trigger a new Revision (changing index parameters), we can pass a different set of quantiles in the columnStats option.

API

  1. Compute the Quantiles of a column (strictly as Numeric or String) with QbeastUtils.computeQuantilesForColumn method:
  /**
   * Compute the quantiles for a given column
   * ...
   * @param df
   *   DataFrame
   * @param columnName
   *   Column name
   * @param numberOfQuantiles
   *   Number of Quantiles, default is 50
   * @param relativeError
   *   Relative Error, default is 0.1
   * @return
   */
  def computeQuantilesForColumn(
      df: DataFrame,
      columnName: String,
      numberOfQuantiles: Int = 50,
      relativeError: Double = 0.1)

val df = spark.range(0, 100).toDF("a")
val columnName = "a"
val columnQuantiles =
  QbeastUtils.computeQuantilesForColumn(df = df, columnName = columnName)
val columnQuantilesNumberOfQuantiles =   
  QbeastUtils.computeQuantilesForColumn(df = df, columnName = columnName, numberOfQuantiles = 100)
val columnQuantilesRelativeError =   
  QbeastUtils.computeQuantilesForColumn(df = df, columnName = columnName, relativeError = 0.3)
val columnQuantilesNumAndError =
  QbeastUtils.computeQuantilesForColumn(df = df, columnName = columnName, numberOfQuantiles = 100, relativeError = 0.3)
  2. Index the data, indicating the column type to index and the columnStats option.
df.write
  .mode("overwrite")
  .format("qbeast")
  .option("cubeSize", "30000")
  .option("columnsToIndex", s"$columnName:quantiles")
  .option("columnStats", s"""{"${columnName}_quantiles":$columnQuantiles}""")
  .save("/tmp/test-quantiles")
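
To make the numberOfQuantiles parameter more concrete, here is a minimal sketch in plain Scala of what a set of N quantiles looks like on a small local sample. QuantileSketch, probabilities and exactQuantiles are hypothetical names used only for illustration; they are not part of QbeastUtils, and the evenly spaced probabilities are an assumption about how the quantiles are chosen. On a DataFrame, Spark's df.stat.approxQuantile(col, probabilities, relativeError) computes an approximate equivalent; whether computeQuantilesForColumn relies on it internally is also an assumption.

```scala
// Hypothetical helper for illustration only; not part of QbeastUtils.
object QuantileSketch {

  /** Evenly spaced probabilities in [0, 1]; n = 5 gives 0.0, 0.25, 0.5, 0.75, 1.0. */
  def probabilities(numberOfQuantiles: Int): Seq[Double] = {
    require(numberOfQuantiles >= 2, "need at least two quantiles (min and max)")
    (0 until numberOfQuantiles).map(i => i.toDouble / (numberOfQuantiles - 1))
  }

  /** Exact quantiles: for each probability p, pick the sorted element closest to position p * (n - 1). */
  def exactQuantiles(values: Seq[Double], numberOfQuantiles: Int): Seq[Double] = {
    require(values.nonEmpty, "values must not be empty")
    val sorted = values.sorted
    probabilities(numberOfQuantiles).map { p =>
      val index = math.round(p * (sorted.length - 1)).toInt
      sorted(index)
    }
  }
}
```

A larger numberOfQuantiles gives a finer description of the distribution, while a larger relativeError in the approximate version trades accuracy for speed.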

Example with default configuration:

import io.qbeast.spark.utils.QbeastUtils
import spark.implicits._

val df = spark.range(0, 100).toDF("a")
val columnName = "a"
val columnQuantiles =
  QbeastUtils.computeQuantilesForColumn(df, columnName)

df.write
  .mode("overwrite")
  .format("qbeast")
  .option("cubeSize", "30000")
  .option("columnsToIndex", s"$columnName:quantiles")
  .option("columnStats", s"""{"${columnName}_quantiles":$columnQuantiles}""")
  .save("/tmp/test-quantiles")

Be mindful that this interface is evolving and might change in the near future.

Trigger updated quantiles

To trigger an update of the Revision with a new set of quantiles (for example, when the data distribution changes drastically), it is enough to pass a different set of columnStats.
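
As a rough illustration, the columnStats value for a new set of quantiles can be assembled with a small helper. ColumnStatsSketch and its methods are hypothetical names, and rendering the quantiles as a JSON array of numbers is an assumption; only the "<column>_quantiles" key format comes from the examples in this description.

```scala
// Hypothetical helper for illustration only; not part of the Qbeast API.
object ColumnStatsSketch {

  /** Renders quantiles as a JSON array, e.g. Seq(0.0, 50.0) becomes [0.0,50.0]. */
  def toJsonArray(quantiles: Seq[Double]): String =
    quantiles.mkString("[", ",", "]")

  /** Produces {"<column>_quantiles":[...]}, the key format used in this PR's examples. */
  def quantilesStats(columnName: String, quantiles: Seq[Double]): String =
    s"""{"${columnName}_quantiles":${toJsonArray(quantiles)}}"""
}
```

The resulting string would be passed as .option("columnStats", ...) on a later write. A value that differs from the one stored in the current Revision is what is expected to produce the new Revision.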

Type of change

Developing a new feature.

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add logging to the code following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

Please describe the tests that you ran to verify your changes.

Unit tests are in the package io.qbeast.core.transform under QuantileTransformationTest and QuantileTransformerTest.

For integration testing, we have:
- io.qbeast.spark.index.model.transformer.QuantileTransformerIndexingTest, which checks the final index distribution (TODO)
- io.qbeast.spark.index.SparkRevisionFactoryTest for testing the correct creation of the transformations in a Revision.

Tested with the NYC Taxi Dataset. These are the index metrics for an index using Quantiles versus the default linear indexing.

Metrics with Linear Transformation

OTree Index Metrics:
revisionId: 1
elementCount: 77966324
dimensionCount: 4
desiredCubeSize: 3248596
indexingColumns: PULocationID:linear,DOLocationID:linear,tpep_pickup_datetime:linear,tpep_dropoff_datetime:linear
height: 5 (3)
avgFanout: 5.2 (16.0)
cubeCount: 79
blockCount: 79
fileCount: 15
bytes: 1412854170

Metrics with Quantiles Transformation

OTree Index Metrics:
revisionId: 1
elementCount: 77966324
dimensionCount: 4
desiredCubeSize: 3248596
indexingColumns: PULocationID:quantiles,DOLocationID:quantiles,tpep_pickup_datetime:quantiles,tpep_dropoff_datetime:quantiles
height: 3 (3)
avgFanout: 12.0 (16.0)
cubeCount: 109
blockCount: 109
fileCount: 9
bytes: 1398833883 

We can see that the Quantiles approach produces a shallower tree (height 3 vs. 5) with a higher avgFanout (12.0 vs. 5.2), that is, a more evenly balanced index than the one written using a Linear Transformation.
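
One way to compare the two runs is to relate avgFanout to the maximum possible fanout. The sketch below assumes that the value in parentheses (16.0) is the theoretical fanout 2^dimensionCount, which matches 2^4 here but is not stated in the metrics output; FanoutSketch is a hypothetical name.

```scala
// Hypothetical helper for illustration only.
object FanoutSketch {

  /** Maximum children per cube, assuming every dimension is split in two. */
  def maxFanout(dimensionCount: Int): Double = math.pow(2, dimensionCount)

  /** Fraction of the possible child cubes that exist on average. */
  def fanoutRatio(avgFanout: Double, dimensionCount: Int): Double =
    avgFanout / maxFanout(dimensionCount)
}
```

With the metrics above, FanoutSketch.fanoutRatio(5.2, 4) is about 0.33 for the Linear Transformation and FanoutSketch.fanoutRatio(12.0, 4) is 0.75 for Quantiles.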

@osopardo1

Some comments and TODOs:

  • It seems like QuantileTransformation and HistogramTransformation map the values in the same way. This is because the scope of the transformation is not clear: we should use the CDF as the abstraction for both of them.
  • The only difference between one and the other is the use of strings. And this only affects the way of calculating the bins.
  • The bins should be computed during the analyze part, not in an external process.
  • This PR & issue would only tackle the reorganization & renaming of the code.

This is the new proposed structure:

  • CDFTransformation and Transformer: abstract classes defining the parameters that our transformation methodology needs to save.
  • CDFQuantileTransformation and Transformer: both StringHistogram and Quantile Transformations would be grouped in a single class.
  • For the future, we would add other types of CDF implementations, such as Histograms or Sketches.
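
The idea of grouping the string and numeric variants behind one CDF abstraction can be sketched as follows. All names here (CdfQuantileSketch, NumericCdfSketch, StringCdfSketch) are hypothetical and the step-function mapping is an assumption chosen for simplicity; this is not the code added in this PR. The point is that both variants share the mapping logic and differ only in how values are ordered.

```scala
// Hypothetical names; a sketch of the idea, not the classes added in this PR.
trait CdfQuantileSketch[T] {

  /** Quantile boundaries, sorted ascending. */
  def quantiles: IndexedSeq[T]

  /** How values of type T are compared; the only part that differs per type. */
  def ordering: Ordering[T]

  /** Maps a value to [0, 1] based on how many quantile boundaries are at or below it. */
  def transform(value: T): Double = {
    require(quantiles.length >= 2, "need at least two quantiles")
    // A linear scan keeps the sketch short; a binary search would also work.
    val atOrBelow = quantiles.count(q => ordering.lteq(q, value))
    val position = (atOrBelow - 1).toDouble / (quantiles.length - 1)
    math.max(0.0, math.min(1.0, position)) // clamp values outside the known range
  }
}

final case class NumericCdfSketch(quantiles: IndexedSeq[Double])
    extends CdfQuantileSketch[Double] {
  val ordering: Ordering[Double] = new Ordering[Double] {
    def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
  }
}

final case class StringCdfSketch(quantiles: IndexedSeq[String])
    extends CdfQuantileSketch[String] {
  val ordering: Ordering[String] = Ordering.String
}
```

For example, NumericCdfSketch(Vector(0.0, 25.0, 50.0, 75.0, 100.0)).transform(50.0) and StringCdfSketch(Vector("a", "g", "m", "s", "z")).transform("m") both land in the middle of the [0, 1] range.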

@osopardo1 osopardo1 mentioned this pull request Sep 16, 2024
@osopardo1 osopardo1 changed the title Issue #338: Quantile Transformers API Issue #416: Refactor Transformers for CDF Sep 16, 2024
@osopardo1 osopardo1 changed the title Issue #416: Refactor Transformers for CDF Issue #416: Refactor Transformers to include CDF Sep 16, 2024
@osopardo1 osopardo1 changed the title Issue #416: Refactor Transformers to include CDF Issue #416: Refactor Transformers to use CDF Quantiles Sep 16, 2024
@osopardo1 osopardo1 changed the title Issue #416: Refactor Transformers to use CDF Quantiles Issue #416: Add CDF Transformers and Transformations Sep 16, 2024
@osopardo1 osopardo1 marked this pull request as ready for review September 17, 2024 12:19

@cugni cugni left a comment


Looks good, but there are many changes, thus it is hard to review.
I was afraid the new tests were going to slow down the CI, but it doesn't seem so.


@Jiaweihu08 Jiaweihu08 left a comment


A few more comments.

@Jiaweihu08 Jiaweihu08 merged commit c70670f into Qbeast-io:main Oct 1, 2024
1 check passed
JosepSampe added a commit that referenced this pull request Oct 24, 2024
* Issue #424: Add sampling fraction option for optimization (#426)

* Add sampling fraction option for optimization and remove analyze from QbeastTable

* Issue #430: Simplify denormalized blocks creation (#431)

* Simplify Denormalized Blocks

* Issue #416: Add CDFQuantile Transformers and Transformations (#413)

* Issue 264: Update qviz for multiblock files (#437)

* Update Qbeast Visualiser (qviz) with multiblock files

---------

Co-authored-by: Jorge Marín <jorge.marin.rodenas@estudiantat.upc.edu>
Co-authored-by: Jorge Marín <100561030+jorgeMarin1@users.noreply.github.com>

* Issue #441: Fix dataChange flag in optimize (#444)

* Merge from main branch

---------

Co-authored-by: jiawei <47899566+Jiaweihu08@users.noreply.github.com>
Co-authored-by: Paola Pardo <paolapardoat@gmail.com>
Co-authored-by: Jorge Marín <jorge.marin.rodenas@estudiantat.upc.edu>
Co-authored-by: Jorge Marín <100561030+jorgeMarin1@users.noreply.github.com>
JosepSampe pushed a commit to JosepSampe/qbeast-spark that referenced this pull request Oct 24, 2024
3 participants