Issue #416: Add CDFQuantile Transformers and Transformations #413
# Conflicts: # src/test/scala/io/qbeast/core/transform/HistogramTransformationTest.scala # src/test/scala/io/qbeast/spark/index/TransformerIndexingTest.scala
Some comments TODOs:
This is the new proposed structure:
Looks good, but there are many changes, thus it is hard to review.
I was afraid the new tests were going to slow down the CI, but it doesn't seem so.
src/main/scala/io/qbeast/core/transform/ManualPlaceholderTransformation.scala
src/test/scala/io/qbeast/spark/index/model/transformer/CDFNumericQuantilesIndexingTest.scala
A few more comments.
src/main/scala/io/qbeast/core/transform/CDFQuantilesTransformation.scala
src/main/scala/io/qbeast/core/transform/CDFStringQuantilesTransformation.scala
* Issue #424: Add sampling fraction option for optimization (#426)
* Add sampling fraction option for optimization and remove analyze from QbeastTable
* Issue #430: Simplify denormalized blocks creation (#431)
* Simplify Denormalized Blocks
* Issue #416: Add CDFQuantile Transformers and Transformations (#413)
* Issue 264: Update qviz for multiblock files (#437)
* Update Qbeast Visualiser (qviz) with multiblock files
---------
Co-authored-by: Jorge Marín <jorge.marin.rodenas@estudiantat.upc.edu>
Co-authored-by: Jorge Marín <100561030+jorgeMarin1@users.noreply.github.com>
* Issue #441: Fix dataChange flag in optimize (#444)
* Merge from main branch
---------
Co-authored-by: jiawei <47899566+Jiaweihu08@users.noreply.github.com>
Co-authored-by: Paola Pardo <paolapardoat@gmail.com>
Co-authored-by: Jorge Marín <jorge.marin.rodenas@estudiantat.upc.edu>
Co-authored-by: Jorge Marín <100561030+jorgeMarin1@users.noreply.github.com>
Description
Adds #416 (prior #338)
- CDFQuantileTransformer and CDFQuantileTransformation.
- StringHistogramTransformation for CDFStringQuantileTransformer.
- StringHistogramTransformation for CDFStringQuantileTransformation.
- CDFNumericQuantileTransformer and CDFNumericQuantileTransformation.
- ManualPlaceholderTransformation and ManualColumnStats to control how the transformations are initialized.
- columnStats.
- number of quantiles in the QbeastUtils method: 50 by default.
- relative error for the quantiles calculation in the QbeastUtils method: 0.1 by default.
- columnStats option.
API
QbeastUtils.computeQuantilesForColumn method:
Example with default configuration:
Trigger updated quantiles
To trigger an update of the Revision with a new set of quantiles (for example, when the data distribution changes drastically), it is enough to provide a different set of columnStats.
Type of change
Developing a new feature.
Checklist:
Here is the list of things you should do before submitting this pull request:
How Has This Been Tested? (Optional)
Please describe the tests that you ran to verify your changes.
Unit tests are in the package io.qbeast.core.transform, under QuantileTransformationTest and QuantileTransformerTest.
For integration testing, we have:
- io.qbeast.spark.index.model.transformer.QuantileTransformerIndexingTest, which checks the final index distribution (TODO)
- io.qbeast.spark.index.SparkRevisionFactoryTest, which tests the correct creation of the transformations in a Revision.
Tested with the NYC Taxi Dataset. These are the output index metrics for an index using Quantiles vs. the default linear indexing.
Metrics with Linear Transformation
Metrics with Quantiles Transformation
We can see that the avgFanout and the height generated by the Quantiles approach are more even than those of the index written using a Linear Transformation.
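To illustrate why a quantile-based mapping can spread skewed data more evenly than a linear one, here is a minimal, self-contained Scala sketch. It does not use the Qbeast API and is not the implementation in this PR: the object and method names (QuantileMappingSketch, linear, quantiles, cdf, bucketCounts), the synthetic cubic data, and the use of exact empirical quantiles instead of an approximate computation are all assumptions made for illustration.

```scala
object QuantileMappingSketch {

  // Linear mapping: relative position of v inside [min, max], clamped to [0, 1].
  def linear(v: Double, min: Double, max: Double): Double =
    if (max == min) 0.0 else ((v - min) / (max - min)).max(0.0).min(1.0)

  // Hypothetical helper: k evenly spaced empirical quantile boundaries,
  // taken from the sorted data (exact, not approximate).
  def quantiles(data: Seq[Double], k: Int): Vector[Double] = {
    require(k >= 2 && data.nonEmpty, "need at least 2 quantiles and some data")
    val sorted = data.sorted.toVector
    Vector.tabulate(k) { i =>
      val idx = math.round(i.toDouble / (k - 1) * (sorted.length - 1)).toInt
      sorted(idx)
    }
  }

  // CDF-style mapping: index of the last boundary <= v, scaled to [0, 1].
  // Values below the first boundary map to 0.0.
  def cdf(v: Double, qs: Vector[Double]): Double = {
    val idx = qs.lastIndexWhere(_ <= v)
    if (idx < 0) 0.0 else idx.toDouble / (qs.length - 1)
  }

  // Count how many mapped values (each in [0, 1]) fall into each of
  // `buckets` equal-width intervals.
  def bucketCounts(mapped: Seq[Double], buckets: Int): Vector[Int] = {
    val counts = Array.fill(buckets)(0)
    mapped.foreach { m =>
      val b = math.min((m * buckets).toInt, buckets - 1)
      counts(b) += 1
    }
    counts.toVector
  }

  def main(args: Array[String]): Unit = {
    // Synthetic, heavily skewed data: most values are small relative to the max.
    val data = (1 to 1000).map(i => math.pow(i.toDouble, 3))

    val linearMapped = data.map(linear(_, data.min, data.max))
    val qs = quantiles(data, 50)
    val cdfMapped = data.map(cdf(_, qs))

    println(s"linear buckets:   ${bucketCounts(linearMapped, 4)}")
    println(s"quantile buckets: ${bucketCounts(cdfMapped, 4)}")
  }
}
```

In this sketch the linear mapping crowds most values into the first bucket, while the quantile mapping yields bucket counts that are much closer to each other. That is the same effect the index metrics point to, though the real index structure is more involved than equal-width buckets.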