Add RoaringBitmapArray to store indices of rows deleted in a data file #1486
Adds a new bitmap implementation called `RoaringBitmapArray`, which replaces usage of `Roaring64Bitmap` in deletion vectors. This implementation is optimized for the use case of handling row indices, which are always clustered between 0 and the index of the last row of a file, as opposed to being arbitrarily sparse over the whole `Long` space. The implementation is much simpler than `Roaring64Bitmap` and less error-prone, as well as faster.
GitOrigin-RevId: 862989348614520672065786c5607f0ade7a93e7
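To illustrate why clustered row indices allow a simpler structure, here is a minimal sketch and not the code from this PR. It assumes each 64-bit index is split into its high and low 32 bits, with the high half used as a position in a plain array of per-bucket sets. The class name `BucketedBitmapSketch` and its methods are hypothetical, and a `mutable.HashSet[Int]` stands in for a 32-bit roaring bitmap so that the example needs no external dependency.

```scala
import scala.collection.mutable

/**
 * Illustrative sketch only (hypothetical names, not the PR's implementation).
 * Values sharing the same high 32 bits land in the same bucket; the bucket
 * stores only their low 32 bits. A HashSet[Int] stands in for a 32-bit bitmap.
 */
final class BucketedBitmapSketch {
  private var buckets: Array[mutable.HashSet[Int]] = Array.empty

  private def high(value: Long): Int = (value >>> 32).toInt
  private def low(value: Long): Int = value.toInt // low 32 bits, kept as a bit pattern

  def add(value: Long): Unit = {
    require(value >= 0, "row indices are non-negative")
    val h = high(value)
    if (h >= buckets.length) {
      // Grow the array so that bucket h exists. Row indices start at 0 and are
      // dense, so h stays tiny in the intended use case.
      buckets = buckets ++ Array.fill(h + 1 - buckets.length)(mutable.HashSet.empty[Int])
    }
    buckets(h).add(low(value))
  }

  def contains(value: Long): Boolean =
    value >= 0 && {
      val h = high(value)
      h < buckets.length && buckets(h).contains(low(value))
    }

  def cardinality: Long = buckets.iterator.map(_.size.toLong).sum
}

object BucketedBitmapSketchDemo extends App {
  val deleted = new BucketedBitmapSketch
  Seq(0L, 5L, 5L, (1L << 32) + 7L).foreach(deleted.add)
  assert(deleted.contains(5L))
  assert(!deleted.contains(6L))
  assert(deleted.contains((1L << 32) + 7L))
  assert(deleted.cardinality == 3L)
}
```

Because the high half is a direct array position, a lookup is a bounds check plus one bucket query. That is only reasonable when values are packed close to 0, which is the property of row indices that the description relies on.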
Can you crosslink to the issue in the description as well?
Linked to the main issue which also contains the small project plan
}

/** Add all values in `range` to the container. */
def addRange(range: NumericRange[Long]): Unit = {
This is a lot of methods... do you think you will need all of these methods for the full DV implementation?
These are going to be mostly used in tests when you can generate a custom DV for a given set of indices. Let me mark it as for tests only.
core/src/main/scala/org/apache/spark/sql/delta/deletionvectors/RoaringBitmapArray.scala
Not 100% sure if all of the methods you have added are going to be useful, but I trust your plan in the issue. At the end of all your PRs, it will be good to double-check whether all the methods are actually being used and remove any crud.
Description
This PR is part of the feature: Support reading Delta tables with deletion vectors (more at #1485)
Adds a new bitmap implementation called `RoaringBitmapArray`. This will be used to encode the deleted row indices. There already exists a `Roaring64Bitmap` provided by the `org.roaringbitmap` library, but this implementation is optimized for the use case of handling row indices, which are always clustered between 0 and the index of the last row of a file, as opposed to being arbitrarily sparse over the whole `Long` space.
How was this patch tested?
Unit tests
Does this PR introduce any user-facing changes?
No