Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RoaringBitmapArray to store indices of rows deleted in a data file #1486

Closed
wants to merge 2 commits into from

Conversation

vkorukanti
Copy link
Collaborator

@vkorukanti vkorukanti commented Nov 16, 2022

Description

This PR is part of the feature: Support reading Delta tables with deletion vectors (more at #1485)

Adds a new bitmap implementation called RoaringBitmapArray. This will be used to encode the deleted row indices. There already exists a Roaring64Bitmap provided by the org.roaringbitmap library , but this implementation is optimized for use case of handling row indices, which are always clustered between 0 and the index of the last row number of a file, as opposed to being arbitrarily sparse over the whole Long space.

How was this patch tested?

Unit tests

Does this PR introduce any user-facing changes?

No

Adds a new bitmap implementation called `RoaringBitmapArray`, which replaces usage
of `Roaring64Bitmap` in deletion vectors. This implementation is optimized for use
case of handling row indices, which are always clustered between 0 and the index
of the last row number of a file, as opposed to being arbitrarily sparse over the
whole `Long` space. The implementation is much simpler than
`Roaring64Bitmap` and less error prone, as well as faster.

GitOrigin-RevId: 862989348614520672065786c5607f0ade7a93e7
@tdas
Copy link
Contributor

tdas commented Nov 16, 2022

Can you crosslink to the issue in the description as well?

@vkorukanti
Copy link
Collaborator Author

Can you crosslink to the issue in the description as well?

Linked to the main issue which also contains the small project plan

}

/** Add all values in `range` to the container. */
def addRange(range: NumericRange[Long]): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of methods... do you think you will need all of these methods for the full DV implementation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are going to be mostly used in tests when you can generate a custom DV for a given set of indices. Let me mark it as for tests only.

@vkorukanti vkorukanti changed the title Add RoaringBitmapArray as replacement for Roaring64Bitmap Add RoaringBitmapArray to store indices of rows deleted in a data file Nov 17, 2022
GitOrigin-RevId: 965d1294798dae96f881f6c1ce7fdd46a55f8eab
Copy link
Contributor

@tdas tdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure if all of the method you have added are going to useful, but I trust your plan in the issue. At the end of all your PRs, it will be good to double check whether all the methods are actually being used and remove any crud.

@vkorukanti vkorukanti added this to the 2.2.0 milestone Dec 5, 2022
@vkorukanti vkorukanti deleted the dv1 branch October 2, 2023 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants