Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Secondary index #360

Closed
yingfeng opened this issue Dec 25, 2023 · 3 comments
Closed

[Feature Request]: Secondary index #360

yingfeng opened this issue Dec 25, 2023 · 3 comments
Assignees
Labels
feature request New feature or request

Comments

@yingfeng
Copy link
Member

Secondary index is used for numeric filtering. It is composed of two parts:

  1. The data of each numeric column is stored in an inverted sorted form, with a compressed format.
  2. Another in-memory part of index data which is based on pgm, it could provide very fast approximate range query with bounded error, which has already been added into the repository.

The mechanism of range filtering of secondary index is as follows:

  1. Query the pgm index to get the bounded range.
  2. Scan the raw index data according to bounded range to get the RowIDs of the query filter.
@yingfeng yingfeng mentioned this issue Dec 25, 2023
78 tasks
@yingfeng yingfeng assigned KKould and unassigned KKould Jan 2, 2024
@yangzq50
Copy link
Contributor

yangzq50 commented Jan 18, 2024

My proposal for the implementation:

  1. When building secondary index for a numeric column in the table, we can build one secondary index structure for each segment of the table.

  2. Every secondary index structure should contain 3 parts:
    2.1. pgm: a data structure which provides approximate range
    2.2. vector_value: sorted array of all the values of the chosen column in the segment (necessary for getting exact range)
    2.3. vector_offset: array of segment offsets for each value in vector_value (necessary for getting filtering results)

  3. vector_value and vector_offset can be split into parts with max size DEFAULT_BLOCK_CAPACITY

  4. Steps for range search of one column with secondary index:
    4.1. Get an approximate range from pgm
    4.2. Get the exact range by binary search in vector_value
    4.3. Generate a bitmask (when the size of the exact range is too big) or a vector of chosen offsets

@JinHai-CN JinHai-CN added the feature request New feature or request label Jan 18, 2024
@yingfeng
Copy link
Member Author

4.2 is not necessary because pgm has already told the approximate range, as a result 4.2 is acturally a linear scan

4. Steps for range search of one column with secondary index:
   4.1. Get an approximate range from pgm
   4.2. Get the exact range by binary search in vector_value
   4.3. Generate a bitmask (when the size of the exact range is too big) or a vector of chosen offsets

@yangzq50
Copy link
Contributor

subtask #476 and subtask #477 are deferred to #637

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants