Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry pick2 #9

Open
wants to merge 11 commits into
base: feature/vector-index
Choose a base branch
from
Open

Cherry pick2 #9

wants to merge 11 commits into from

Conversation

CalvinNeo
Copy link
Owner

@CalvinNeo CalvinNeo commented Aug 13, 2024

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • New Features

    • Introduced a new contains method in the LRUCache for non-intrusive key checks.
    • Added a VectorIndexCache for managing vector index entries with LRU caching.
    • Implemented a threading mechanism for cache management and performance context tracking for vector searches.
  • Bug Fixes

    • Enhanced error handling and logging for file caching operations with S3.
  • Documentation

    • Updated comments and documentation to reflect changes in vector index handling and caching mechanisms.
  • Refactor

    • Restructured vector index management and introduced new data types for improved clarity and maintainability.
    • Simplified method signatures for better readability and alignment with the new architecture.
  • Tests

    • Expanded test coverage for vector indexing and caching scenarios, including S3 interactions and performance metrics.

breezewish and others added 11 commits August 12, 2024 15:54
Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Copy link

coderabbitai bot commented Aug 13, 2024

Walkthrough

The recent changes focus on enhancing the vector indexing and file caching mechanisms within the database management system. Key modifications include the introduction of new classes and structures for better handling of vector indices, improvements in performance tracking for file caching operations, and updates to ensure more robust error handling. These updates streamline the codebase, enhance type safety, and improve overall efficiency in managing data retrieval and caching.

Changes

Files Change Summary
dbms/src/Common/FailPoint.cpp Added a new fail point for testing related to file cache downloads.
dbms/src/Common/LRUCache.h Introduced a contains method to check for key existence in the LRU cache without altering its order.
dbms/src/Interpreters/Context.cpp Refactored vector index cache handling methods, changing cache size parameter to number of cache entities.
dbms/src/Interpreters/Context.h Updated setVectorIndexCache method to reflect the new parameter name related to cached entities.
dbms/src/Server/Server.cpp Replaced vector index cache size definition with a new parameter specifying the number of cached entities.
dbms/src/Storages/DeltaMerge/DMContext.h Added an optional parameter to the DMContext constructor for better flexibility.
dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h Changed type of vector_index in ColumnDefine from VectorIndexInfo to VectorIndexDefinition.
dbms/src/Storages/DeltaMerge/File/ColumnStat.h Updated vector_index type to VectorIndexFileProps, reflecting changes in indexing data structures.
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp Introduced new input stream class for efficient vector index data retrieval.
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h Refactored class methods and member variables for improved clarity and modularity.
dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp Updated vector index handling logic, switching to a builder pattern for better encapsulation.
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp Modified function signatures and logic for better handling of vector indices.
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp Refactored to introduce VectorIndexBuilder and VectorIndexViewer, improving modularity.
dbms/src/TiDB/Schema/TiDB.cpp Enhanced vector index parsing and validation, adopting new protocol buffer definitions.
dbms/src/S3/FileCache.cpp Improved file caching mechanisms, adding methods for better file download management from S3.
dbms/src/S3/FileCachePerf.cpp Introduced thread-local performance tracking for file caching operations.
dbms/src/S3/FileCachePerf.h Defined a performance context struct for file caching metrics, enhancing performance analysis.
dbms/src/TiDB/Schema/TiDB.h Changed vector_index type in ColumnInfo to VectorIndexDefinitionPtr, aligning with new structures.
dbms/src/TiDB/Schema/VectorIndex.h Removed obsolete enums and restructured vector index representation for clarity and type safety.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Cache
    participant FileSystem

    User->>Cache: Request File
    Cache->>FileSystem: Check File Status
    alt File Not Cached
        FileSystem->>Cache: Download File
        Cache->>User: Return File
    else File Cached
        Cache->>User: Return Cached File
    end
Loading

🐰 In the meadow, hopping around,
New changes sprout from the ground!
Vector indices dance with glee,
Caches and files, as easy as can be!
So let’s celebrate with a cheer,
For a brighter database year! 🎉


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@CalvinNeo
Copy link
Owner Author

@coderabbitai review

Copy link

coderabbitai bot commented Aug 13, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (9)
dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1)

44-44: Ensure Serialization and Deserialization of std::optional<dtpb::VectorIndexFileProps>

The toProto and mergeFromProto methods in ColumnStat.h do not currently handle the std::optional<dtpb::VectorIndexFileProps>. Ensure that these methods are updated to serialize and deserialize this field correctly to prevent potential issues.

  • File: dbms/src/Storages/DeltaMerge/File/ColumnStat.h
  • Methods: toProto, mergeFromProto
Analysis chain

Verify the impact on serialization and deserialization.

The change to std::optional<dtpb::VectorIndexFileProps> may affect how vector_index is serialized and deserialized. Ensure that all related methods handle the new type correctly.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the impact of the type change on serialization and deserialization.

# Test: Search for methods related to serialization and deserialization of `ColumnStat`.
rg --type cpp -A 3 $'toProto|mergeFromProto|serializeToBuffer|parseFromBuffer'

Length of output: 8784


Script:

#!/bin/bash
# Extract the implementation of `toProto` and `mergeFromProto` methods from `ColumnStat.h`
ast-grep --lang cpp --pattern $'dtpb::ColumnStat toProto() const { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.h
ast-grep --lang cpp --pattern $'void mergeFromProto(const dtpb::ColumnStat & proto) { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.h

Length of output: 262


Script:

#!/bin/bash
# Use ripgrep to search for the definitions of `toProto` and `mergeFromProto` methods in `ColumnStat.h`
rg --type cpp 'dtpb::ColumnStat toProto\(\) const' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10
rg --type cpp 'void mergeFromProto\(const dtpb::ColumnStat & proto\)' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10

Length of output: 1164

dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)

104-108: Update documentation for addStreams method.

The addStreams method signature has been updated to use TiDB::VectorIndexDefinitionPtr instead of TiDB::VectorIndexInfoPtr. Ensure that any related documentation or comments are updated to reflect this change.

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)

83-103: Consider Logging Additional Information in Destructor.

The destructor logs the completion of reading a DMFile with vector index. Consider logging additional context, such as the number of rows processed or any errors encountered, to enhance traceability.

-  "Finished read DMFile with vector index for column dmf_{}/{}(id={}), "
+  "Finished read DMFile with vector index for column dmf_{}/{}(id={}), processed_rows={}, "

106-123: Clarify Filtering Logic in read Method.

The read method contains logic for filtering based on a FilterPtr. Ensure that the comments and logic are clear, especially regarding the implications of return_filter being false.

-  // If return_filter == false, we must filter by ourselves.
+  // If return_filter == false, apply filtering within this method.

170-215: Consider Caching Results in readByIndexReader.

The method readByIndexReader iterates over packs and reads vector columns. Consider caching results if this method is called frequently with the same parameters to improve performance.


258-271: Add Logging for Load Status in load Method.

Consider adding logging to indicate when the load method is invoked and completed, which can be useful for debugging and performance monitoring.

+  LOG_DEBUG(log, "Loading vector index and search results");

477-485: Clarify the Purpose of getPackIdFromBlock.

The method getPackIdFromBlock retrieves the pack ID from a block. Ensure that the purpose and usage of this method are well-documented, especially regarding assumptions about block alignment.

-  // The start offset of a block is ensured to be aligned with the pack.
+  // Assumes the start offset of a block is aligned with the pack.
dbms/src/Storages/S3/FileCache.cpp (2)

68-107: Use named constants for timeout values.

The method uses magic numbers (30 and 300 seconds) for timeouts. Consider defining named constants for these values to improve readability and maintainability.

constexpr auto WAIT_TIMEOUT = std::chrono::seconds(30);
constexpr auto MAX_WAIT_TIMEOUT = std::chrono::seconds(300);

160-176: Enhance logging for file download attempts.

Consider adding more detailed logging to trace the execution flow, especially when a file is removed from the cache due to inconsistencies.

LOG_DEBUG(log, "Attempting to download file for local read: {}", s3_fname.toFullKey());
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 32d911b and ff13cae.

Files selected for processing (36)
  • contrib/tipb (1 hunks)
  • dbms/src/Common/FailPoint.cpp (1 hunks)
  • dbms/src/Common/LRUCache.h (1 hunks)
  • dbms/src/Interpreters/Context.cpp (3 hunks)
  • dbms/src/Interpreters/Context.h (1 hunks)
  • dbms/src/Server/Server.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/DMContext.h (2 hunks)
  • dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (5 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (4 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4 hunks)
  • dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (5 hunks)
  • dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (4 hunks)
  • dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (3 hunks)
  • dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2 hunks)
  • dbms/src/Storages/DeltaMerge/ScanContext.h (6 hunks)
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (26 hunks)
  • dbms/src/Storages/S3/FileCache.cpp (8 hunks)
  • dbms/src/Storages/S3/FileCache.h (6 hunks)
  • dbms/src/Storages/S3/FileCachePerf.cpp (1 hunks)
  • dbms/src/Storages/S3/FileCachePerf.h (1 hunks)
  • dbms/src/TiDB/Schema/TiDB.cpp (3 hunks)
  • dbms/src/TiDB/Schema/TiDB.h (1 hunks)
  • dbms/src/TiDB/Schema/VectorIndex.h (1 hunks)
Files skipped from review due to trivial changes (1)
  • contrib/tipb
Additional comments not posted (76)
dbms/src/Storages/S3/FileCachePerf.cpp (1)

1-22: LGTM!

The file sets up a thread-local performance context for file caching. The setup is straightforward and aligns with typical practices for performance tracking.

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1)

1-22: LGTM!

The file sets up a thread-local performance context for vector search operations. The setup is straightforward and aligns with typical practices for performance tracking.

dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1)

Line range hint 1-26:
Simplification acknowledged and approved.

The removal of the vector member and its constructor simplifies the RSIndex structure, focusing on the essential components. This change enhances clarity and reduces unnecessary complexity.

dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)

24-28: LGTM! Modular design enhances clarity and maintainability.

The introduction of VectorIndexBuilder and VectorIndexViewer classes, along with their smart pointer types, suggests a more modular approach. This can improve code clarity and maintainability by separating concerns between building and viewing vector indices.

dbms/src/Storages/S3/FileCachePerf.h (1)

1-37: LGTM! Effective use of thread-local storage for performance tracking.

The FileCachePerfContext structure provides a clear way to track performance metrics related to file caching. Using thread_local for file_cache ensures that performance data is managed per thread, which is suitable for concurrent environments.

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1)

1-37: LGTM! Efficient thread-local performance tracking for vector search.

The VectorSearchPerfContext structure is well-designed for tracking performance metrics in vector search operations. The use of thread_local for vector_search ensures that performance data is isolated per thread, which is advantageous in concurrent processing.

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (2)

27-39: Review VectorIndexHNSWBuilder class changes.

The class VectorIndexHNSWBuilder has been introduced, replacing USearchIndexWithSerialization. The constructor now takes a VectorIndexDefinitionPtr, which suggests a design shift towards using definitions for initialization. Ensure that the new design aligns with the overall architecture and that the addBlock and save methods are correctly implemented.

Verification successful

VectorIndexHNSWBuilder Class Implementation Verified

The VectorIndexHNSWBuilder class and its methods addBlock and save are implemented correctly, with appropriate type checks and error handling. The integration with the rest of the codebase is consistent with the new design. No issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexHNSWBuilder` class.

# Test: Search for the class definition and usage. Expect: Consistent usage with new constructor.
rg --type cpp -A 5 $'class VectorIndexHNSWBuilder'

Length of output: 652


Script:

#!/bin/bash
# Description: Find all usages of the `VectorIndexHNSWBuilder` class to verify its integration and usage.

# Test: Search for all instances where `VectorIndexHNSWBuilder` is used in the codebase.
rg --type cpp 'VectorIndexHNSWBuilder'

Length of output: 867


Script:

#!/bin/bash
# Description: Verify the implementation of the `addBlock` and `save` methods in `Index.cpp`.

# Test: Retrieve the implementation details of `addBlock` and `save` methods.
rg --type cpp -A 10 'void VectorIndexHNSWBuilder::addBlock' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp
rg --type cpp -A 10 'void VectorIndexHNSWBuilder::save' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp

Length of output: 1248


41-55: Review VectorIndexHNSWViewer class introduction.

The VectorIndexHNSWViewer class extends VectorIndexViewer and introduces a static view method. The search method has been updated to remove the SearchStatistics parameter, focusing on query information and valid rows. Ensure that this change does not impact the functionality negatively and aligns with the intended design.

Verification successful

Removal of SearchStatistics Parameter Verified

The removal of the SearchStatistics parameter from the search method in the VectorIndexHNSWViewer class does not appear to impact functionality negatively. There are no references to SearchStatistics in the codebase, indicating it was not crucial to the method's operation. This change aligns with the intended design.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexHNSWViewer` class.

# Test: Search for the class definition and usage. Expect: Proper integration with the new search method.
rg --type cpp -A 5 $'class VectorIndexHNSWViewer'

Length of output: 650


Script:

#!/bin/bash
# Description: Search for the usage of the `search` method in the `VectorIndexHNSWViewer` class.

# Test: Find all occurrences of `search` method calls. Expect: Understand how the method is used and ensure the change is safe.
rg --type cpp 'search\(' -A 3

Length of output: 22355


Script:

#!/bin/bash
# Description: Search for any references to `SearchStatistics` in relation to the `search` method.

# Test: Find all occurrences of `SearchStatistics` to understand its previous usage and impact.
rg --type cpp 'SearchStatistics'

Length of output: 32


Script:

#!/bin/bash
# Description: Search for comments or documentation mentioning `SearchStatistics`.

# Test: Find any comments or documentation that reference `SearchStatistics`.
rg --type cpp 'SearchStatistics' -g '*.h' -g '*.cpp' -A 3 -B 3

Length of output: 62

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)

30-77: Review VectorIndexCache class implementation.

The VectorIndexCache class uses an LRUCache for caching VectorIndexViewer objects. It includes mechanisms for cleaning outdated cache entries and managing shutdown operations. Ensure that the use of mutexes and condition variables is correct and that the cleaning logic is efficient and does not introduce race conditions.

Verification successful

Integration and Usage of VectorIndexCache Verified

The VectorIndexCache class is well-integrated within the codebase, with clear methods for managing its lifecycle. It is used in various components, including the server setup and context management, and its functionality is covered by unit tests.

  • Server.cpp: Sets the vector index cache in the global context.
  • Context.cpp: Provides methods to set, get, and drop the vector index cache.
  • Test files confirm the cache's functionality is tested.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexCache` class.

# Test: Search for the class definition and usage. Expect: Proper integration with cache management logic.
rg --type cpp -A 5 $'class VectorIndexCache'

Length of output: 1190


Script:

#!/bin/bash
# Description: Search for the usage of `VectorIndexCache` to understand its integration and usage.

# Test: Find all occurrences of `VectorIndexCache` to verify its integration.
rg --type cpp 'VectorIndexCache'

Length of output: 3577

dbms/src/TiDB/Schema/VectorIndex.h (2)

27-35: Review VectorIndexDefinition struct introduction.

The VectorIndexDefinition struct replaces previous enums and structures, encapsulating index properties. Ensure that the struct is comprehensive and correctly represents the necessary properties. The TODO comment suggests potential future additions, which should be tracked.

Verification successful

The VectorIndexDefinition struct is consistently integrated.

The VectorIndexDefinition struct is used across various components and is tested, indicating proper integration. The struct's usage in different contexts, including formatters for output, suggests it is well-incorporated into the codebase. No issues found with its current implementation.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexDefinition` struct.

# Test: Search for the struct definition and usage. Expect: Consistent usage with new properties.
rg --type cpp -A 5 $'struct VectorIndexDefinition'

Length of output: 497


Script:

#!/bin/bash
# Description: Search for usage of `VectorIndexDefinition` in the codebase to verify its integration.

# Test: Find all occurrences of `VectorIndexDefinition` usage.
rg --type cpp 'VectorIndexDefinition'

Length of output: 3281


Line range hint 48-73:
Review fmt::formatter for VectorIndexDefinition.

The fmt::formatter specialization for VectorIndexDefinition has been updated. Ensure that the formatting logic correctly reflects the new struct and that it integrates well with the existing logging or output mechanisms.

dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (3)

Line range hint 33-41:
Refactor to VectorIndexBuilder: Ensure type support logic is correct.

The isSupportedType method has been moved to VectorIndexBuilder. This refactoring aligns with the new class responsibilities. Verify that the logic for checking supported types remains correct and comprehensive.


59-73: Refactor to VectorIndexViewer: Ensure view logic is correct.

The view method replaces load and now uses VectorIndexFileProps. This refactoring is intended to improve clarity and modularity. Verify that the logic for viewing vector index files is correctly implemented and that all relevant cases are handled.


42-56: Refactor to VectorIndexBuilder: Check method signature and logic.

The create method now uses VectorIndexDefinitionPtr instead of VectorIndexInfo. This change enhances encapsulation and type safety. Ensure that all usages of create are updated to match the new signature and that the switch-case logic correctly handles all necessary cases.

Verification successful

Verification Successful: create Method Signature Updated Correctly

The usage of VectorIndexBuilder::create has been successfully updated to the new signature using VectorIndexDefinitionPtr. All identified usages are consistent with this change.

  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.h: The do_vector_index parameter is correctly typed as TiDB::VectorIndexDefinitionPtr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `create` method are updated to the new signature.

# Test: Search for the `create` method usage. Expect: Only occurrences of the new signature.
rg --type cpp -A 5 $'VectorIndexBuilder::create'

Length of output: 1216


Script:

#!/bin/bash
# Description: Check the type of `do_vector_index` in `DMFileWriter.h` to verify it matches the new signature.
rg --type cpp 'do_vector_index' dbms/src/Storages/DeltaMerge/File/DMFileWriter.h -A 5 -B 5

Length of output: 1700

dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (2)

31-55: Introduction of VectorIndexBuilder: Review class structure and methods.

The VectorIndexBuilder class is introduced to handle the construction of vector indices. The methods create, isSupportedType, addBlock, and save are aligned with its responsibilities. Ensure that the class design supports efficient and correct index building.


58-90: Introduction of VectorIndexViewer: Review class structure and methods.

The VectorIndexViewer class is introduced to handle viewing vector index files. The methods view, search, and get are aligned with its responsibilities. Ensure that the class design supports efficient and correct index viewing.

dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (3)

47-47: Update to VectorIndexViewerPtr: Ensure index interaction is correct.

The index member variable now uses VectorIndexViewerPtr, reflecting the new class structure. Verify that all interactions with the index are correctly updated to use the viewer.


56-56: Update calcResultsByPack method signature: Ensure efficient result handling.

The method now accepts a reference to a vector of VectorIndexViewer::Key, improving efficiency by avoiding unnecessary copies. Ensure that all calls to this method are updated to match the new signature.


65-71: Update constructor: Ensure alignment with new indexing strategy.

The constructor now takes VectorIndexViewerPtr and a reference to a vector of VectorIndexViewer::Key. This change aligns with the new indexing strategy. Ensure that all instances of this class are constructed using the updated parameters.

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (3)

93-98: Destructor implementation looks good.

The destructor correctly handles shutting down the cleaner thread and joining it.


86-91: Verify thread management.

Ensure that the cleaner thread is properly managed and terminated to prevent resource leaks.

Verification successful

Thread management is correctly implemented in VectorIndexCache.

The destructor of VectorIndexCache ensures proper termination of the cleaner thread by setting a shutdown flag, notifying the condition variable, and joining the thread. This prevents resource leaks and ensures clean shutdown.

  • Destructor: VectorIndexCache::~VectorIndexCache() sets is_shutting_down = true, notifies shutdown_cv, and calls cleaner_thread.join().
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify proper thread management in the constructor of `VectorIndexCache`.

# Test: Check for proper thread management and termination.
rg --type cpp 'VectorIndexCache' -A 5

Length of output: 16167


63-83: Ensure proper shutdown handling.

The loop correctly handles shutdown signals, but ensure that any resources are properly released if exceptions occur.

Verification successful

Proper shutdown handling and thread management confirmed.

The cleanOutdatedLoop method and its associated thread are correctly managed. The destructor properly sets the shutdown flag, notifies the condition variable, and joins the thread, ensuring a clean and safe shutdown process. No further action is needed.

  • Destructor: VectorIndexCache::~VectorIndexCache() properly joins the cleaner_thread.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify proper shutdown handling in the `cleanOutdatedLoop`.

# Test: Check for any resources that might not be released upon exceptions.
rg --type cpp 'cleanOutdatedLoop' -A 5

Length of output: 1429


Script:

#!/bin/bash
# Description: Check if `cleaner_thread` is properly managed in the `VectorIndexCache` class.

# Test: Look for thread management (join or detach) in the destructor or elsewhere in the class.
rg --type cpp 'cleaner_thread' -A 10

Length of output: 1717

dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (2)

90-94: Addition of VectorIndexFileProps looks good.

The new message is well-defined and should enhance the handling of vector index properties.


65-67: Verify backward compatibility.

The change from ColumnVectorIndexInfo to VectorIndexFileProps may affect backward compatibility. Ensure that the reserved field is correctly handled in the codebase.

Verification successful

Backward compatibility maintained.

The change from ColumnVectorIndexInfo to VectorIndexFileProps does not impact backward compatibility, as there are no remaining references to ColumnVectorIndexInfo in the codebase. The use of the reserved field 101 ensures that the field number is not reused.

  • The reserved field 101 is correctly implemented in dmfile.proto.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify backward compatibility for changes in `ColumnStat`.

# Test: Check for usage of the reserved field and ensure compatibility.
rg --type proto 'reserved 101' -A 5

Length of output: 67


Script:

#!/bin/bash
# Description: Verify backward compatibility for changes in `ColumnStat`.

# Test: Check for usage of the reserved field and ensure compatibility.
rg 'reserved 101' --glob '*.proto' -A 5

Length of output: 540


Script:

#!/bin/bash
# Description: Check for references to `ColumnVectorIndexInfo` in the codebase to ensure backward compatibility.

# Test: Search for `ColumnVectorIndexInfo` in the codebase.
rg 'ColumnVectorIndexInfo'

Length of output: 26

dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2)

96-96: Verify impact of access modifier change for serializeColumnFileSet.

Changing the access modifier from public to private may affect accessibility. Ensure that this change is intentional and does not introduce any issues.

Verification successful

Access modifier change for serializeColumnFileSet is safe.

The method serializeColumnFileSet is only used within the Serializer.cpp file and does not have external dependencies. Changing its access modifier to private improves encapsulation without affecting other parts of the codebase.

  • No external usage of serializeColumnFileSet found outside Serializer.cpp.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the impact of changing the access modifier for `serializeColumnFileSet`.

# Test: Check for any unintended consequences of making `serializeColumnFileSet` private.
rg --type cpp 'serializeColumnFileSet' -A 5

Length of output: 4711


96-96: Verify impact of access modifier change for serializeSegment.

Changing the access modifier from private to public may affect encapsulation. Ensure that this change is intentional and does not introduce any issues.

dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (2)

119-119: Type consistency maintained.

The update to use VectorIndexViewer::Key for rowid ensures consistency with the new type usage across the file.


37-47: Consider the performance impact of the runtime check.

The addition of a runtime check to ensure that sorted_results is sorted improves robustness. However, this could introduce a performance overhead in critical paths. Consider evaluating the necessity of this check in production environments or providing a build configuration to disable it.

dbms/src/Storages/DeltaMerge/DMContext.h (1)

136-137: Verify the usage of the new scan_context parameter.

The addition of the scan_context parameter enhances flexibility, but ensure that its usage is correctly handled throughout the codebase.

dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4)

75-75: Use builder pattern for vector index creation.

The vector_index is now created using VectorIndexBuilder::create, which enhances flexibility. Ensure that the builder pattern is correctly implemented and utilized.


101-101: Update member variable type.

The vector_index member variable type has been updated to VectorIndexBuilderPtr. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Update member variable type verified.

The vector_index member variable type has been updated to VectorIndexBuilderPtr in the relevant parts of the codebase. No inconsistencies were found in the application of this change.

  • The vector_index member variable in DMFileWriter.h has been correctly updated to VectorIndexBuilderPtr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `vector_index` member variable are updated to use `VectorIndexBuilderPtr`.

# Test: Search for `vector_index` usage. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'vector_index'

Length of output: 30888


57-57: Update constructor parameter type.

The constructor now uses TiDB::VectorIndexDefinitionPtr instead of TiDB::VectorIndexInfoPtr. Ensure that all instances of this class are updated to pass the correct type.


163-163: Update method parameter type.

The addStreams method now uses TiDB::VectorIndexDefinitionPtr instead of TiDB::VectorIndexInfoPtr. Ensure that all calls to this method are updated to pass the correct type.

Verification successful

All addStreams method calls use the correct parameter type.

The addStreams method now correctly uses TiDB::VectorIndexDefinitionPtr for the do_vector_index parameter, and all relevant calls to this method have been updated accordingly. The verification confirms that the type change has been properly implemented.

  • DMFileWriter.cpp: The call to addStreams uses cd.vector_index, which is correctly defined as TiDB::VectorIndexDefinitionPtr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all calls to `addStreams` use the correct type for `do_vector_index`.

# Test: Search for `addStreams` calls. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'addStreams'

Length of output: 1531


Script:

#!/bin/bash
# Use ast-grep to find all calls to the addStreams method and check the type of the do_vector_index parameter.
ast-grep --lang cpp --pattern 'addStreams($_, $_, $_, $_)' 

Length of output: 60


Script:

#!/bin/bash
# Use rg to find all calls to the addStreams method and include context lines to check parameter types.
rg --type cpp 'addStreams' -A 3 -B 2

Length of output: 1678


Script:

#!/bin/bash
# Search for the definition of `cd` to confirm the type of `cd.vector_index`.
rg --type cpp 'cd\s*=' -A 3 -B 2

Length of output: 11390

dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (2)

95-95: Update member variable type.

The vector_index member variable type has been updated to TiDB::VectorIndexDefinitionPtr. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Verification complete: Member variable type updated consistently.

The vector_index member variable type change to TiDB::VectorIndexDefinitionPtr in ColumnDefine has been consistently applied across the codebase. No discrepancies found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `vector_index` member variable in `ColumnDefine` are updated to use `TiDB::VectorIndexDefinitionPtr`.

# Test: Search for `vector_index` usage in `ColumnDefine`. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'ColumnDefine'

Length of output: 324094


102-102: Update constructor parameter type.

The constructor now uses TiDB::VectorIndexDefinitionPtr for vector_index_. Ensure that all instances of this struct are updated to pass the correct type.

Verification successful

No issues found with ColumnDefine constructor usage. The search did not find any instances of ColumnDefine being instantiated with the vector_index_ parameter, indicating that the codebase does not directly use this parameter in constructor calls.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all instantiations of the `ColumnDefine` struct use the correct type for `vector_index_`.

# Test: Search for `ColumnDefine` instantiations. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'ColumnDefine'

Length of output: 324094


Script:

#!/bin/bash
# Search for all constructor calls for `ColumnDefine` that include the `vector_index_` parameter.
ast-grep --lang cpp --pattern 'ColumnDefine($_, $_, $_, $_, $_)' --json

Length of output: 76

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (7)

96-96: Override destructor.

The destructor now explicitly uses the override keyword, which improves clarity and ensures that the base class has a virtual destructor.


108-108: Override method signature.

The read method now explicitly uses the override keyword, which improves clarity and ensures that the base class has a virtual method.


144-144: Declare method without implementation.

The readByIndexReader method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.


148-148: Declare method without implementation.

The readByFollowingOtherColumns method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.


182-182: Update member variable type.

The vec_index member variable type has been updated to VectorIndexViewerPtr. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Member variable type update verified.

The vec_index member variable type has been correctly updated to VectorIndexViewerPtr in all relevant parts of the codebase. No further action is needed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `vec_index` member variable are updated to use `VectorIndexViewerPtr`.

# Test: Search for `vec_index` usage. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'vec_index'

Length of output: 15367


186-186: Update member variable name and purpose.

The valid_rows_after_search variable has been replaced with sorted_results, indicating a change in how results are managed. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Change from valid_rows_after_search to sorted_results is correctly reflected.

The sorted_results variable is used consistently across the codebase, and there are no remaining references to valid_rows_after_search. The transition appears to be complete and correctly implemented.

  • Files checked:
    • DMFileWithVectorIndexBlockInputStream.h
    • VectorColumnFromIndexReader.h
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `sorted_results` member variable are updated accordingly.

# Test: Search for `sorted_results` usage. Expect: Only occurrences with the new name and purpose.
rg --type cpp -A 5 $'sorted_results'

Length of output: 11304


113-113: Declare method without implementation.

The readImpl method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.

Verification successful

Implementation Found for readImpl Method

The readImpl method declared in DMFileWithVectorIndexBlockInputStream.h is implemented in the corresponding source file DMFileWithVectorIndexBlockInputStream.cpp. This addresses the concern regarding the method's implementation. No further action is required.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `readImpl` is provided in the corresponding source file.

# Test: Search for `readImpl` implementation. Expect: Implementation in the corresponding source file.
rg --type cpp -A 5 $'readImpl'

Length of output: 88199

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5)

38-54: LGTM!

The getUSearchMetricKind function correctly maps distance metrics and handles unsupported metrics with a runtime check.


58-64: Constructor looks good!

The VectorIndexHNSWBuilder constructor correctly initializes the index with the appropriate metric and checks the index kind.


Line range hint 67-106: Robust method implementation!

The addBlock method effectively handles memory allocation and data validity checks, ensuring robust data addition to the index.


110-113: Save method is well-implemented!

The save method effectively handles errors using runtime checks, ensuring the index is saved correctly.


Line range hint 134-198: Comprehensive search method!

The search method includes detailed checks and exception handling, ensuring robust search functionality. Note the TODO comment regarding efSearch support.

dbms/src/Common/LRUCache.h (1)

74-80: Thread-safe contains function!

The contains function correctly checks for the presence of a key using a lock guard, ensuring thread safety without altering the LRU order.

dbms/src/Storages/S3/FileCache.h (4)

228-233: Efficient caching strategy!

The downloadFileForLocalRead method effectively ensures that files in use are not evicted prematurely, enhancing caching strategy.


249-253: Robust retrieval strategy!

The getOrWait method provides a robust mechanism to ensure file availability, handling waiting and foreground downloading effectively.


256-256: Foreground download method looks good!

The fgDownload method is well-implemented to handle file downloads in the foreground, ensuring correct processing.


75-76: Verify waitForNotEmpty implementation.

Ensure that the waitForNotEmpty method correctly blocks until the segment is populated, maintaining thread safety.

Verification successful

waitForNotEmpty Method Implementation Verified

The waitForNotEmpty method in FileSegment is correctly implemented to block until the segment is populated, using mutexes and condition variables to ensure thread safety. The method includes a timeout to prevent indefinite blocking.

  • File: dbms/src/Storages/S3/FileCache.cpp
  • Implementation Details: Uses a mutex (mtx) and a condition variable (cv_ready) to wait for the status to change from Empty.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `waitForNotEmpty` method in `FileSegment`.

# Test: Check if `waitForNotEmpty` is implemented and used correctly.
ast-grep --lang cpp --pattern $'Status FileSegment::waitForNotEmpty() {
  $$$
}'

Length of output: 81


Script:

#!/bin/bash
# Description: Search for any method named `waitForNotEmpty` in the entire codebase to locate its implementation or usage.

# Search for `waitForNotEmpty` method across all C++ files.
rg 'waitForNotEmpty' --type cpp -A 5

Length of output: 2284

dbms/src/TiDB/Schema/TiDB.h (1)

209-209: Verify the impact of changing vector_index type.

The type of vector_index in the ColumnInfo struct has been changed from VectorIndexInfoPtr to VectorIndexDefinitionPtr. This change may affect how vector indices are used throughout the codebase. Ensure that all usages of vector_index are updated to accommodate this new type.

dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)

64-64: Ensure VectorIndexBuilder::isSupportedType is correctly implemented.

The change from VectorIndex::isSupportedType to VectorIndexBuilder::isSupportedType suggests a restructuring of vector index handling. Verify that the new method is implemented correctly and covers all necessary type checks.

Verification successful

Verification of VectorIndexBuilder::isSupportedType Implementation

The VectorIndexBuilder::isSupportedType method is correctly implemented. It checks for nullable types and supports DataTypeFloat32. This implementation aligns with its intended use in DMFileWriter.cpp to ensure only supported types are processed for vector indexing.

  • File: dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp
  • Method: bool VectorIndexBuilder::isSupportedType(const IDataType & type)
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `VectorIndexBuilder::isSupportedType`.

# Test: Search for the implementation of `isSupportedType`. Expect: Correct type checks.
ast-grep --lang cpp --pattern $'class VectorIndexBuilder {
  $$$
  static bool isSupportedType($_) {
    $$$
  }
  $$$  
}'

Length of output: 469


Script:

#!/bin/bash
# Search for the definition and usage of `isSupportedType` in the codebase
rg 'isSupportedType' --context 5 --type cpp

# Search for the definition of the `VectorIndexBuilder` class
rg 'class VectorIndexBuilder' --context 10 --type cpp

Length of output: 5909

dbms/src/Storages/DeltaMerge/ScanContext.h (2)

93-93: Initialize total_vector_idx_load_from_s3 correctly.

The new atomic variable total_vector_idx_load_from_s3 is introduced to track vector index load times from S3. Ensure that it is correctly initialized and used throughout the class.


150-150: Verify integration of total_vector_idx_load_from_s3 in serialization and merging.

The new variable total_vector_idx_load_from_s3 is integrated into serialization, deserialization, and merging logic. Verify that these changes are correctly implemented and that the variable is accurately tracked and aggregated.

Also applies to: 202-202, 262-262, 314-314

Verification successful

Integration of total_vector_idx_load_from_s3 is correctly implemented.

The variable total_vector_idx_load_from_s3 is properly integrated into serialization and merging logic. It is serialized and deserialized using tiflash_scan_context_pb, and its value is aggregated during merging. Additionally, its behavior is verified through multiple tests. No issues found with its integration.

  • Serialization and Deserialization: Properly handled via tiflash_scan_context_pb.
  • Merging Logic: Correctly aggregated with other contexts.
  • Testing: Adequately tested in gtest_dm_vector_index.cpp.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of `total_vector_idx_load_from_s3` in serialization, deserialization, and merging.

# Test: Search for `total_vector_idx_load_from_s3` usage in serialization and merging. Expect: Correct integration.
rg --type cpp --word-regexp 'total_vector_idx_load_from_s3'

Length of output: 2866

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)

125-168: Optimize Filter Assignment in readImpl Method.

The filter assignment in readImpl could be optimized by using a more efficient algorithm or data structure, especially if sorted_results is large.

-  auto it = std::lower_bound(sorted_results.begin(), sorted_results.end(), start_offset);
-  while (it != sorted_results.end())
-  {
-      auto rowid = *it;
-      if (rowid >= max_rowid_exclusive)
-          break;
-      filter[rowid - start_offset] = 1;
-      ++it;
-  }
+  for (auto rowid : sorted_results)
+  {
+      if (rowid >= start_offset && rowid < max_rowid_exclusive)
+          filter[rowid - start_offset] = 1;
+  }

Likely invalid or redundant comment.


34-81: Ensure Proper Error Handling for Constructor Preconditions.

The constructor of DMFileWithVectorIndexBlockInputStream uses RUNTIME_CHECK to validate preconditions. Ensure that these checks are appropriate and that exceptions are handled correctly if they fail.


217-255: Ensure Consistency in readByFollowingOtherColumns.

The method readByFollowingOtherColumns assembles a block using header_layout. Ensure consistency in the layout and verify that all necessary columns are present.

Verification successful

Consistency in readByFollowingOtherColumns is ensured.

The constructor of DMFileWithVectorIndexBlockInputStream includes runtime checks to verify that header_layout contains all necessary columns. These checks ensure that the layout used in readByFollowingOtherColumns is consistent. No further action is needed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistency in the layout of `readByFollowingOtherColumns`.
# Test: Ensure that all necessary columns are present in the layout.

# Search for the usage of `header_layout` in the method.
rg --type cpp --context 5 'header_layout' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'

Length of output: 1937


273-384: Verify S3 File Cache Configuration in loadVectorIndex.

The method loadVectorIndex relies on S3 file cache. Ensure that the configuration is correct and that potential errors are handled gracefully.

Verification successful

S3 File Cache Configuration and Error Handling Verified

The loadVectorIndex method correctly verifies the presence of the S3 file cache and handles potential errors by retrying downloads and throwing exceptions if necessary. The configuration and error handling are properly implemented.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify S3 file cache configuration in `loadVectorIndex`.
# Test: Ensure that the configuration is correct and errors are handled.

# Search for S3 file cache usage in the method.
rg --type cpp --context 5 'S3' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'

Length of output: 1961


386-475: Ensure Correctness of Vector Search Logic in loadVectorSearchResult.

The method loadVectorSearchResult performs a vector search and updates the pack filter. Verify the correctness of the search logic and the impact on the pack filter.

dbms/src/Common/FailPoint.cpp (1)

109-109: Verify Integration of New Fail Point file_cache_fg_download_fail.

Ensure that the new fail point file_cache_fg_download_fail is correctly integrated and used in the codebase. Verify that it serves its intended purpose without unintended side effects.

Verification successful

Fail Point file_cache_fg_download_fail is Correctly Integrated

The fail point file_cache_fg_download_fail is properly declared, used, and tested within the codebase, indicating correct integration. No issues were found regarding its implementation or usage.

  • Declaration: dbms/src/Storages/S3/FileCache.cpp
  • Usage: Triggered in FileCache.cpp and tested in gtest_dm_vector_index.cpp
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify integration of new fail point `file_cache_fg_download_fail`.
# Test: Ensure the fail point is correctly integrated and used.

# Search for the usage of `file_cache_fg_download_fail` in the codebase.
rg --type cpp 'file_cache_fg_download_fail'

Length of output: 758

dbms/src/Interpreters/Context.h (1)

395-395: Verify Impact of Parameter Name Change in setVectorIndexCache.

The parameter name change from cache_size_in_bytes to cache_entities may impact how the method is used. Ensure that all calls to this method are updated accordingly and that the new name accurately reflects its purpose.

Verification successful

Parameter Name Change Verified: setVectorIndexCache

The parameter name change from cache_size_in_bytes to cache_entities is correctly implemented. The calls to this method have been updated, and the new name accurately reflects its purpose, as evidenced by the context and comments in the code.

  • gtest_dm_vector_index.cpp: Uses a literal value for cache entities.
  • Server.cpp: Uses a configuration value for cache entities with appropriate comments explaining the change.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify impact of parameter name change in `setVectorIndexCache`.
# Test: Ensure all calls to this method are updated and the new name reflects its purpose.

# Search for calls to `setVectorIndexCache` in the codebase.
rg --type cpp 'setVectorIndexCache'

Length of output: 420


Script:

#!/bin/bash
# Description: Gather context around calls to `setVectorIndexCache` to verify parameter name change.
# Get 5 lines of context around each call to understand usage and purpose.

# Get context around the call in `gtest_dm_vector_index.cpp`.
rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp

# Get context around the call in `Server.cpp`.
rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Server/Server.cpp

Length of output: 1392

dbms/src/Storages/S3/FileCache.cpp (1)

676-700: Foreground download logic is well-implemented.

The method effectively manages foreground downloads with appropriate logging and exception handling. The use of fail points for testing is commendable.

dbms/src/TiDB/Schema/TiDB.cpp (2)

411-417: Improved type safety and error handling for vector index.

The use of tipb::VectorIndexKind and tipb::VectorDistanceMetric, along with runtime checks, enhances type safety and robustness.


475-499: Enhanced clarity and robustness in vector index parsing.

The explicit parsing and structured initialization improve clarity and error handling in vector index management.

dbms/src/Interpreters/Context.cpp (4)

57-57: Include Directive Update Approved.

The change from VectorIndex.h to VectorIndexCache.h aligns with modifications in cache handling and improves clarity.


1388-1388: Simplified Cache Reset Approved.

The direct call to reset() on the shared pointer enhances clarity and conciseness.


1410-1410: Simplified Cache Reset Approved.

The direct call to reset() on the shared pointer enhances clarity and conciseness.


1391-1397: Method Signature Change Approved.

The parameter change from cache_size_in_bytes to cache_entities aligns with the new focus on entity count. Ensure all usages of this method are updated accordingly.

Verification successful

Method Signature Change Verified.

All usages of setVectorIndexCache have been updated to reflect the new parameter cache_entities. The changes are consistent across the codebase.

  • dbms/src/Server/Server.cpp: Usage with vec_index_cache_entities.
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp: Usage with 1000.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of `setVectorIndexCache` to ensure they match the new parameter `cache_entities`.

# Test: Search for the method usage. Expect: Only occurrences of the new parameter name.
rg --type cpp -A 3 $'setVectorIndexCache'

Length of output: 1274

dbms/src/Server/Server.cpp (1)

1442-1445: Verify the impact of changing vector index cache configuration.

The change from a byte size to an entity count for the vector index cache could have significant implications on memory management and performance. Ensure that this change is well-documented and tested to confirm it meets the desired performance and memory usage goals.

dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (5)

93-96: Addition of cdVec() method is appropriate.

The cdVec() method provides a useful utility for defining vector columns, enhancing code reusability and readability.


884-887: Update to VectorIndexDefinition is appropriate.

The change from VectorIndexInfo to VectorIndexDefinition enhances type safety and aligns with the refactoring efforts. Ensure that all usages of VectorIndexDefinition are correctly updated throughout the codebase.


1427-1907: Comprehensive test cases for S3 and file cache interactions.

The test cases cover critical scenarios, including cache eviction and S3 failures, ensuring robustness. Consider verifying the test coverage to ensure all edge cases are addressed.

Verification successful

Comprehensive test cases for S3 and file cache interactions confirmed.

The test cases in VectorIndexSegmentOnS3Test cover various scenarios, ensuring robust handling of S3 and file cache interactions. The coverage appears comprehensive.

  • FileCacheNotEnabled
  • ReadWithoutIndex
  • ReadFromIndex
  • FileCacheEvict
  • FileCacheEvictAndVectorCacheDrop
  • FileCacheDeleted
  • FileCacheDeletedAndVectorCacheDrop
  • ConcurrentDownloadFromS3
  • S3Failure
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify test coverage for S3 and file cache interactions.

# Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage.
rg --type cpp --type c++ --type c 'TEST_F\(VectorIndexSegmentOnS3Test'

Length of output: 100


Script:

#!/bin/bash
# Description: Verify test coverage for S3 and file cache interactions.

# Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage.
rg --type cpp 'TEST_F\(VectorIndexSegmentOnS3Test'

Length of output: 1111


1207-1290: S3 interaction and file cache management setup is well-implemented.

The SetUp and TearDown methods ensure proper configuration and cleanup of S3 and file cache environments, which are essential for the tests. Verify that these configurations are consistent with the overall system requirements.

Verification successful

S3 and file cache management setup is consistent across the codebase.

The configurations in gtest_dm_vector_index.cpp align with the widespread use of S3::ClientFactory::instance() and FileCache::initialize() in other parts of the codebase, indicating a consistent approach. This setup is well-implemented and matches the overall system requirements.

  • Instances found in files like FlashService.cpp, S3LockService.cpp, Server.cpp, and others confirm the consistency.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify S3 configuration and file cache management in the codebase.

# Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase.
rg --type cpp --type c++ --type c 'S3::ClientFactory|FileCache::initialize'

Length of output: 105


Script:

#!/bin/bash
# Description: Verify S3 configuration and file cache management in the codebase.

# Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase.
rg --type cpp --type c 'S3::ClientFactory|FileCache::initialize'

Length of output: 9671


205-208: Update to VectorIndexDefinition is appropriate.

The change from VectorIndexInfo to VectorIndexDefinition enhances type safety and aligns with the refactoring efforts. Ensure that all usages of VectorIndexDefinition are correctly updated throughout the codebase.

Verification successful

VectorIndexDefinition is consistently used across the codebase.

The transition from VectorIndexInfo to VectorIndexDefinition is well-implemented throughout the relevant files. This change enhances type safety and aligns with the refactoring objectives.

  • dbms/src/TiDB/Schema/VectorIndex.h: Definition and usage of VectorIndexDefinition.
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp: Multiple instances of VectorIndexDefinition usage in tests.
  • Other related files in DeltaMerge and TiDB modules.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of `VectorIndexDefinition` in the codebase.

# Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase.
rg --type cpp --type c++ --type c 'VectorIndexDefinition'

Length of output: 85


Script:

#!/bin/bash
# Description: Verify all usages of `VectorIndexDefinition` in the codebase.

# Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase.
rg --type cpp 'VectorIndexDefinition'

Length of output: 3281

Comment on lines +236 to +297
FileSegmentPtr FileCache::getOrWait(const S3::S3FilenameView & s3_fname, const std::optional<UInt64> & filesize)
{
auto s3_key = s3_fname.toFullKey();
auto file_type = getFileType(s3_key);
auto & table = tables[static_cast<UInt64>(file_type)];

std::unique_lock lock(mtx);

auto f = table.get(s3_key);
if (f != nullptr)
{
lock.unlock();
f->setLastAccessTime(std::chrono::system_clock::now());
auto status = f->waitForNotEmpty();
if (status == FileSegment::Status::Complete)
{
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_hit).Increment();
return f;
}
else
{
// On-going download failed, let the caller retry.
return nullptr;
}
}

GET_METRIC(tiflash_storage_remote_cache, type_dtfile_miss).Increment();

auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
if (!reserveSpaceImpl(file_type, estimzted_size, /*try_evict*/ true))
{
// Space not enough.
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment();
LOG_DEBUG(
log,
"s3_key={} space not enough(capacity={} used={} estimzted_size={}), skip cache",
s3_key,
cache_capacity,
cache_used,
estimzted_size);

// Just throw, no need to let the caller retry.
throw Exception( //
ErrorCodes::S3_ERROR,
"Cannot reserve {} space for object {}",
estimzted_size,
s3_key);
return nullptr;
}

auto file_seg
= std::make_shared<FileSegment>(toLocalFilename(s3_key), FileSegment::Status::Empty, estimzted_size, file_type);
table.set(s3_key, file_seg);
lock.unlock();

++PerfContext::file_cache.fg_download_from_s3;
fgDownload(s3_key, file_seg);
if (!file_seg || !file_seg->isReadyToRead())
throw Exception(ErrorCodes::S3_ERROR, "Download object {} failed", s3_key);

return file_seg;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix typo in variable name.

The variable estimzted_size contains a typo. It should be estimated_size.

- auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
+ auto estimated_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
FileSegmentPtr FileCache::getOrWait(const S3::S3FilenameView & s3_fname, const std::optional<UInt64> & filesize)
{
auto s3_key = s3_fname.toFullKey();
auto file_type = getFileType(s3_key);
auto & table = tables[static_cast<UInt64>(file_type)];
std::unique_lock lock(mtx);
auto f = table.get(s3_key);
if (f != nullptr)
{
lock.unlock();
f->setLastAccessTime(std::chrono::system_clock::now());
auto status = f->waitForNotEmpty();
if (status == FileSegment::Status::Complete)
{
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_hit).Increment();
return f;
}
else
{
// On-going download failed, let the caller retry.
return nullptr;
}
}
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_miss).Increment();
auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
if (!reserveSpaceImpl(file_type, estimzted_size, /*try_evict*/ true))
{
// Space not enough.
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment();
LOG_DEBUG(
log,
"s3_key={} space not enough(capacity={} used={} estimzted_size={}), skip cache",
s3_key,
cache_capacity,
cache_used,
estimzted_size);
// Just throw, no need to let the caller retry.
throw Exception( //
ErrorCodes::S3_ERROR,
"Cannot reserve {} space for object {}",
estimzted_size,
s3_key);
return nullptr;
}
auto file_seg
= std::make_shared<FileSegment>(toLocalFilename(s3_key), FileSegment::Status::Empty, estimzted_size, file_type);
table.set(s3_key, file_seg);
lock.unlock();
++PerfContext::file_cache.fg_download_from_s3;
fgDownload(s3_key, file_seg);
if (!file_seg || !file_seg->isReadyToRead())
throw Exception(ErrorCodes::S3_ERROR, "Download object {} failed", s3_key);
return file_seg;
}
auto estimated_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
if (!reserveSpaceImpl(file_type, estimated_size, /*try_evict*/ true))
{
// Space not enough.
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment();
LOG_DEBUG(
log,
"s3_key={} space not enough(capacity={} used={} estimated_size={}), skip cache",
s3_key,
cache_capacity,
cache_used,
estimated_size);
// Just throw, no need to let the caller retry.
throw Exception( //
ErrorCodes::S3_ERROR,
"Cannot reserve {} space for object {}",
estimated_size,
s3_key);
return nullptr;
}

Comment on lines +23 to +61
size_t VectorIndexCache::cleanOutdatedCacheEntries()
{
size_t cleaned = 0;

std::unordered_set<String> files;
{
// Copy out the list to avoid occupying lock for too long.
// The complexity is just O(N) which is fine.
std::shared_lock lock(mu);
files = files_to_check;
}

for (const auto & file_path : files)
{
if (is_shutting_down)
break;

if (!cache.contains(file_path))
{
// It is evicted from LRU cache
std::unique_lock lock(mu);
files_to_check.erase(file_path);
}
else if (!Poco::File(file_path).exists())
{
LOG_INFO(log, "Dropping in-memory Vector Index cache because on-disk file is dropped, file={}", file_path);
{
std::unique_lock lock(mu);
files_to_check.erase(file_path);
}
cache.remove(file_path);
cleaned++;
}
}

LOG_DEBUG(log, "Cleaned {} outdated Vector Index cache entries", cleaned);

return cleaned;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider optimizing lock usage.

The method uses a shared lock to copy files_to_check, which is efficient. However, the repeated acquisition of a unique lock for each file could be optimized by batching operations if possible.

// Example: Consider batching operations that require a unique lock.

Comment on lines +342 to +355
// Vector index files are always not written into the merged file
// because we want to allow to be mmaped by the usearch.

auto buffer = ChecksumWriteBufferBuilder::build(
merged_file.buffer,
dmfile->getConfiguration()->getChecksumAlgorithm(),
dmfile->getConfiguration()->getChecksumFrameLength());

stream->vector_index->serializeBinary(*buffer);

col_stat.index_bytes = buffer->getMaterializedBytes();
const auto index_name = dmfile->colIndexPath(stream_name);
stream->vector_index->save(index_name);
col_stat.index_bytes = Poco::File(index_name).getSize();

// Memorize what kind of vector index it is, so that we can correctly restore it when reading.
col_stat.vector_index = dtpb::ColumnVectorIndexInfo{};
col_stat.vector_index->set_index_kind(String(magic_enum::enum_name(stream->vector_index->kind)));
col_stat.vector_index.emplace();
col_stat.vector_index->set_index_kind(
tipb::VectorIndexKind_Name(stream->vector_index->definition->kind));
col_stat.vector_index->set_distance_metric(
String(magic_enum::enum_name(stream->vector_index->distance_metric)));

MergedSubFileInfo info{
fname,
merged_file.file_info.number,
merged_file.file_info.size,
col_stat.index_bytes};
dmfile_meta->merged_sub_file_infos[fname] = info;

merged_file.file_info.size += col_stat.index_bytes;
buffer->next();
tipb::VectorDistanceMetric_Name(stream->vector_index->definition->distance_metric));
col_stat.vector_index->set_dimensions(stream->vector_index->definition->dimension);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor vector index file handling in finalizeColumn.

The logic for handling vector index files in finalizeColumn has been updated to save vector indices directly to a path derived from the column name. This change simplifies the process and may improve performance. Ensure that this refactoring aligns with the overall design and performance goals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants