-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cherry pick2 #9
base: feature/vector-index
Are you sure you want to change the base?
Cherry pick2 #9
Conversation
Signed-off-by: Wish <breezewish@outlook.com>
…pingcap#166) Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
WalkthroughThe recent changes focus on enhancing the vector indexing and file caching mechanisms within the database management system. Key modifications include the introduction of new classes and structures for better handling of vector indices, improvements in performance tracking for file caching operations, and updates to ensure more robust error handling. These updates streamline the codebase, enhance type safety, and improve overall efficiency in managing data retrieval and caching. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Cache
participant FileSystem
User->>Cache: Request File
Cache->>FileSystem: Check File Status
alt File Not Cached
FileSystem->>Cache: Download File
Cache->>User: Return File
else File Cached
Cache->>User: Return Cached File
end
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
@coderabbitai review |
Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Outside diff range, codebase verification and nitpick comments (9)
dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1)
44-44
: Ensure Serialization and Deserialization ofstd::optional<dtpb::VectorIndexFileProps>
The
toProto
andmergeFromProto
methods inColumnStat.h
do not currently handle thestd::optional<dtpb::VectorIndexFileProps>
. Ensure that these methods are updated to serialize and deserialize this field correctly to prevent potential issues.
- File:
dbms/src/Storages/DeltaMerge/File/ColumnStat.h
- Methods:
toProto
,mergeFromProto
Analysis chain
Verify the impact on serialization and deserialization.
The change to
std::optional<dtpb::VectorIndexFileProps>
may affect howvector_index
is serialized and deserialized. Ensure that all related methods handle the new type correctly.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the impact of the type change on serialization and deserialization. # Test: Search for methods related to serialization and deserialization of `ColumnStat`. rg --type cpp -A 3 $'toProto|mergeFromProto|serializeToBuffer|parseFromBuffer'Length of output: 8784
Script:
#!/bin/bash # Extract the implementation of `toProto` and `mergeFromProto` methods from `ColumnStat.h` ast-grep --lang cpp --pattern $'dtpb::ColumnStat toProto() const { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.h ast-grep --lang cpp --pattern $'void mergeFromProto(const dtpb::ColumnStat & proto) { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.hLength of output: 262
Script:
#!/bin/bash # Use ripgrep to search for the definitions of `toProto` and `mergeFromProto` methods in `ColumnStat.h` rg --type cpp 'dtpb::ColumnStat toProto\(\) const' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10 rg --type cpp 'void mergeFromProto\(const dtpb::ColumnStat & proto\)' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10Length of output: 1164
dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)
104-108
: Update documentation foraddStreams
method.The
addStreams
method signature has been updated to useTiDB::VectorIndexDefinitionPtr
instead ofTiDB::VectorIndexInfoPtr
. Ensure that any related documentation or comments are updated to reflect this change.dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)
83-103
: Consider Logging Additional Information in Destructor.The destructor logs the completion of reading a DMFile with vector index. Consider logging additional context, such as the number of rows processed or any errors encountered, to enhance traceability.
- "Finished read DMFile with vector index for column dmf_{}/{}(id={}), " + "Finished read DMFile with vector index for column dmf_{}/{}(id={}), processed_rows={}, "
106-123
: Clarify Filtering Logic inread
Method.The
read
method contains logic for filtering based on aFilterPtr
. Ensure that the comments and logic are clear, especially regarding the implications ofreturn_filter
beingfalse
.- // If return_filter == false, we must filter by ourselves. + // If return_filter == false, apply filtering within this method.
170-215
: Consider Caching Results inreadByIndexReader
.The method
readByIndexReader
iterates over packs and reads vector columns. Consider caching results if this method is called frequently with the same parameters to improve performance.
258-271
: Add Logging for Load Status inload
Method.Consider adding logging to indicate when the
load
method is invoked and completed, which can be useful for debugging and performance monitoring.+ LOG_DEBUG(log, "Loading vector index and search results");
477-485
: Clarify the Purpose ofgetPackIdFromBlock
.The method
getPackIdFromBlock
retrieves the pack ID from a block. Ensure that the purpose and usage of this method are well-documented, especially regarding assumptions about block alignment.- // The start offset of a block is ensured to be aligned with the pack. + // Assumes the start offset of a block is aligned with the pack.dbms/src/Storages/S3/FileCache.cpp (2)
68-107
: Use named constants for timeout values.The method uses magic numbers (30 and 300 seconds) for timeouts. Consider defining named constants for these values to improve readability and maintainability.
constexpr auto WAIT_TIMEOUT = std::chrono::seconds(30); constexpr auto MAX_WAIT_TIMEOUT = std::chrono::seconds(300);
160-176
: Enhance logging for file download attempts.Consider adding more detailed logging to trace the execution flow, especially when a file is removed from the cache due to inconsistencies.
LOG_DEBUG(log, "Attempting to download file for local read: {}", s3_fname.toFullKey());
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (36)
- contrib/tipb (1 hunks)
- dbms/src/Common/FailPoint.cpp (1 hunks)
- dbms/src/Common/LRUCache.h (1 hunks)
- dbms/src/Interpreters/Context.cpp (3 hunks)
- dbms/src/Interpreters/Context.h (1 hunks)
- dbms/src/Server/Server.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/DMContext.h (2 hunks)
- dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (1 hunks)
- dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (5 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (4 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4 hunks)
- dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (5 hunks)
- dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (4 hunks)
- dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (3 hunks)
- dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2 hunks)
- dbms/src/Storages/DeltaMerge/ScanContext.h (6 hunks)
- dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (26 hunks)
- dbms/src/Storages/S3/FileCache.cpp (8 hunks)
- dbms/src/Storages/S3/FileCache.h (6 hunks)
- dbms/src/Storages/S3/FileCachePerf.cpp (1 hunks)
- dbms/src/Storages/S3/FileCachePerf.h (1 hunks)
- dbms/src/TiDB/Schema/TiDB.cpp (3 hunks)
- dbms/src/TiDB/Schema/TiDB.h (1 hunks)
- dbms/src/TiDB/Schema/VectorIndex.h (1 hunks)
Files skipped from review due to trivial changes (1)
- contrib/tipb
Additional comments not posted (76)
dbms/src/Storages/S3/FileCachePerf.cpp (1)
1-22
: LGTM!The file sets up a thread-local performance context for file caching. The setup is straightforward and aligns with typical practices for performance tracking.
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1)
1-22
: LGTM!The file sets up a thread-local performance context for vector search operations. The setup is straightforward and aligns with typical practices for performance tracking.
dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1)
Line range hint
1-26
:
Simplification acknowledged and approved.The removal of the
vector
member and its constructor simplifies theRSIndex
structure, focusing on the essential components. This change enhances clarity and reduces unnecessary complexity.dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)
24-28
: LGTM! Modular design enhances clarity and maintainability.The introduction of
VectorIndexBuilder
andVectorIndexViewer
classes, along with their smart pointer types, suggests a more modular approach. This can improve code clarity and maintainability by separating concerns between building and viewing vector indices.dbms/src/Storages/S3/FileCachePerf.h (1)
1-37
: LGTM! Effective use of thread-local storage for performance tracking.The
FileCachePerfContext
structure provides a clear way to track performance metrics related to file caching. Usingthread_local
forfile_cache
ensures that performance data is managed per thread, which is suitable for concurrent environments.dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1)
1-37
: LGTM! Efficient thread-local performance tracking for vector search.The
VectorSearchPerfContext
structure is well-designed for tracking performance metrics in vector search operations. The use ofthread_local
forvector_search
ensures that performance data is isolated per thread, which is advantageous in concurrent processing.dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (2)
27-39
: ReviewVectorIndexHNSWBuilder
class changes.The class
VectorIndexHNSWBuilder
has been introduced, replacingUSearchIndexWithSerialization
. The constructor now takes aVectorIndexDefinitionPtr
, which suggests a design shift towards using definitions for initialization. Ensure that the new design aligns with the overall architecture and that theaddBlock
andsave
methods are correctly implemented.Verification successful
VectorIndexHNSWBuilder Class Implementation Verified
The
VectorIndexHNSWBuilder
class and its methodsaddBlock
andsave
are implemented correctly, with appropriate type checks and error handling. The integration with the rest of the codebase is consistent with the new design. No issues found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexHNSWBuilder` class. # Test: Search for the class definition and usage. Expect: Consistent usage with new constructor. rg --type cpp -A 5 $'class VectorIndexHNSWBuilder'Length of output: 652
Script:
#!/bin/bash # Description: Find all usages of the `VectorIndexHNSWBuilder` class to verify its integration and usage. # Test: Search for all instances where `VectorIndexHNSWBuilder` is used in the codebase. rg --type cpp 'VectorIndexHNSWBuilder'Length of output: 867
Script:
#!/bin/bash # Description: Verify the implementation of the `addBlock` and `save` methods in `Index.cpp`. # Test: Retrieve the implementation details of `addBlock` and `save` methods. rg --type cpp -A 10 'void VectorIndexHNSWBuilder::addBlock' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp rg --type cpp -A 10 'void VectorIndexHNSWBuilder::save' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cppLength of output: 1248
41-55
: ReviewVectorIndexHNSWViewer
class introduction.The
VectorIndexHNSWViewer
class extendsVectorIndexViewer
and introduces a staticview
method. Thesearch
method has been updated to remove theSearchStatistics
parameter, focusing on query information and valid rows. Ensure that this change does not impact the functionality negatively and aligns with the intended design.Verification successful
Removal of
SearchStatistics
Parameter VerifiedThe removal of the
SearchStatistics
parameter from thesearch
method in theVectorIndexHNSWViewer
class does not appear to impact functionality negatively. There are no references toSearchStatistics
in the codebase, indicating it was not crucial to the method's operation. This change aligns with the intended design.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexHNSWViewer` class. # Test: Search for the class definition and usage. Expect: Proper integration with the new search method. rg --type cpp -A 5 $'class VectorIndexHNSWViewer'Length of output: 650
Script:
#!/bin/bash # Description: Search for the usage of the `search` method in the `VectorIndexHNSWViewer` class. # Test: Find all occurrences of `search` method calls. Expect: Understand how the method is used and ensure the change is safe. rg --type cpp 'search\(' -A 3Length of output: 22355
Script:
#!/bin/bash # Description: Search for any references to `SearchStatistics` in relation to the `search` method. # Test: Find all occurrences of `SearchStatistics` to understand its previous usage and impact. rg --type cpp 'SearchStatistics'Length of output: 32
Script:
#!/bin/bash # Description: Search for comments or documentation mentioning `SearchStatistics`. # Test: Find any comments or documentation that reference `SearchStatistics`. rg --type cpp 'SearchStatistics' -g '*.h' -g '*.cpp' -A 3 -B 3Length of output: 62
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)
30-77
: ReviewVectorIndexCache
class implementation.The
VectorIndexCache
class uses an LRUCache for cachingVectorIndexViewer
objects. It includes mechanisms for cleaning outdated cache entries and managing shutdown operations. Ensure that the use of mutexes and condition variables is correct and that the cleaning logic is efficient and does not introduce race conditions.Verification successful
Integration and Usage of
VectorIndexCache
VerifiedThe
VectorIndexCache
class is well-integrated within the codebase, with clear methods for managing its lifecycle. It is used in various components, including the server setup and context management, and its functionality is covered by unit tests.
Server.cpp
: Sets the vector index cache in the global context.Context.cpp
: Provides methods to set, get, and drop the vector index cache.- Test files confirm the cache's functionality is tested.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexCache` class. # Test: Search for the class definition and usage. Expect: Proper integration with cache management logic. rg --type cpp -A 5 $'class VectorIndexCache'Length of output: 1190
Script:
#!/bin/bash # Description: Search for the usage of `VectorIndexCache` to understand its integration and usage. # Test: Find all occurrences of `VectorIndexCache` to verify its integration. rg --type cpp 'VectorIndexCache'Length of output: 3577
dbms/src/TiDB/Schema/VectorIndex.h (2)
27-35
: ReviewVectorIndexDefinition
struct introduction.The
VectorIndexDefinition
struct replaces previous enums and structures, encapsulating index properties. Ensure that the struct is comprehensive and correctly represents the necessary properties. The TODO comment suggests potential future additions, which should be tracked.Verification successful
The
VectorIndexDefinition
struct is consistently integrated.The
VectorIndexDefinition
struct is used across various components and is tested, indicating proper integration. The struct's usage in different contexts, including formatters for output, suggests it is well-incorporated into the codebase. No issues found with its current implementation.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexDefinition` struct. # Test: Search for the struct definition and usage. Expect: Consistent usage with new properties. rg --type cpp -A 5 $'struct VectorIndexDefinition'Length of output: 497
Script:
#!/bin/bash # Description: Search for usage of `VectorIndexDefinition` in the codebase to verify its integration. # Test: Find all occurrences of `VectorIndexDefinition` usage. rg --type cpp 'VectorIndexDefinition'Length of output: 3281
Line range hint
48-73
:
Reviewfmt::formatter
forVectorIndexDefinition
.The
fmt::formatter
specialization forVectorIndexDefinition
has been updated. Ensure that the formatting logic correctly reflects the new struct and that it integrates well with the existing logging or output mechanisms.dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (3)
Line range hint
33-41
:
Refactor toVectorIndexBuilder
: Ensure type support logic is correct.The
isSupportedType
method has been moved toVectorIndexBuilder
. This refactoring aligns with the new class responsibilities. Verify that the logic for checking supported types remains correct and comprehensive.
59-73
: Refactor toVectorIndexViewer
: Ensure view logic is correct.The
view
method replacesload
and now usesVectorIndexFileProps
. This refactoring is intended to improve clarity and modularity. Verify that the logic for viewing vector index files is correctly implemented and that all relevant cases are handled.
42-56
: Refactor toVectorIndexBuilder
: Check method signature and logic.The
create
method now usesVectorIndexDefinitionPtr
instead ofVectorIndexInfo
. This change enhances encapsulation and type safety. Ensure that all usages ofcreate
are updated to match the new signature and that the switch-case logic correctly handles all necessary cases.Verification successful
Verification Successful:
create
Method Signature Updated CorrectlyThe usage of
VectorIndexBuilder::create
has been successfully updated to the new signature usingVectorIndexDefinitionPtr
. All identified usages are consistent with this change.
dbms/src/Storages/DeltaMerge/File/DMFileWriter.h
: Thedo_vector_index
parameter is correctly typed asTiDB::VectorIndexDefinitionPtr
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `create` method are updated to the new signature. # Test: Search for the `create` method usage. Expect: Only occurrences of the new signature. rg --type cpp -A 5 $'VectorIndexBuilder::create'Length of output: 1216
Script:
#!/bin/bash # Description: Check the type of `do_vector_index` in `DMFileWriter.h` to verify it matches the new signature. rg --type cpp 'do_vector_index' dbms/src/Storages/DeltaMerge/File/DMFileWriter.h -A 5 -B 5Length of output: 1700
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (2)
31-55
: Introduction ofVectorIndexBuilder
: Review class structure and methods.The
VectorIndexBuilder
class is introduced to handle the construction of vector indices. The methodscreate
,isSupportedType
,addBlock
, andsave
are aligned with its responsibilities. Ensure that the class design supports efficient and correct index building.
58-90
: Introduction ofVectorIndexViewer
: Review class structure and methods.The
VectorIndexViewer
class is introduced to handle viewing vector index files. The methodsview
,search
, andget
are aligned with its responsibilities. Ensure that the class design supports efficient and correct index viewing.dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (3)
47-47
: Update toVectorIndexViewerPtr
: Ensure index interaction is correct.The
index
member variable now usesVectorIndexViewerPtr
, reflecting the new class structure. Verify that all interactions with the index are correctly updated to use the viewer.
56-56
: UpdatecalcResultsByPack
method signature: Ensure efficient result handling.The method now accepts a reference to a vector of
VectorIndexViewer::Key
, improving efficiency by avoiding unnecessary copies. Ensure that all calls to this method are updated to match the new signature.
65-71
: Update constructor: Ensure alignment with new indexing strategy.The constructor now takes
VectorIndexViewerPtr
and a reference to a vector ofVectorIndexViewer::Key
. This change aligns with the new indexing strategy. Ensure that all instances of this class are constructed using the updated parameters.dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (3)
93-98
: Destructor implementation looks good.The destructor correctly handles shutting down the cleaner thread and joining it.
86-91
: Verify thread management.Ensure that the cleaner thread is properly managed and terminated to prevent resource leaks.
Verification successful
Thread management is correctly implemented in
VectorIndexCache
.The destructor of
VectorIndexCache
ensures proper termination of the cleaner thread by setting a shutdown flag, notifying the condition variable, and joining the thread. This prevents resource leaks and ensures clean shutdown.
- Destructor:
VectorIndexCache::~VectorIndexCache()
setsis_shutting_down = true
, notifiesshutdown_cv
, and callscleaner_thread.join()
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify proper thread management in the constructor of `VectorIndexCache`. # Test: Check for proper thread management and termination. rg --type cpp 'VectorIndexCache' -A 5Length of output: 16167
63-83
: Ensure proper shutdown handling.The loop correctly handles shutdown signals, but ensure that any resources are properly released if exceptions occur.
Verification successful
Proper shutdown handling and thread management confirmed.
The
cleanOutdatedLoop
method and its associated thread are correctly managed. The destructor properly sets the shutdown flag, notifies the condition variable, and joins the thread, ensuring a clean and safe shutdown process. No further action is needed.
- Destructor:
VectorIndexCache::~VectorIndexCache()
properly joins thecleaner_thread
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify proper shutdown handling in the `cleanOutdatedLoop`. # Test: Check for any resources that might not be released upon exceptions. rg --type cpp 'cleanOutdatedLoop' -A 5Length of output: 1429
Script:
#!/bin/bash # Description: Check if `cleaner_thread` is properly managed in the `VectorIndexCache` class. # Test: Look for thread management (join or detach) in the destructor or elsewhere in the class. rg --type cpp 'cleaner_thread' -A 10Length of output: 1717
dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (2)
90-94
: Addition ofVectorIndexFileProps
looks good.The new message is well-defined and should enhance the handling of vector index properties.
65-67
: Verify backward compatibility.The change from
ColumnVectorIndexInfo
toVectorIndexFileProps
may affect backward compatibility. Ensure that the reserved field is correctly handled in the codebase.Verification successful
Backward compatibility maintained.
The change from
ColumnVectorIndexInfo
toVectorIndexFileProps
does not impact backward compatibility, as there are no remaining references toColumnVectorIndexInfo
in the codebase. The use of the reserved field 101 ensures that the field number is not reused.
- The reserved field 101 is correctly implemented in
dmfile.proto
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify backward compatibility for changes in `ColumnStat`. # Test: Check for usage of the reserved field and ensure compatibility. rg --type proto 'reserved 101' -A 5Length of output: 67
Script:
#!/bin/bash # Description: Verify backward compatibility for changes in `ColumnStat`. # Test: Check for usage of the reserved field and ensure compatibility. rg 'reserved 101' --glob '*.proto' -A 5Length of output: 540
Script:
#!/bin/bash # Description: Check for references to `ColumnVectorIndexInfo` in the codebase to ensure backward compatibility. # Test: Search for `ColumnVectorIndexInfo` in the codebase. rg 'ColumnVectorIndexInfo'Length of output: 26
dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2)
96-96
: Verify impact of access modifier change forserializeColumnFileSet
.Changing the access modifier from public to private may affect accessibility. Ensure that this change is intentional and does not introduce any issues.
Verification successful
Access modifier change for
serializeColumnFileSet
is safe.The method
serializeColumnFileSet
is only used within theSerializer.cpp
file and does not have external dependencies. Changing its access modifier to private improves encapsulation without affecting other parts of the codebase.
- No external usage of
serializeColumnFileSet
found outsideSerializer.cpp
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the impact of changing the access modifier for `serializeColumnFileSet`. # Test: Check for any unintended consequences of making `serializeColumnFileSet` private. rg --type cpp 'serializeColumnFileSet' -A 5Length of output: 4711
96-96
: Verify impact of access modifier change forserializeSegment
.Changing the access modifier from private to public may affect encapsulation. Ensure that this change is intentional and does not introduce any issues.
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (2)
119-119
: Type consistency maintained.The update to use
VectorIndexViewer::Key
forrowid
ensures consistency with the new type usage across the file.
37-47
: Consider the performance impact of the runtime check.The addition of a runtime check to ensure that
sorted_results
is sorted improves robustness. However, this could introduce a performance overhead in critical paths. Consider evaluating the necessity of this check in production environments or providing a build configuration to disable it.dbms/src/Storages/DeltaMerge/DMContext.h (1)
136-137
: Verify the usage of the newscan_context
parameter.The addition of the
scan_context
parameter enhances flexibility, but ensure that its usage is correctly handled throughout the codebase.dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4)
75-75
: Use builder pattern for vector index creation.The
vector_index
is now created usingVectorIndexBuilder::create
, which enhances flexibility. Ensure that the builder pattern is correctly implemented and utilized.
101-101
: Update member variable type.The
vector_index
member variable type has been updated toVectorIndexBuilderPtr
. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Update member variable type verified.
The
vector_index
member variable type has been updated toVectorIndexBuilderPtr
in the relevant parts of the codebase. No inconsistencies were found in the application of this change.
- The
vector_index
member variable inDMFileWriter.h
has been correctly updated toVectorIndexBuilderPtr
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `vector_index` member variable are updated to use `VectorIndexBuilderPtr`. # Test: Search for `vector_index` usage. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'vector_index'Length of output: 30888
57-57
: Update constructor parameter type.The constructor now uses
TiDB::VectorIndexDefinitionPtr
instead ofTiDB::VectorIndexInfoPtr
. Ensure that all instances of this class are updated to pass the correct type.
163-163
: Update method parameter type.The
addStreams
method now usesTiDB::VectorIndexDefinitionPtr
instead ofTiDB::VectorIndexInfoPtr
. Ensure that all calls to this method are updated to pass the correct type.Verification successful
All
addStreams
method calls use the correct parameter type.The
addStreams
method now correctly usesTiDB::VectorIndexDefinitionPtr
for thedo_vector_index
parameter, and all relevant calls to this method have been updated accordingly. The verification confirms that the type change has been properly implemented.
DMFileWriter.cpp
: The call toaddStreams
usescd.vector_index
, which is correctly defined asTiDB::VectorIndexDefinitionPtr
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all calls to `addStreams` use the correct type for `do_vector_index`. # Test: Search for `addStreams` calls. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'addStreams'Length of output: 1531
Script:
#!/bin/bash # Use ast-grep to find all calls to the addStreams method and check the type of the do_vector_index parameter. ast-grep --lang cpp --pattern 'addStreams($_, $_, $_, $_)'Length of output: 60
Script:
#!/bin/bash # Use rg to find all calls to the addStreams method and include context lines to check parameter types. rg --type cpp 'addStreams' -A 3 -B 2Length of output: 1678
Script:
#!/bin/bash # Search for the definition of `cd` to confirm the type of `cd.vector_index`. rg --type cpp 'cd\s*=' -A 3 -B 2Length of output: 11390
dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (2)
95-95
: Update member variable type.The
vector_index
member variable type has been updated toTiDB::VectorIndexDefinitionPtr
. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Verification complete: Member variable type updated consistently.
The
vector_index
member variable type change toTiDB::VectorIndexDefinitionPtr
inColumnDefine
has been consistently applied across the codebase. No discrepancies found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `vector_index` member variable in `ColumnDefine` are updated to use `TiDB::VectorIndexDefinitionPtr`. # Test: Search for `vector_index` usage in `ColumnDefine`. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'ColumnDefine'Length of output: 324094
102-102
: Update constructor parameter type.The constructor now uses
TiDB::VectorIndexDefinitionPtr
forvector_index_
. Ensure that all instances of this struct are updated to pass the correct type.Verification successful
No issues found with
ColumnDefine
constructor usage. The search did not find any instances ofColumnDefine
being instantiated with thevector_index_
parameter, indicating that the codebase does not directly use this parameter in constructor calls.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all instantiations of the `ColumnDefine` struct use the correct type for `vector_index_`. # Test: Search for `ColumnDefine` instantiations. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'ColumnDefine'Length of output: 324094
Script:
#!/bin/bash # Search for all constructor calls for `ColumnDefine` that include the `vector_index_` parameter. ast-grep --lang cpp --pattern 'ColumnDefine($_, $_, $_, $_, $_)' --jsonLength of output: 76
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (7)
96-96
: Override destructor.The destructor now explicitly uses the
override
keyword, which improves clarity and ensures that the base class has a virtual destructor.
108-108
: Override method signature.The
read
method now explicitly uses theoverride
keyword, which improves clarity and ensures that the base class has a virtual method.
144-144
: Declare method without implementation.The
readByIndexReader
method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.
148-148
: Declare method without implementation.The
readByFollowingOtherColumns
method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.
182-182
: Update member variable type.The
vec_index
member variable type has been updated toVectorIndexViewerPtr
. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Member variable type update verified.
The
vec_index
member variable type has been correctly updated toVectorIndexViewerPtr
in all relevant parts of the codebase. No further action is needed.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `vec_index` member variable are updated to use `VectorIndexViewerPtr`. # Test: Search for `vec_index` usage. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'vec_index'Length of output: 15367
186-186
: Update member variable name and purpose.The
valid_rows_after_search
variable has been replaced withsorted_results
, indicating a change in how results are managed. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Change from
valid_rows_after_search
tosorted_results
is correctly reflected.The
sorted_results
variable is used consistently across the codebase, and there are no remaining references tovalid_rows_after_search
. The transition appears to be complete and correctly implemented.
- Files checked:
DMFileWithVectorIndexBlockInputStream.h
VectorColumnFromIndexReader.h
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `sorted_results` member variable are updated accordingly. # Test: Search for `sorted_results` usage. Expect: Only occurrences with the new name and purpose. rg --type cpp -A 5 $'sorted_results'Length of output: 11304
113-113
: Declare method without implementation.The
readImpl
method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.Verification successful
Implementation Found for
readImpl
MethodThe
readImpl
method declared inDMFileWithVectorIndexBlockInputStream.h
is implemented in the corresponding source fileDMFileWithVectorIndexBlockInputStream.cpp
. This addresses the concern regarding the method's implementation. No further action is required.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `readImpl` is provided in the corresponding source file. # Test: Search for `readImpl` implementation. Expect: Implementation in the corresponding source file. rg --type cpp -A 5 $'readImpl'Length of output: 88199
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5)
38-54
: LGTM!The
getUSearchMetricKind
function correctly maps distance metrics and handles unsupported metrics with a runtime check.
58-64
: Constructor looks good!The
VectorIndexHNSWBuilder
constructor correctly initializes the index with the appropriate metric and checks the index kind.
Line range hint
67-106
: Robust method implementation!The
addBlock
method effectively handles memory allocation and data validity checks, ensuring robust data addition to the index.
110-113
: Save method is well-implemented!The
save
method effectively handles errors using runtime checks, ensuring the index is saved correctly.
Line range hint
134-198
: Comprehensive search method!The
search
method includes detailed checks and exception handling, ensuring robust search functionality. Note the TODO comment regardingefSearch
support.dbms/src/Common/LRUCache.h (1)
74-80
: Thread-safecontains
function!The
contains
function correctly checks for the presence of a key using a lock guard, ensuring thread safety without altering the LRU order.dbms/src/Storages/S3/FileCache.h (4)
228-233
: Efficient caching strategy!The
downloadFileForLocalRead
method effectively ensures that files in use are not evicted prematurely, enhancing caching strategy.
249-253
: Robust retrieval strategy!The
getOrWait
method provides a robust mechanism to ensure file availability, handling waiting and foreground downloading effectively.
256-256
: Foreground download method looks good!The
fgDownload
method is well-implemented to handle file downloads in the foreground, ensuring correct processing.
75-76
: VerifywaitForNotEmpty
implementation.Ensure that the
waitForNotEmpty
method correctly blocks until the segment is populated, maintaining thread safety.Verification successful
waitForNotEmpty
Method Implementation VerifiedThe
waitForNotEmpty
method inFileSegment
is correctly implemented to block until the segment is populated, using mutexes and condition variables to ensure thread safety. The method includes a timeout to prevent indefinite blocking.
- File:
dbms/src/Storages/S3/FileCache.cpp
- Implementation Details: Uses a mutex (
mtx
) and a condition variable (cv_ready
) to wait for the status to change fromEmpty
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `waitForNotEmpty` method in `FileSegment`. # Test: Check if `waitForNotEmpty` is implemented and used correctly. ast-grep --lang cpp --pattern $'Status FileSegment::waitForNotEmpty() { $$$ }'Length of output: 81
Script:
#!/bin/bash # Description: Search for any method named `waitForNotEmpty` in the entire codebase to locate its implementation or usage. # Search for `waitForNotEmpty` method across all C++ files. rg 'waitForNotEmpty' --type cpp -A 5Length of output: 2284
dbms/src/TiDB/Schema/TiDB.h (1)
209-209
: Verify the impact of changingvector_index
type.The type of
vector_index
in theColumnInfo
struct has been changed fromVectorIndexInfoPtr
toVectorIndexDefinitionPtr
. This change may affect how vector indices are used throughout the codebase. Ensure that all usages ofvector_index
are updated to accommodate this new type.dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)
64-64
: EnsureVectorIndexBuilder::isSupportedType
is correctly implemented.The change from
VectorIndex::isSupportedType
toVectorIndexBuilder::isSupportedType
suggests a restructuring of vector index handling. Verify that the new method is implemented correctly and covers all necessary type checks.Verification successful
Verification of
VectorIndexBuilder::isSupportedType
ImplementationThe
VectorIndexBuilder::isSupportedType
method is correctly implemented. It checks for nullable types and supportsDataTypeFloat32
. This implementation aligns with its intended use inDMFileWriter.cpp
to ensure only supported types are processed for vector indexing.
- File:
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp
- Method:
bool VectorIndexBuilder::isSupportedType(const IDataType & type)
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `VectorIndexBuilder::isSupportedType`. # Test: Search for the implementation of `isSupportedType`. Expect: Correct type checks. ast-grep --lang cpp --pattern $'class VectorIndexBuilder { $$$ static bool isSupportedType($_) { $$$ } $$$ }'Length of output: 469
Script:
#!/bin/bash # Search for the definition and usage of `isSupportedType` in the codebase rg 'isSupportedType' --context 5 --type cpp # Search for the definition of the `VectorIndexBuilder` class rg 'class VectorIndexBuilder' --context 10 --type cppLength of output: 5909
dbms/src/Storages/DeltaMerge/ScanContext.h (2)
93-93
: Initializetotal_vector_idx_load_from_s3
correctly.The new atomic variable
total_vector_idx_load_from_s3
is introduced to track vector index load times from S3. Ensure that it is correctly initialized and used throughout the class.
150-150
: Verify integration oftotal_vector_idx_load_from_s3
in serialization and merging.The new variable
total_vector_idx_load_from_s3
is integrated into serialization, deserialization, and merging logic. Verify that these changes are correctly implemented and that the variable is accurately tracked and aggregated.Also applies to: 202-202, 262-262, 314-314
Verification successful
Integration of
total_vector_idx_load_from_s3
is correctly implemented.The variable
total_vector_idx_load_from_s3
is properly integrated into serialization and merging logic. It is serialized and deserialized usingtiflash_scan_context_pb
, and its value is aggregated during merging. Additionally, its behavior is verified through multiple tests. No issues found with its integration.
- Serialization and Deserialization: Properly handled via
tiflash_scan_context_pb
.- Merging Logic: Correctly aggregated with other contexts.
- Testing: Adequately tested in
gtest_dm_vector_index.cpp
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the integration of `total_vector_idx_load_from_s3` in serialization, deserialization, and merging. # Test: Search for `total_vector_idx_load_from_s3` usage in serialization and merging. Expect: Correct integration. rg --type cpp --word-regexp 'total_vector_idx_load_from_s3'Length of output: 2866
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)
125-168
: Optimize Filter Assignment inreadImpl
Method.The filter assignment in
readImpl
could be optimized by using a more efficient algorithm or data structure, especially ifsorted_results
is large.- auto it = std::lower_bound(sorted_results.begin(), sorted_results.end(), start_offset); - while (it != sorted_results.end()) - { - auto rowid = *it; - if (rowid >= max_rowid_exclusive) - break; - filter[rowid - start_offset] = 1; - ++it; - } + for (auto rowid : sorted_results) + { + if (rowid >= start_offset && rowid < max_rowid_exclusive) + filter[rowid - start_offset] = 1; + }Likely invalid or redundant comment.
34-81
: Ensure Proper Error Handling for Constructor Preconditions.The constructor of
DMFileWithVectorIndexBlockInputStream
usesRUNTIME_CHECK
to validate preconditions. Ensure that these checks are appropriate and that exceptions are handled correctly if they fail.
217-255
: Ensure Consistency inreadByFollowingOtherColumns
.The method
readByFollowingOtherColumns
assembles a block usingheader_layout
. Ensure consistency in the layout and verify that all necessary columns are present.Verification successful
Consistency in
readByFollowingOtherColumns
is ensured.The constructor of
DMFileWithVectorIndexBlockInputStream
includes runtime checks to verify thatheader_layout
contains all necessary columns. These checks ensure that the layout used inreadByFollowingOtherColumns
is consistent. No further action is needed.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify consistency in the layout of `readByFollowingOtherColumns`. # Test: Ensure that all necessary columns are present in the layout. # Search for the usage of `header_layout` in the method. rg --type cpp --context 5 'header_layout' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'Length of output: 1937
273-384
: Verify S3 File Cache Configuration inloadVectorIndex
.The method
loadVectorIndex
relies on S3 file cache. Ensure that the configuration is correct and that potential errors are handled gracefully.Verification successful
S3 File Cache Configuration and Error Handling Verified
The
loadVectorIndex
method correctly verifies the presence of the S3 file cache and handles potential errors by retrying downloads and throwing exceptions if necessary. The configuration and error handling are properly implemented.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify S3 file cache configuration in `loadVectorIndex`. # Test: Ensure that the configuration is correct and errors are handled. # Search for S3 file cache usage in the method. rg --type cpp --context 5 'S3' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'Length of output: 1961
386-475
: Ensure Correctness of Vector Search Logic inloadVectorSearchResult
.The method
loadVectorSearchResult
performs a vector search and updates the pack filter. Verify the correctness of the search logic and the impact on the pack filter.dbms/src/Common/FailPoint.cpp (1)
109-109
: Verify Integration of New Fail Pointfile_cache_fg_download_fail
.Ensure that the new fail point
file_cache_fg_download_fail
is correctly integrated and used in the codebase. Verify that it serves its intended purpose without unintended side effects.Verification successful
Fail Point
file_cache_fg_download_fail
is Correctly IntegratedThe fail point
file_cache_fg_download_fail
is properly declared, used, and tested within the codebase, indicating correct integration. No issues were found regarding its implementation or usage.
- Declaration:
dbms/src/Storages/S3/FileCache.cpp
- Usage: Triggered in
FileCache.cpp
and tested ingtest_dm_vector_index.cpp
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify integration of new fail point `file_cache_fg_download_fail`. # Test: Ensure the fail point is correctly integrated and used. # Search for the usage of `file_cache_fg_download_fail` in the codebase. rg --type cpp 'file_cache_fg_download_fail'Length of output: 758
dbms/src/Interpreters/Context.h (1)
395-395
: Verify Impact of Parameter Name Change insetVectorIndexCache
.The parameter name change from
cache_size_in_bytes
tocache_entities
may impact how the method is used. Ensure that all calls to this method are updated accordingly and that the new name accurately reflects its purpose.Verification successful
Parameter Name Change Verified:
setVectorIndexCache
The parameter name change from
cache_size_in_bytes
tocache_entities
is correctly implemented. The calls to this method have been updated, and the new name accurately reflects its purpose, as evidenced by the context and comments in the code.
gtest_dm_vector_index.cpp
: Uses a literal value for cache entities.Server.cpp
: Uses a configuration value for cache entities with appropriate comments explaining the change.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify impact of parameter name change in `setVectorIndexCache`. # Test: Ensure all calls to this method are updated and the new name reflects its purpose. # Search for calls to `setVectorIndexCache` in the codebase. rg --type cpp 'setVectorIndexCache'Length of output: 420
Script:
#!/bin/bash # Description: Gather context around calls to `setVectorIndexCache` to verify parameter name change. # Get 5 lines of context around each call to understand usage and purpose. # Get context around the call in `gtest_dm_vector_index.cpp`. rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp # Get context around the call in `Server.cpp`. rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Server/Server.cppLength of output: 1392
dbms/src/Storages/S3/FileCache.cpp (1)
676-700
: Foreground download logic is well-implemented.The method effectively manages foreground downloads with appropriate logging and exception handling. The use of fail points for testing is commendable.
dbms/src/TiDB/Schema/TiDB.cpp (2)
411-417
: Improved type safety and error handling for vector index.The use of
tipb::VectorIndexKind
andtipb::VectorDistanceMetric
, along with runtime checks, enhances type safety and robustness.
475-499
: Enhanced clarity and robustness in vector index parsing.The explicit parsing and structured initialization improve clarity and error handling in vector index management.
dbms/src/Interpreters/Context.cpp (4)
57-57
: Include Directive Update Approved.The change from
VectorIndex.h
toVectorIndexCache.h
aligns with modifications in cache handling and improves clarity.
1388-1388
: Simplified Cache Reset Approved.The direct call to
reset()
on the shared pointer enhances clarity and conciseness.
1410-1410
: Simplified Cache Reset Approved.The direct call to
reset()
on the shared pointer enhances clarity and conciseness.
1391-1397
: Method Signature Change Approved.The parameter change from
cache_size_in_bytes
tocache_entities
aligns with the new focus on entity count. Ensure all usages of this method are updated accordingly.Verification successful
Method Signature Change Verified.
All usages of
setVectorIndexCache
have been updated to reflect the new parametercache_entities
. The changes are consistent across the codebase.
dbms/src/Server/Server.cpp
: Usage withvec_index_cache_entities
.dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp
: Usage with1000
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of `setVectorIndexCache` to ensure they match the new parameter `cache_entities`. # Test: Search for the method usage. Expect: Only occurrences of the new parameter name. rg --type cpp -A 3 $'setVectorIndexCache'Length of output: 1274
dbms/src/Server/Server.cpp (1)
1442-1445
: Verify the impact of changing vector index cache configuration.The change from a byte size to an entity count for the vector index cache could have significant implications on memory management and performance. Ensure that this change is well-documented and tested to confirm it meets the desired performance and memory usage goals.
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (5)
93-96
: Addition ofcdVec()
method is appropriate.The
cdVec()
method provides a useful utility for defining vector columns, enhancing code reusability and readability.
884-887
: Update toVectorIndexDefinition
is appropriate.The change from
VectorIndexInfo
toVectorIndexDefinition
enhances type safety and aligns with the refactoring efforts. Ensure that all usages ofVectorIndexDefinition
are correctly updated throughout the codebase.
1427-1907
: Comprehensive test cases for S3 and file cache interactions.The test cases cover critical scenarios, including cache eviction and S3 failures, ensuring robustness. Consider verifying the test coverage to ensure all edge cases are addressed.
Verification successful
Comprehensive test cases for S3 and file cache interactions confirmed.
The test cases in
VectorIndexSegmentOnS3Test
cover various scenarios, ensuring robust handling of S3 and file cache interactions. The coverage appears comprehensive.
FileCacheNotEnabled
ReadWithoutIndex
ReadFromIndex
FileCacheEvict
FileCacheEvictAndVectorCacheDrop
FileCacheDeleted
FileCacheDeletedAndVectorCacheDrop
ConcurrentDownloadFromS3
S3Failure
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify test coverage for S3 and file cache interactions. # Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage. rg --type cpp --type c++ --type c 'TEST_F\(VectorIndexSegmentOnS3Test'Length of output: 100
Script:
#!/bin/bash # Description: Verify test coverage for S3 and file cache interactions. # Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage. rg --type cpp 'TEST_F\(VectorIndexSegmentOnS3Test'Length of output: 1111
1207-1290
: S3 interaction and file cache management setup is well-implemented.The
SetUp
andTearDown
methods ensure proper configuration and cleanup of S3 and file cache environments, which are essential for the tests. Verify that these configurations are consistent with the overall system requirements.Verification successful
S3 and file cache management setup is consistent across the codebase.
The configurations in
gtest_dm_vector_index.cpp
align with the widespread use ofS3::ClientFactory::instance()
andFileCache::initialize()
in other parts of the codebase, indicating a consistent approach. This setup is well-implemented and matches the overall system requirements.
- Instances found in files like
FlashService.cpp
,S3LockService.cpp
,Server.cpp
, and others confirm the consistency.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify S3 configuration and file cache management in the codebase. # Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase. rg --type cpp --type c++ --type c 'S3::ClientFactory|FileCache::initialize'Length of output: 105
Script:
#!/bin/bash # Description: Verify S3 configuration and file cache management in the codebase. # Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase. rg --type cpp --type c 'S3::ClientFactory|FileCache::initialize'Length of output: 9671
205-208
: Update toVectorIndexDefinition
is appropriate.The change from
VectorIndexInfo
toVectorIndexDefinition
enhances type safety and aligns with the refactoring efforts. Ensure that all usages ofVectorIndexDefinition
are correctly updated throughout the codebase.Verification successful
VectorIndexDefinition
is consistently used across the codebase.The transition from
VectorIndexInfo
toVectorIndexDefinition
is well-implemented throughout the relevant files. This change enhances type safety and aligns with the refactoring objectives.
dbms/src/TiDB/Schema/VectorIndex.h
: Definition and usage ofVectorIndexDefinition
.dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp
: Multiple instances ofVectorIndexDefinition
usage in tests.- Other related files in
DeltaMerge
andTiDB
modules.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of `VectorIndexDefinition` in the codebase. # Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase. rg --type cpp --type c++ --type c 'VectorIndexDefinition'Length of output: 85
Script:
#!/bin/bash # Description: Verify all usages of `VectorIndexDefinition` in the codebase. # Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase. rg --type cpp 'VectorIndexDefinition'Length of output: 3281
FileSegmentPtr FileCache::getOrWait(const S3::S3FilenameView & s3_fname, const std::optional<UInt64> & filesize) | ||
{ | ||
auto s3_key = s3_fname.toFullKey(); | ||
auto file_type = getFileType(s3_key); | ||
auto & table = tables[static_cast<UInt64>(file_type)]; | ||
|
||
std::unique_lock lock(mtx); | ||
|
||
auto f = table.get(s3_key); | ||
if (f != nullptr) | ||
{ | ||
lock.unlock(); | ||
f->setLastAccessTime(std::chrono::system_clock::now()); | ||
auto status = f->waitForNotEmpty(); | ||
if (status == FileSegment::Status::Complete) | ||
{ | ||
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_hit).Increment(); | ||
return f; | ||
} | ||
else | ||
{ | ||
// On-going download failed, let the caller retry. | ||
return nullptr; | ||
} | ||
} | ||
|
||
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_miss).Increment(); | ||
|
||
auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type); | ||
if (!reserveSpaceImpl(file_type, estimzted_size, /*try_evict*/ true)) | ||
{ | ||
// Space not enough. | ||
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment(); | ||
LOG_DEBUG( | ||
log, | ||
"s3_key={} space not enough(capacity={} used={} estimzted_size={}), skip cache", | ||
s3_key, | ||
cache_capacity, | ||
cache_used, | ||
estimzted_size); | ||
|
||
// Just throw, no need to let the caller retry. | ||
throw Exception( // | ||
ErrorCodes::S3_ERROR, | ||
"Cannot reserve {} space for object {}", | ||
estimzted_size, | ||
s3_key); | ||
return nullptr; | ||
} | ||
|
||
auto file_seg | ||
= std::make_shared<FileSegment>(toLocalFilename(s3_key), FileSegment::Status::Empty, estimzted_size, file_type); | ||
table.set(s3_key, file_seg); | ||
lock.unlock(); | ||
|
||
++PerfContext::file_cache.fg_download_from_s3; | ||
fgDownload(s3_key, file_seg); | ||
if (!file_seg || !file_seg->isReadyToRead()) | ||
throw Exception(ErrorCodes::S3_ERROR, "Download object {} failed", s3_key); | ||
|
||
return file_seg; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo in variable name.
The variable estimzted_size
contains a typo. It should be estimated_size
.
- auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
+ auto estimated_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
FileSegmentPtr FileCache::getOrWait(const S3::S3FilenameView & s3_fname, const std::optional<UInt64> & filesize) | |
{ | |
auto s3_key = s3_fname.toFullKey(); | |
auto file_type = getFileType(s3_key); | |
auto & table = tables[static_cast<UInt64>(file_type)]; | |
std::unique_lock lock(mtx); | |
auto f = table.get(s3_key); | |
if (f != nullptr) | |
{ | |
lock.unlock(); | |
f->setLastAccessTime(std::chrono::system_clock::now()); | |
auto status = f->waitForNotEmpty(); | |
if (status == FileSegment::Status::Complete) | |
{ | |
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_hit).Increment(); | |
return f; | |
} | |
else | |
{ | |
// On-going download failed, let the caller retry. | |
return nullptr; | |
} | |
} | |
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_miss).Increment(); | |
auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type); | |
if (!reserveSpaceImpl(file_type, estimzted_size, /*try_evict*/ true)) | |
{ | |
// Space not enough. | |
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment(); | |
LOG_DEBUG( | |
log, | |
"s3_key={} space not enough(capacity={} used={} estimzted_size={}), skip cache", | |
s3_key, | |
cache_capacity, | |
cache_used, | |
estimzted_size); | |
// Just throw, no need to let the caller retry. | |
throw Exception( // | |
ErrorCodes::S3_ERROR, | |
"Cannot reserve {} space for object {}", | |
estimzted_size, | |
s3_key); | |
return nullptr; | |
} | |
auto file_seg | |
= std::make_shared<FileSegment>(toLocalFilename(s3_key), FileSegment::Status::Empty, estimzted_size, file_type); | |
table.set(s3_key, file_seg); | |
lock.unlock(); | |
++PerfContext::file_cache.fg_download_from_s3; | |
fgDownload(s3_key, file_seg); | |
if (!file_seg || !file_seg->isReadyToRead()) | |
throw Exception(ErrorCodes::S3_ERROR, "Download object {} failed", s3_key); | |
return file_seg; | |
} | |
auto estimated_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type); | |
if (!reserveSpaceImpl(file_type, estimated_size, /*try_evict*/ true)) | |
{ | |
// Space not enough. | |
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment(); | |
LOG_DEBUG( | |
log, | |
"s3_key={} space not enough(capacity={} used={} estimated_size={}), skip cache", | |
s3_key, | |
cache_capacity, | |
cache_used, | |
estimated_size); | |
// Just throw, no need to let the caller retry. | |
throw Exception( // | |
ErrorCodes::S3_ERROR, | |
"Cannot reserve {} space for object {}", | |
estimated_size, | |
s3_key); | |
return nullptr; | |
} |
size_t VectorIndexCache::cleanOutdatedCacheEntries() | ||
{ | ||
size_t cleaned = 0; | ||
|
||
std::unordered_set<String> files; | ||
{ | ||
// Copy out the list to avoid occupying lock for too long. | ||
// The complexity is just O(N) which is fine. | ||
std::shared_lock lock(mu); | ||
files = files_to_check; | ||
} | ||
|
||
for (const auto & file_path : files) | ||
{ | ||
if (is_shutting_down) | ||
break; | ||
|
||
if (!cache.contains(file_path)) | ||
{ | ||
// It is evicted from LRU cache | ||
std::unique_lock lock(mu); | ||
files_to_check.erase(file_path); | ||
} | ||
else if (!Poco::File(file_path).exists()) | ||
{ | ||
LOG_INFO(log, "Dropping in-memory Vector Index cache because on-disk file is dropped, file={}", file_path); | ||
{ | ||
std::unique_lock lock(mu); | ||
files_to_check.erase(file_path); | ||
} | ||
cache.remove(file_path); | ||
cleaned++; | ||
} | ||
} | ||
|
||
LOG_DEBUG(log, "Cleaned {} outdated Vector Index cache entries", cleaned); | ||
|
||
return cleaned; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider optimizing lock usage.
The method uses a shared lock to copy files_to_check
, which is efficient. However, the repeated acquisition of a unique lock for each file could be optimized by batching operations if possible.
// Example: Consider batching operations that require a unique lock.
// Vector index files are always not written into the merged file | ||
// because we want to allow to be mmaped by the usearch. | ||
|
||
auto buffer = ChecksumWriteBufferBuilder::build( | ||
merged_file.buffer, | ||
dmfile->getConfiguration()->getChecksumAlgorithm(), | ||
dmfile->getConfiguration()->getChecksumFrameLength()); | ||
|
||
stream->vector_index->serializeBinary(*buffer); | ||
|
||
col_stat.index_bytes = buffer->getMaterializedBytes(); | ||
const auto index_name = dmfile->colIndexPath(stream_name); | ||
stream->vector_index->save(index_name); | ||
col_stat.index_bytes = Poco::File(index_name).getSize(); | ||
|
||
// Memorize what kind of vector index it is, so that we can correctly restore it when reading. | ||
col_stat.vector_index = dtpb::ColumnVectorIndexInfo{}; | ||
col_stat.vector_index->set_index_kind(String(magic_enum::enum_name(stream->vector_index->kind))); | ||
col_stat.vector_index.emplace(); | ||
col_stat.vector_index->set_index_kind( | ||
tipb::VectorIndexKind_Name(stream->vector_index->definition->kind)); | ||
col_stat.vector_index->set_distance_metric( | ||
String(magic_enum::enum_name(stream->vector_index->distance_metric))); | ||
|
||
MergedSubFileInfo info{ | ||
fname, | ||
merged_file.file_info.number, | ||
merged_file.file_info.size, | ||
col_stat.index_bytes}; | ||
dmfile_meta->merged_sub_file_infos[fname] = info; | ||
|
||
merged_file.file_info.size += col_stat.index_bytes; | ||
buffer->next(); | ||
tipb::VectorDistanceMetric_Name(stream->vector_index->definition->distance_metric)); | ||
col_stat.vector_index->set_dimensions(stream->vector_index->definition->dimension); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor vector index file handling in finalizeColumn
.
The logic for handling vector index files in finalizeColumn
has been updated to save vector indices directly to a path derived from the column name. This change simplifies the process and may improve performance. Ensure that this refactoring aligns with the overall design and performance goals.
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
contains
method in the LRUCache for non-intrusive key checks.VectorIndexCache
for managing vector index entries with LRU caching.Bug Fixes
Documentation
Refactor
Tests