Add convenience functions for parsing the PQ index #349
If the interface is OK with you, I would prefer to merge it now and update the implementation after Gopal's PR, so that it is simpler to add to my tool. If you want an interface change, we can wait for Gopal's PR.
Agree with @harsha-simhadri that we could hold this off until the datastore/graphstore PR is done, if it is not urgent.
008f5fc to b9eb917
Hi @harsha-simhadri, would you mind re-reviewing and merging? If I move your new …
* move read_nodes to public, add get_pq_vector and get_num_points
* clang-format
* Match new private var naming convention
* more private (_) fixes
* VID->vid
* VID->vid cpp
* add codebook passing and pq/opq dim overwrite. * Support per query filter (#279) * Transferring Varun's chagges from external fork with squash merge * generating multiple gt's for each filter label + search with multiple filter labels (code cleanup) * supporting no-filter + one filter label + filter label file (multiple filters) while computing GT * generating multiple gt's + refactoring code for readability & cleanliness * adding more tests for filtered search * updating pr-test to test filtered cases * lowering recall requirement for disk index * transferred functions to filter_utils * adding more test for build and search without universal label * adding one_per_point distribution to generate_synthetic_labels + cleaning up artifacts after compute gt+ removing minor errors * refactoring search_disk_index to use a query filter vector --------- Co-authored-by: patelyash <patelyash@microsoft.com> Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> * Rebasing main's latest commits onto ravi/filter_support_rebased (#225) - add code for two variants of filtered index, readme and CI tests - add utils for synthetic label generation and CI tests. 
* Add co-authors Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> --------- Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: David Kaczynski <dkaczynski@microsoft.com> Co-authored-by: Siddharth Gollapudi <t-gollapudis@microsoft.com> Co-authored-by: Neelam Mahapatro <nmahapatro@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri <harshasi@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> Co-authored-by: REDMOND\patelyash <patelyash@microsoft.com> Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> * Clang-format now errors on push and PR if formatting is incorrect (#236) * Rather than sift through all the *.cpp and *.h in the root directory, we're looking for only the sources in our main repository for formatting. Git submodules are excluded * Removing the --Werror flag only until we actually format all of the code in a future commit * We're choosing to base our style on the Microsoft style guide and not make any changes * Running format action on source code. Settling on Google styling. Settled on '.clang-format' instead of '_clang-format'. Fixed instructions such that only clang-format 12 is installed (13 changes SortIncludes options from true/false to a trinary set of options, none of which include the word 'false') * Enabling error on malformatted file * Revert "Enabling error on malformatted file" This reverts commit fa33e8284cb9ee815d882e516aaeb7be6800a982. * Revert "Running format action on source code. Settling on Google styling. Settled on '.clang-format' instead of '_clang-format'. Fixed instructions such that only clang-format 12 is installed (13 changes SortIncludes options from true/false to a trinary set of options, none of which include the word 'false')" This reverts commit e0281bec8c265ecd3b56d65f61e768238ed8b1c1. 
* Trying again; formatting rules based on Google rules, disables sorting includes as that breaks us, and enabling check on build. * Somehow this was missed in the mass format. Formatting include/distance.h. * Manually fixing the formatting because clang-format wouldn't, but WOULD flag it as invalid * Update SSD_index.md (#258) Fix typo in SSD index readme * Add filter-diskann paper link to readme (#275) * Update README.md (#277) * update citation (#281) * Some fixes to pass internal building pipeline (#282) Remove warnings affecting internal build pipelines --------- Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Add support for multiple frozen points (#283) * Add support for multiple frozen points * Add the missing parameters to the constructor. * Added filtered disk index readme (#276) * Added filtered disk index readme * Support per query filter (#279) * Transferring Varun's chagges from external fork with squash merge * generating multiple gt's for each filter label + search with multiple filter labels (code cleanup) * supporting no-filter + one filter label + filter label file (multiple filters) while computing GT * generating multiple gt's + refactoring code for readability & cleanliness * adding more tests for filtered search * updating pr-test to test filtered cases * lowering recall requirement for disk index * transferred functions to filter_utils * adding more test for build and search without universal label * adding one_per_point distribution to generate_synthetic_labels + cleaning up artifacts after compute gt+ removing minor errors * refactoring search_disk_index to use a query filter vector --------- Co-authored-by: patelyash <patelyash@microsoft.com> Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> * udpate merging code * Using boost program options under Visual Studio MSVC 14.0 Assertion failed * some commts and rewriting * add back LF which might be confict with MSVC 14.0 * clang formating change * clang formating * revert back 
to Lf * unexpected failure on UT re-try * adding default string to the path * fix reference issue * Fixing Build errors in remove_extra_typedef (#290) remove _u, _s typedefs * converting uint64's to size_t where they represent array offsets --------- Co-authored-by: harsha vardhan simhadri <harsha.v.simhadri@gmail.com> * clang format * bump it up to 512 for MAX_PQ_CHUNKS * default codebook prefix value pass in for generate_quantized_data * add check for disabling both -B and -QD pass in * remove rules for force only one of -B and -QD * clange change * change clang format * bring back -B params * generate_quantized_data pass in referemce instead of const string * update clang and param reference * updated dockerfile (#299) * updated dockerfile * add parallel build flag to dockerfile * Adds CI jobs to build our docker container (#302) * Adding a step that at least builds the docker container. I'm not yet sure how I want to actually integrate tests within the container, but at the least we should verify it builds * docker build needs a path. i honestly thought it defaulted to the CWD --------- Co-authored-by: Dax Pryce <daxpryce@microsoft.com> * Python API and Test Suite (#300) * The first step in the python-api-enhancements branch. We need to fix a problem with the Parameters class with a double free or segfault on deletion. * Removing the parameters class in favor of the IndexRead and IndexWrite parameters classes. * API changes and python packaging changes for linux. It's almost ready for PR, but definitely ready for push. * Suppressing the CIBuildWheel step on windows * added in-mem static and dynamic index class to python bindings (#301) * Advancing our version number to 0.5.0 * Some more updates as per harsha's comments on PR #300. 
The diskann_bindings.cpp still need some more tlc and the wrapper needs to make use of it, and we also want to include some examples, but this is a good place to bring into main and then do further enhancements --------- Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> * reducing number of L values for stitched search (#307) * reducing number of L values for stitched search in CI * add a warning in prune_neighbor if zero distance neighbor is detected (#320) * Fix condition on ubuntu version in README (#246) * Fix building SSD index performance issue (#321) Fix performance gap between in-mem and SSD based graph built by passing an appropriate number of threads. --------- Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> * remove the distance 0 warning in prune candidate the list, since diskann::cerr does not seem thread safe (#330) * Set compile warning as error for core projects (#331) * set(CMAKE_COMPILE_WARNING_AS_ERROR ON) --------- Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Create a data store abstraction (#305) Create a virtual data store base class and a derived in-mem store class. In-mem index now uses the data store class. --------- Co-authored-by: Gopal Srinivasa <gopalsr@microsoft.com> Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: yashpatel007 <patelyash1311@gmail.com> * Disabling Python builds (#338) * Disabling Python builds debian stretch no longer seems to have valid apt repos - or at least not ones that we can access - which means our cibuildwheel is failing. 
* New python interface, build setup, apps and unit tests (#308) --------- Co-authored-by: Dax Pryce <daxpryce@microsoft.com> * Adding some diagnostics to a pr build in an attempt to see what is going on with our systems prior to running our streaming/incremental tests * fix cast error and add some status prints to in-mem-dynamic app * Adding unit tests for both memory and disk index builder methods * After the refactor and polish of the API was left half done, I also left half a jillion bugs in the library. At least I'm confident that build_memory_index and StaticMemoryIndex work in some cases, whereas before they barely were getting off the ground * Sanity checks of static index (not comprehensive coverage), and tombstone file for test_dynamic_memory_index * Argument range checks of some of the static memory index values. * fixes for dynamic index in python interface (#334) * create separate default number of frozen points for dynamic indices * consolidate works * remove superfluous param from dynamic index * remove superfluous param from dynamic index * batch insert and args modification to apps * batch insert and args modification to apps * typo * Committing the updated unit tests. At least the initial sanity checks of StaticMemory are done * Fixing an error in the static memory index ctor * Formatting python with black * Have to disable initial load with DynamicMemoryIndex, as there is no way to build a memory index with an associated tags file yet, making it impossible to load an index without tags * Working on unit tests and need to pull harsha's changes * I think I aligned this such that we can execute it via command line with the right behaviors * Providing rest of parameters build_memory_index requires * For some reason argparse is allowing a bunch of blank space to come in on arguments and they need stripped. It also needs to be using the right types. 
* Recall test now works * More unit tests for dynamic memory index * Adding different range check for alpha, as the values are only really that realistic between 1 and 2. Below 1 is an error, and above 2 we'll probably make a warning going forward * Storing this while I cut a new branch and walk back some work for a future branch * Undoing the auto load of the dynamic index until I can debug why my tag vector files cause an error in diskann * Updating the documentation for the python bindings. It's a lot closer than it was. * Fixing a unit test * add timers to dyanmic apps (#337) * add timers to dyanmic apps * clang format * np.uintc vs. int for dtype of tags * fixes to types in dynamic app * cast tags to np.uintc array * more timers * added example code in comments in app file * round elapsed * fix typo * fix typo --------- Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> Co-authored-by: harsha vardhan simhadri <harsha.v.simhadri@gmail.com> * Harshasi/timer python app (#341) * added timer and QPS to static search app * search only option to static index * search only option to static index * exposing metric in static function * Force error on warnings and add casts to test directory (#342) * Force error on warnings and add casts to test directory * Use size_t for index of point IDs * Refactor iterator and conditions for printing labels --------- Co-authored-by: David Kaczynski <dkaczynski@microsoft.com> * Enable Windows python bindings (#343) * Use int64 for counter to fix windows compilation error * Fix windows python bindings by adding install_lib command to move windows build output into python package * Update to use Path instead of os * Change batch_insert num_inserts signature to signed type for OpenMP compatibility * Update num_inserts to int32_t per PR request --------- Co-authored-by: Nick Caurvina <nicaurvi@microsoft.com> * Use new macro(ENABLE_CUSTOM_LOGGER) to turn on Custom logger (#345) * custom logger --------- 
Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * updting from std cpp 14 to cpp 17 (#352) * updting from std cpp 14 to cpp 17 * adding cmake_cxx_standard flag * CICD Refactor (#354) * Refactored the build processes. Broke things into components as much as possible. We have standalone actions for the build processes to make sure they are consistent across push or PR builds, a format-check that doesn't rely on cmake to be there to work, and centralized our randomized data generation into a single action that can be called in each section. We now are reusing as many of the steps as we can without copy/pasting, which should ensure we're not making mistakes. * Fixing the dynamic tests, the paths to the data were wrong --------- Co-authored-by: yashpatel007 <patelyash1311@gmail.com> * Fix the disparity between disk and memory search for Universal label (#347) * UNV Search Fix for Memory * two places to update * clang format * unify find_common_filters function * fix comments - only return size of common filters from the find_common_filters function * dummy comments * clang format * Reduce repetitive calls * changing name and return type of function * Remove compute_groundtruth from labels.yml (#363) Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Handle some corner cases in generate_cache_list_from_sample_queries (#361) Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Reduce the size of coord_scratch in SSDQueryScratch to reduce memory usage (#362) * Remove useless coord_scratch in SSDQueryScratch to reduce memory usage --------- Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Upload data and binary files to artifact in CI workflows (#366) * Upload data and binary files to artifact so that we could debug issue locally when the workflows fails * use different artifact name for different scenarios --------- Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Python Type Enhancements (#364) * Adding cosine distance - I didn't know we had that as a first 
level distance metric * Making our mkl and iomp linking game more rigorously defined for the ubuntus * Included latest as a path fragment twice on accident * libmkl_def.so is named something different when installed via the intel oneapi installer * Making a number of changes to homogenize our api (same parameters, minimize parameters as much as possible, etc) * Stashing this and going to work on the CICD stuff, it's driving me nuts * Fairly happy with the Python API now. Documentation needs another pass, the @overloads in the .pyi files need to be addressed, and documentation checked again. The apps folder also needs updating to use fire instead of argparse * Updated build to not use tcmalloc for pybind, as well as fixed the pyproject.toml so that cibuildwheel can actually successfully build our project. * Making a change to in-mem-static for the new api and also adjusting the comment in in-mem-dynamic a bit, though... I probably shouldn't have * Add unit test project based on boost_unit_test_framework (#365) * Add unit test project based on boost_unit_test_framework * Add another dockerfile for developers * update path --------- Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> * Fix inefficiency in constructing reverse label map (#373) * single loop for reverse label map * clang formatting * unnecessary comments removed * minor --------- Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> * fixed a bug with loading medoids for sharded filtered index, and adde… (#368) * fixed a bug with loading medoids for sharded filtered index, and added better caching for filtered index clang-format fixed minor cout error addressed Yiyong's comments, and fixed a bug for finding medoid in sharded+filtered index Fixed windows compile error (warnings) Fix inefficiency in constructing reverse label map (#373) * single loop for reverse label map * clang formatting * unnecessary comments removed * minor --------- Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> 
clang-formatted * minor cleanup * clang-format --------- Co-authored-by: ravishankar <rakri@microsoft.com> * patelyash/index factory (#340) * gi# This is a combination of 2 commits. remove _u, _s typedefs * added some seed files * add seed files * New distance metric hierarchy * Refactoring changes * Fixing compile errors in refactored code * Fixing compile errors * DiskANN Builds with initial refactoring changes * Saving changes for Ravi * More refactoring * Refactor * Fixed most of the bugs related to _data * add seed files * gi# This is a combination of 2 commits. remove _u, _s typedefs * added some seed files * New distance metric hierarchy * Refactoring changes * Fixing compile errors in refactored code * Fixing compile errors * DiskANN Builds with initial refactoring changes * Saving changes for Ravi * More refactoring * Refactor * Fixed most of the bugs related to _data * Post merge with main * Refactored version which compiles on Windows * now compiles on linux * minor clean-up * minor bug fix * minor bug * clang format fix + build error fix * clang format fix * minor changes * added back the fast_l2 feature * added back set_start_points in index.cpp * Version for review * Incorporating Harsha's comments - 2 * move implementation of abstract data store methods to a cpp file * clang format * clang format * Added slot manager file (empty) and fixed compile errors * fixed a linux compile error * clang * debugging workflow failure * clang * more debug * more debug * debug for workflow * remove slot manager * Removed the #ifdef WINDOWS directive from class definitions * Refactoring alignment factor into distance hierarchy * Fixing cosine distance * Ensuring we call preprocess_query always * Fixed distance invocations * fixed cosine bug, clang-formatted * cleaned up and added comments * clang-formatted * more clang-format * clang-format 3 * remove deleted code in scratch.cpp * reverted clang to Microsoft * small change * Removed slot_manager from this PR * 
newline at EOF in_mem_Graph_store.cpp * rename distance_metric to distance_fn * resolving PR comments * minor bug fix for initialization * creating index_factory * using index factory to build inmem index * clang format fix * minor bug fix * fixing build error * replacing mem_store with abstract_mem_store + injecting data_store to Index * minor fix * clang format fix * commenting data_store injection to prevent double invocation and mem leak (for now) * fixing the build for fiters * moving abstract index to abstract_index.h * IndexBuildParamsbuilder to build IndexBuildParams properly with error checking * fixing build errors * fixing minor error * refactoring index search to be simple * clang format fix * refactoring search_mem_index to use index factory * clang fix * minor fix * minor fix for build * optimize for fast l2 restore * removing comments * removing comments * adding templating to IndexFactory (can't avoide it anymore) * fixing build error * fixing ubuntu build error * ubuntu build exception fix * passing num_pq_bytes * giving one more shot to config dricen arch with boost::any (type erasure) * clang fix * modifying search to use boost::any * fixing ubuntu build errors/warning * created indexconfigbuilder and fixed a typo * fixing error in pq build * some comments + lazy_delete impl * bumping to std c++17 & replacing boost::any with std::any * clang fix * c++ std 17 for ubuntu * minor fix * converting search to batch_search + A vector wrapper using std::any to store vector as a shared ptr * adding AnyVector to encapsulate vector in std::any + adding basic yaml parser(WIP) * adding wrapper code for vector and set, checked with Andrija * fixinh ubuntu build error * trying to resolve ubuntu build error * testing test streaming index with IndexFactory * fixing ubuntu build error * fixing search for test insert delete consolidate * refactored test_streaming_scenario * refactored test_insert_delete_consolidate to use AbstractIndex and Indexfactory * fixing 
ubuntu build error * making build method in abstract index consistent * some code cleanup + abstract_cpp to add implementation * remoing coments and code cleanup * build error fix * fixing -Wreorder warning * separating build structs to their header + refactor search and remove batch search * fixing ubuntu build errors * resolving segfault error from search_mem_index * fixing query_result_tag allocation * minor update * search fix * trying to fix windows latest build for dynamic index * ading temp loggin to debug windows latest build issue * removing logging for debug * fixning windows latest build error for dynamix index search * moving any wrappers to separate file + organizing code * fixing check error * updating private vsr naming convention * minor update * unravelig search methods in abstract index. Iteraton 1 * minor fix * unused vars remove * returning a unique_ptr to Abstract Index from index factory * adding implementation from abstract_index.h to abstract_index.cpp * making abstract index api to be more explicit (expriment) * some code cleanup * removing detected memory leaks (free up index) * separtaing enums for data and graph stratagy * Index ctor(config) now uses injected datastore from IndexFactory * distance in index population in new config ctor * resolving some comments from Andrija * Resolving some restructuring comments by Andrija * minor fix * fixing ubuntu build error * warning fix * simplified get() in anywrappers * making index config a unique ptr and owned by IndexFactory * removing complex if/else calling recursively + added unimplemented TagT to AbsIdx * renaming get_instance to create_instance * clang format fix * removing const_cast from any_wrapper * fixing andrija's comments * removing warnings --------- Co-authored-by: harsha vardhan simhadri <harsha.v.simhadri@gmail.com> Co-authored-by: Gopal Srinivasa <gopalsr@microsoft.com> Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri 
<harsha-simhadri@users.noreply.github.com> * patelyash/index factory (#340) (#380) --------- Co-authored-by: Yash Patel <47032340+yashpatel007@users.noreply.github.com> Co-authored-by: harsha vardhan simhadri <harsha.v.simhadri@gmail.com> Co-authored-by: Gopal Srinivasa <gopalsr@microsoft.com> Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> * hot fix for python build (#383) * some bug fix when enable the EXEC_EnV_OLS (#377) * some bug fix when enable the EXEC_EnV_OLS * avoid unit test failure * unit test testing * changed based on gopal's suggestion * update load_impl(AlignedFileReader &reader) * change the load_impl to be identical to objectstore * remvoe blank * Output distance file in memory index search (#382) * Output distance file * fix --------- Co-authored-by: Shengjie Qian <shenqian@microsoft.com> * Add WIN macro for non-win function (#360) * Add WIN macro for non-win funtion * fix vc16 compile issue * fix compile issue * fix compile issue * fix compile issue * clean up code * small EXEC_ENV_OLS bug fix (#387) * small bug fix * test ubuntu fail * formatting * re-triggering unitest * Python Refactor (#385) * Refactor of diskannpy module code. * 0.5.0.rc1 for python and enabling the build-python portion of the pr-test process. * clang-format changes * In theory this should speed up the python build drastically by only building the wheel for the python version and OS we're attempting to fan out to in our CICD job tree * Missed a dollar sign * Copy/pasting left a CICD step name that implied we were running a code formatting check when instead we were building a wheel. This is now fixed. * In theory, readying the release action too. 
We won't know if it works until it merges and we cut a release, but at least the paths have been fixed * Designated initializers just happened to work on linux but shouldn't have as they weren't added until cpp20 * Formatting * Jinweizhang/filter paramsfix (#388) * small bug fix * test ubuntu fail * formatting * re-triggering unitest * cause error, remove two character params * cause error, remove two character params * unit test fix * clean up code * add more accurate error handelling * fix filter build * re-trigger test * try lower recall number * test witl more value * revert back to test unit test * Update python-release.yml Github actions fix: composite action `python-wheel` publishes wheels to the `wheels` artifact. `python-release` workflow then looks for it in the `dist` artifact, which does not exist. This is a CICD change only. * Fixed inputs type-o (#391) * Fixed inputs type-o * Action 'checkout@v2' is deprecated * Update pyproject.toml Trying a new release of the python lib to see if there was a packaging error in the publication of rc1. * Fixed param documentation (#393) * Fixed param name in comments * Hide rust/target * Bypass errors in logging for non-msft-prod environments (#392) * Removed the logger and verified that the logging capability is the root cause of our consistent segfault errors in python. Perhaps it also will fix any issues in our label test too? I'd like to push it to GH and see. * Formatting fixes * Revert "Formatting fixes" This reverts commit 9042595614c0f3b5e72f61090538abdb6510af14. * Revert "Removed the logger and verified that the logging capability is the root cause of our consistent segfault errors in python. Perhaps it also will fix any issues in our label test too? I'd like to push it to GH and see." This reverts commit 7561009932ff109ed386c4f5d50983859e49b9e7. * The custom logging implementation is causing segfaults in python. 
We're not sure exactly where, but this is the easiest and quickest way to getting a working python release. * All the integration tests are failing, and there's a chance the virtual dtor on AbstractDataStore might be the culprit, though I am not sure why. I'm hoping it is so it won't fall on the logging changes. * Formatting. Again. * Improve help formatting in CLI tools (#390) * Added utilities to standardize help across cli tools. #370 * Made three option groupings (required/optional/print) * Moved common parameter descriptions to a common file. #370 * Updated usage statement for search_disk_app #370 * Updated range_search_disk_index to use the new required/optional format. #370 * Updated test apps to use the new help format. #370 * Fixed format issue. #370 * Updated help format for the 'build' apps. #370 * Fixed code formatting. #370 * Added src/*.hpp to the clang format. #370 * Moved header into the headers directory. #370 * Added missing configs. #370 * Removed superflous paths from include. #370 * Added #pragma once. #370 * Type-o fixes. #370 * Fixed capitolization of constant. #370 * Make fail_if_recall description more accurate. #370 * Changed to using set notation. #370 * Better explanations for some options. #370 * Added short explanation of file format. #370 --------- Co-authored-by: Jon McLean <none@example.com> Co-authored-by: Jonathan McLean <Jonathan.McLean@microsoft.com> * Python build with a far more portable wheel (#396) * Identified the appropriate build flags to get a working python build that doesn't rely on -march=native or -mtune=native. We've run benchmarks on multiple computers that indicate the only important flag other than -mavx2 -msse2 -mfma is -funroll-loops. Optimization levels such as -O1, -O2, or -O3 actually makes for less performant code. 
-Ofast is unavailble for use in Python, as it causes problems with floating point math in Python * 1.22 was left in a comment despite 1.25 being the value specified * Python 3.8 is not supported by numpy 1.25, so we're removing it. * Jomclean/write timings (#397) * Work-in-progress commit adding JSON output for timings. in-mem-static is complete * Added timings to dynamic and total-time to static * Update pyproject.toml (#398) Using the correct README for our publication to pypi. * Added filename to log (#399) * Jinwei/fix in memory compile error (#401) * small bug fix * test ubuntu fail * formatting * re-triggering unitest * add small fix for in_mem_data_store when EXEC_ENV_OLS is enabed * fix: use the passed in io_limit (#403) * fix: use the passed in io_limit * fix to be clang-formatted * DynamicMemoryIndex bug fixes (#404) * While simply creating a unit test to repro Issue #400, I found a number of bugs that I needed to address just to get it to work the way I had intended. This does not yet have what I would consider a comprehensive suite of test coverage for the DynamicMemoryIndex, but we at least do save it with the metadata file, we can load it correctly, and saving *always* consolidate_deletes() prior to save if any item has been marked for deletion prior to save. * We actually cannot save without compacting before save anyway. Removing the parameter from save() and hardcoding it to True until we can actually support it. * Addressing some PR comments and readying a 0.5.0.rc5 release * Pass nullptr as nullT when creating thread_data that's of ConcurrentQueue<SSDThreadData*> type, otherwise the default null_T is uninitialized, could point to arbitraty memory (#408) * Preparing for 0.6.0 diskannpy release (#407) * Some early staging for README updates and pyproject updates for a 0.6.0 release for diskannpy. * Trying to fix the CI badge to point toward main's latest build * Updating documentation for pdoc generation * Documentation updates. 
Tightened up the API to drop list support (there were entirely too many cases where it wouldn't work, and it's easier to just tell people to convert it themselves) * Some module reorganization to make pdoc actually display the docstrings for variables re-exported at the top level * A copy paste happened that shouldn't have. * Updating the apps to use the new 0.6.0 api * Addressing PR feedback * Some of the documentation changes didn't get made in both from_file or the constructor * Added PDoc workflow to publish github pages documentation (#412) * Added PDoc workflow * Added documentation to the push-test workflow * Added diskannpy to the env for pdoc to use * Initial commit of doc publish workflow * Tried heredoc to get python version * Tried another way of getting the version * Tried another way of getting the version * Moved to docs/python path * Removing the test harness * Add dependencies per wheel * Moved dependency tree to the 'push' file so it runs on push * Added label name to the dependency file * Trying maxtrix.os to get the os and version * Moved doc generation from push-test to python-release. Will add 'dev' doc generation to push-test * Publish latest/version docs only on release. Publish docs for every dev build on main. * Install the local-file version of the library * Disable branch check so I can test the install * Use python build to build a wheel for use in documentation * Tried changing to python instead of python3 * Added checkout depth in order to get boost * Use the python build action to create wheel for documentation * Revert "Use the python build action to create wheel for documentation" This reverts commit d900c1d42c0f4bc8295955e0d6da7a868a073661. 
* Added linux environment setup * Made only publish dev when on main and added comments --------- Co-authored-by: Jonathan McLean <Jonathan.McLean@microsoft.com> * Update README.md (#416) * moved ssd index defaults to defaults.h (#415) * moved ssd index constants to defaults.h * Add Performance Tests (#421) * Have a working dockerfile to run perf tests and report the times they take. We can also capture stdout/stderr with it for further information, especially for tools that report internal latencies. * Slight changes to the perf test script, a perf.yml for the github action * allow multi-sector layout for large vectors (#417) * make sector node an inline function * convert offset_node macro to inline method * rename member vars to start with underscore in pq_flash_index.h * added support in create_disk_index * add read sector util * load_cache_list now uses read_blocks util * allow nullptr for read_nodes * BFS cache generation uses util * add num_sectors info to cache_beam_Search * add CI test for 1020,1024,1536D float and 4096D int8 rand vector on disk * Consolidate Index Constructors (#418) * initial commit * updating python bindings to use new ctor * python binding error fix * error fix * reverting some changes -> experiment * removing redundnt code from native index * python build error fix * tyring to resolve python build error * attempt at python build fix * adding IndexSearchParams * setting search threads to non zero * minor check removed * eperiment 3-> making distance fully owned by data_store * exp 3 clang fix * exp 4 * making distance as unique_ptr * trying to fix build * finally fixing problem * some minor fix * adding dll export to index_factory static function * adding dll export for static fn in index_factory * code cleanup * resolving gopal's comments * resolving build failures * Add convenience functions for parsing the PQ index (#349) * move read_nodes to public, add get_pq_vector and get_num_points * clang-format * Match new private var naming 
convention * more private (_) fixes * VID->vid * VID->vid cpp * fix OLS build (#428) * fix OLS build * Add a build to CI with feature flags enabled * In Memory Graph Store (#395) * inmem_graph_store initial impl * barebones of in mem graph store * refactoring index to use index factory * clang format fix * making enum to enum class (c++ 11 style) for scope resolution with same enum values * cleaning up API for GraphSore * moving _nd back to index class * resolving PR comments * error fix * error fix for dynamic * resolving PR comments * removing _num_frozen_point from graph store * minor fix * moving _start back to main + minor update in graph store api to support that * adding requested changes from Gopal * removing reservations * resolving namespace resolution for defaults after build failure * minor update * minor update * speeding up location update logic while repositioning * updated with reserving mem for graph neighbours upfront * build error fix * minor update in assert * initial commit * updating python bindings to use new ctor * python binding error fix * error fix * reverting some changes -> experiment * removing redundnt code from native index * python build error fix * tyring to resolve python build error * attempt at python build fix * adding IndexSearchParams * setting search threads to non zero * minor check removed * eperiment 3-> making distance fully owned by data_store * exp 3 clang fix * exp 4 * making distance as unique_ptr * trying to fix build * finally fixing problem * some minor fix * adding dll export to index_factory static function * adding dll export for static fn in index_factory * code cleanup * resolving errors after merge * resolving build errors * fixing build error for stitched index * resolving build errors * removing max_observed_degree set() * removing comments + typo fix * replacing add_neighbour with set_neighbours where we can * error fix * Undo mistake, let frontier read in PQ flash index be asynchronous (#434) * Undo 
mistake, let frontier read in PQ flash index be asynchronous * address changes requested * Reduce CI tests for multi-sector disk layout from 10K to 5K points so… (#439) * Reduce CI tests for multi-sector disk layout from 10K to 5K points so they run faster * turn off 1024D * hot fix definate mem_leaks (#440) * add num_Threads to indexwriteparams in sharded build (#438) * Added clarity to the universal label (#442) * Remove IndexWriteParams from build method. (#441) * removing write_params from buidl and taking it upfront in Index Ctor * renaming build_params to filter params * Type hints and returns actually align this time. (#444) * working draft PR for cleaning up disk based filter search (#414) * made changes to clean up filter number conversion, and fixed bug with universal filter search * minor typecast fix --------- Co-authored-by: rakri <rakri@microsoft.com> * Fixes #432, bug in using openmp with gcc and omp_get_num_threads() (#445) * Fixes #432, bug in using openmp with gcc and omp_get_num_threads() only reporting the number of threads collaborating on the current code region not available overall. I made this error and transitioned us from omp_get_num_procs() about 5 or 6 months ago and only with bug #432 did I really get to see how problematic my naive expectations were. * Removed cosine distance metric from disk index until we can properly fix it in pqflashindex. Documented what distance metrics can be used with what vector dtypes in tables in the documentation. 
* Preparing for 0.6.1 release (#447) * Release documentation from the release tag instead of main (#448) * Build streaming index of labeled data (#376) * Add bool param for building a graph of labeled data * Add arguments for building labeled index * Pass arguments for labeled index * Light renaming * Handle labels in insert_point * Fix missing semicolon * Add initial label handling logic * Use unlabeled algo for uniquely labeled point * Ignore frozen points when checking labels * Fix missing newline * Move label-specific logic to threadsafe zone * Check for frozen points when assert num points and num labeled points * Fix file name concatenation for label metadata * inmem_graph_store initial impl * Use Lbuild to append to pruned_list during filter build * Add label counts for deleting from streaming index * Fix typo * Fix conditions for testing * Add medoid search to support deleting label medoids from graph * resolvig error with bfs_medoid_search() * trying to create 2 pruned_lists and combine them * Clear pool between calls to search_for_point_and_prune. 
Fix integer math * Update pruned_list algo for link method * making fz_points to be medoids for labels encountered * repositioning medoids as well because they are fz points when compacting data * removing unrequired method * rebasing from main * adding tests in yml workflow for dynamic index with labels * quick fix * removing combining of unfiltered + filtered list for now * trying to resolve disk search poor performance * incleasing L size while searching disk index * minor roolback * updating dynamic-label to not use tag file while computing GT * altering some test search L values * adding unfiltered search for filtered batch build index * adding compute gt for zipf dist labels in labsls wowrkflow * searching filtered streaming index with popular label for now * reposition fz points as medoids for filtered dynamic build * minor renaming vars * seoparate functio for insert opoint with labels and without labels * clang error fix * barebones of in mem graph store * refactoring index to use index factory * clang format fix * window build fix * making enum to enum class (c++ 11 style) for scope resolution with same enum values * cleaning up API for GraphSore * resolving comments * clang error fix * adding some comments * moving _nd back to index class * removing funcrion reposition medoidds its not required, incorporated into reposition_points * altering -L (32->5) and -R (16->32) whhile building filterted disk index to work well with modified connections in algo * updating docs -> dynamic_index.md to have info on how to build and search filtered dynamic index * updating docs * updateing _pts_to_labels when repositioning fz_points * error fix * clang fix * making sure _pts_to_labels are not empty * fixing dynamic-label build error * code improvements * adding logic for test_ins_del_consolidate to support filtered index * resolving PR comments * error fix * error fix for dynamic * now test insert delete consolidate support building filters * lowering recal in case of 
test insert delete consolidte * resolving PR comments * removing _num_frozen_point from graph store * minor fix * moving _start back to main + minor update in graph store api to support that * adding a lock before detect_common_filter + minor naming improvement * adding requested changes from Gopal * removing reservations * resolving namespace resolution for defaults after build failure * minor update * minor update * speeding up location update logic while repositioning * updated with reserving mem for graph neighbours upfront * build error fix * minor update in assert * initial commit * updating python bindings to use new ctor * python binding error fix * error fix * reverting some changes -> experiment * removing redundnt code from native index * python build error fix * tyring to resolve python build error * attempt at python build fix * adding IndexSearchParams * setting search threads to non zero * minor check removed * eperiment 3-> making distance fully owned by data_store * exp 3 clang fix * exp 4 * making distance as unique_ptr * trying to fix build * finally fixing problem * some minor fix * adding dll export to index_factory static function * adding dll export for static fn in index_factory * code cleanup * resolving errors after merge * resolving build errors * fixing build error for stitched index * resolving build errors * removing max_observed_degree set() * removing comments + typo fix * replacing add_neighbour with set_neighbours where we can * error fix * minor fix * fixing error introduced while rebasing * fixing error for dynamic filtered index * resolving dynamic build deadlick error * resolving error with test_insert_del_consolidate for dynamic filter build * minor code cleanup * refactoring fz_pts and filter_index to be property of IndexConfig and hence Index * removing write_params from build() * removing write_params from buidl and taking it upfront in Index Ctor * minor fix * renaming build_params to filter params * fixing errors on auto 
merge * auto decide universal_label experiment * resolving bug with universal lable * resolving dynamic labels error, if there are unused fz points * exposing set_universal_label() through abstract index * minor update: sanity check * minor update to search * including tag file while computing GT * generating compacted label file and using it in generate GT * minor fix * resolving New PR comments (minor typo fixes) * renaming _pts_to_labels to _tag_to_labels + adding a warning for consolidate deletes and quality of index * minor name chnage + code cleanup * clang format fix * adding locks for filter data_structures * avoiding deadock * universal label defination update * reverting locks on _location_to_labels as its causing problems with large dataset * adding locks for _label_to_medoid_id * Update dynamic_index.md * Update dynamic-labels.yml * renaming some variables --------- Co-authored-by: David Kaczynski <dkaczynski@microsoft.com> Co-authored-by: yashpatel007 <patelyash1311@gmail.com> Co-authored-by: Yash Patel <47032340+yashpatel007@users.noreply.github.com> Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> * Fix typo in SSD_index.md (#466) * add check for .enc extension to support encryption (#467) * add check for .enc extension to support encryption * check rotation_matrix file in file blobs * read from MemoryMappedFile when EXEC_ENV_OLS is defined (#471) * read from MemoryMappedFile when EXEC_ENV_OLS is defined * fix is_open/close which stringstream does not have * fix formating to comply with clang * fix labels.yml: create tmp directory before search_diskk_index is run * fix to reset stream after reads * rename 'content' variable to avoid duplicates (#475) * read file in one time (#460) * read whole label file to memory, use string find instead stringstream * format doc * Bump rustix from 0.37.20 to 0.37.25 in /rust (#479) Bumps [rustix](https://github.com/bytecodealliance/rustix) from 0.37.20 to 0.37.25. 
- [Release notes](https://github.com/bytecodealliance/rustix/releases) - [Commits](https://github.com/bytecodealliance/rustix/compare/v0.37.20...v0.37.25) --- updated-dependencies: - dependency-name: rustix dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * correct index_path_prefix in __init__ function of static disk index (#483) * Adding Filtered Index support to Python bindings (#482) * Halfway approach to the new indexfactory, but it doesn't have the same featureset as the old way. Committing this for posterity but reverting my changes ultimately * Revert "Halfway approach to the new indexfactory, but it doesn't have the same featureset as the old way. Committing this for posterity but reverting my changes ultimately" This reverts commit 03dccb599449881f64664a10b397a790a7d00985. * Adding filtered search. API is going to change still. * Further enhancements to the new filter capability in the static memory index. * Ran automatic formatting * Fixing my logic and ensuring the unit tests pass. * Setting this up as a rc build first * list[list[Hashable]] -> list[list[str]] * Adding halfway to a solution where we query for more items than exist in the filter set. We need to replicate this behavior across all indices though - dynamic, static disk and memory w/o filters, etc * Removing the import of Hashable too * Fixing index_prefix_path bug in python for StaticMemoryIndex (#491) * Fixing the same bug I had in static disk index inside of static memory index as well. 
* Unit tests and a better understanding of why the unit tests were successful despite this bug * Handle io_setup error properly (#465) * Address race condition in `iterate_to_fixed_point` (#478) Co-authored-by: Siddharth Gollapudi <t-gollapudis@microsoft.com> * Use TCMalloc to fix system memory leak (#494) * add fix for memory leak * cmake change for enable tcmalloc * add hot fix for cmake for boost and tcmalloc * fix indentation * identitation * change camke set on after cmake_minimum_required * unset tcmalloc for PYBIND * unset envirvariable beforehead * set off * exlucde the compile def for pybind * disable for pybind * Adding a new PQ Distance Metric and PQ Data Store (#384) * Added PQ distance hierarchy Changes to CMakelists PQDataStore version that builds correctly Clang-format * Fixing compile issues after rebase to main * minor renaming functions * fixed small bug post rebasing with index factory * Changes to index factory to support PQDataStore * Merged graph_store and pq_data_store * Implementing preprocessing for inmemdatastore * Incorporating code review comments * minor bugfix for PQ data allocation * clang-formatted * Incorporating CR comments * Fixing compile error * minor bug fix + clang-format * Update pq.h * Fixing warnings about struct/class incompatibility --------- Co-authored-by: Gopal Srinivasa <gopalsr@microsoft.com> Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: gopalrs <33950290+gopalrs@users.noreply.github.com> * Bump zerocopy from 0.6.1 to 0.6.6 in /rust (#499) Bumps [zerocopy](https://github.com/google/zerocopy) from 0.6.1 to 0.6.6. - [Release notes](https://github.com/google/zerocopy/releases) - [Changelog](https://github.com/google/zerocopy/blob/main/CHANGELOG.md) - [Commits](https://github.com/google/zerocopy/commits) --- updated-dependencies: - dependency-name: zerocopy dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix calculation of current_point_offset in test_insert_consolidate_deletes (#501) The program builds the streaming index after two optional steps: 1) skipping S points from the input file and 2) batch building of initial index using B points from the input file. After these two steps, the offset to the input file should be S + B, but the current code first sets it to S in line 163 then overwrites it to B in line 249, instead of adding B to the offset. The tool which `test_insert_deletes_consolidate` was based on was using `+=` in the modified line. * add16bytes tag type (#506) * add 16 bytes tag type * clean up code * format doc * fix compile issue * fix compile issue * revert change * format doc * separate static search and streaming search * clean up code * resolve comment * format doc * fix test * resolve comment * Rakri/cosine bug fix (#450) * compiles, but need to verify * fixed windows compiler warning * minor typo * added cosine unit test with unnormalized data * minor typo in user prompt cosine/l2 * cosine was already supported in groundtruth, edited the message to say so * clang-format --------- Co-authored-by: rakri <rakri@microsoft.com> * Version bump 0.7.0rc2->0.7.0 (#510) * Version bump 0.7.0rc2->0.7.0 Preparing diskannpy for 0.7.0 release (filter support, static memory indices only) * Update pyproject.toml the GPG key from (presumably) 2019 is no longer valid * Update pyproject.toml * Update python-release.yml By default, GITHUB_TOKEN no longer has write permissions - you have to explicitly ask for it in the specific job that needs it. We use write permissions to update the Github release action that updates the published build artifacts with the results of the release flow. 
* Allow documentation to be published to our gh-pages branch (#511) * Update push-test.yml (#512) * Bug fix for dlvs (#509) * Fix small bugs for DLVS path. * Easier for user to use. --------- Co-authored-by: REDMOND\ninchen <ninchen@microsoft.com> * add wait() method to AlignedFileReader (#518) * Add simplified functions for product quantization (#514) * Add simplified functions for product quantization * Fixing formatting errors * Fixing clang-format issue * Fixing another set of clang-format issues --------- Co-authored-by: Michael Popov (from Dev Box) <mipopo@microsoft.com> * Create in memory data store/graph store with at least max_points as 1 (#523) * create in memory data store/graph store with at least max_points as 1 * fix code formatting * replace callback driven wait with new Wait() method (#526) * wait on completeCount if callback is used (#532) * Fix PQScratch memory leak (#522) * fix memory leak * FIXED clang-format error * FIXED SSDQueryScratch Destroy OOM * fix compile issue * add interface * add interface * change inteface * move function to public * remove hard code unv label num * fix convert issue * fix some issue * Bump openssl from 0.10.55 to 0.10.60 in /rust (#496) Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.55 to 0.10.60. - [Release notes](https://github.com/sfackler/rust-openssl/releases) - [Commits](https://github.com/sfackler/rust-openssl/compare/openssl-v0.10.55...openssl-v0.10.60) --- updated-dependencies: - dependency-name: openssl dependency-type: indirect ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix issues * fix issues * tune perf * test remove lock * try shared lock * change to shared lock * try perfetch * fix some issues * fix issue * skip unfilter search while Lindex = 1 * reserve queue size with max search lsit * revert change * revert change * clean up code --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: jinwei14 <janviezhang14@gmail.com> Co-authored-by: Yash Patel <47032340+yashpatel007@users.noreply.github.com> Co-authored-by: patelyash <patelyash@microsoft.com> Co-authored-by: Varun Sivashankar <t-varunsi@microsoft.com> Co-authored-by: David Kaczynski <dkaczynski@gmail.com> Co-authored-by: ravishankar <rakri@microsoft.com> Co-authored-by: David Kaczynski <dkaczynski@microsoft.com> Co-authored-by: Siddharth Gollapudi <t-gollapudis@microsoft.com> Co-authored-by: Neelam Mahapatro <nmahapatro@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri <harshasi@microsoft.com> Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com> Co-authored-by: Dax Pryce <daxpryce@microsoft.com> Co-authored-by: Jakub Tarnawski <jakub.tarnawski@microsoft.com> Co-authored-by: Yiyong Lin <lyysdy@foxmail.com> Co-authored-by: Yiyong Lin <yiyolin@microsoft.com> Co-authored-by: Andrija Antonijevic <theantony@users.noreply.github.com> Co-authored-by: Neelam Mahapatro <37527155+NeelamMahapatro@users.noreply.github.com> Co-authored-by: harsha vardhan simhadri <harsha.v.simhadri@gmail.com> Co-authored-by: gopalrs <33950290+gopalrs@users.noreply.github.com> Co-authored-by: Gopal Srinivasa <gopalsr@microsoft.com> Co-authored-by: yashpatel007 <patelyash1311@gmail.com> Co-authored-by: nicaurvi <nyecarr@gmail.com> Co-authored-by: Nick Caurvina <nicaurvi@microsoft.com> Co-authored-by: Varun Sivashankar <44419819+varunsivashankar@users.noreply.github.com> Co-authored-by: rakri 
<78582691+rakri@users.noreply.github.com> Co-authored-by: varat73 <124637813+varat73@users.noreply.github.com> Co-authored-by: JieCin <1875919175@qq.com> Co-authored-by: Shengjie Qian <shenqian@microsoft.com> Co-authored-by: Jon McLean <4429525+jonmclean@users.noreply.github.com> Co-authored-by: Jon McLean <none@example.com> Co-authored-by: Jonathan McLean <Jonathan.McLean@microsoft.com> Co-authored-by: litan1 <106347144+ltan1ms@users.noreply.github.com> Co-authored-by: Philip Adams <35666630+PhilipBAdams@users.noreply.github.com> Co-authored-by: Shawn Zhong <github@shawnzhong.com> Co-authored-by: Huisheng Liu <hliu@microsoft.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Xiangyu Wang <wxyucs@gmail.com> Co-authored-by: Siddharth Gollapudi <siddharthgollapudi99@gmail.com> Co-authored-by: NingyuanChen <chenningyuan008@hotmail.com> Co-authored-by: REDMOND\ninchen <ninchen@microsoft.com> Co-authored-by: Michael Popov <mpopov2012@gmail.com> Co-authored-by: Michael Popov (from Dev Box) <mipopo@microsoft.com> Co-authored-by: luyuncheng <luyuncheng@bytedance.com>
What does this implement/fix? Briefly explain your changes.
Users may need to parse index graph information into different formats for different applications. This PR adds an API for retrieving the neighborhood of a given vector, along with that vector's PQ and full-dimensional data.
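To illustrate the kind of per-point access described above, here is a minimal sketch that uses a toy in-memory stand-in rather than the real disk index. The names `read_nodes`, `get_pq_vector`, and `get_num_points` come from the commit message. Everything else is an assumption made for illustration and may not match the actual signatures in this PR: the `ToyPQIndex` and `PointRecord` types, the `export_all` and `make_toy_index` helpers, and all parameter and return types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PQ-compressed index. This is NOT the real
// DiskANN class; layout and signatures are assumptions for illustration.
struct ToyPQIndex {
    uint64_t num_chunks = 0;                        // PQ bytes stored per point
    std::vector<std::vector<float>> full_vectors;   // full-dimensional data per point
    std::vector<std::vector<uint32_t>> neighbors;   // adjacency list per point
    std::vector<uint8_t> pq_codes;                  // num_points * num_chunks bytes

    uint64_t get_num_points() const { return full_vectors.size(); }

    // Returns a copy of the PQ code for point `vid`.
    std::vector<uint8_t> get_pq_vector(uint64_t vid) const {
        if (vid >= get_num_points()) throw std::out_of_range("vid out of range");
        auto begin = pq_codes.begin() + static_cast<std::ptrdiff_t>(vid * num_chunks);
        return std::vector<uint8_t>(begin, begin + static_cast<std::ptrdiff_t>(num_chunks));
    }

    // Returns (full vector, neighbor list) for each requested point id.
    std::vector<std::pair<std::vector<float>, std::vector<uint32_t>>>
    read_nodes(const std::vector<uint32_t>& vids) const {
        std::vector<std::pair<std::vector<float>, std::vector<uint32_t>>> out;
        out.reserve(vids.size());
        for (uint32_t vid : vids) {
            out.emplace_back(full_vectors.at(vid), neighbors.at(vid));
        }
        return out;
    }
};

// One exported record: everything a downstream tool might want per point.
struct PointRecord {
    std::vector<float> full_vector;
    std::vector<uint32_t> neighbors;
    std::vector<uint8_t> pq_code;
};

// Walks every point and gathers its neighborhood, full vector, and PQ code.
std::vector<PointRecord> export_all(const ToyPQIndex& index) {
    std::vector<PointRecord> out;
    out.reserve(index.get_num_points());
    for (uint64_t vid = 0; vid < index.get_num_points(); ++vid) {
        auto node = index.read_nodes({static_cast<uint32_t>(vid)}).front();
        out.push_back(PointRecord{node.first, node.second, index.get_pq_vector(vid)});
    }
    return out;
}

// Builds a three-point toy index with 2-D vectors and 2-byte PQ codes.
ToyPQIndex make_toy_index() {
    ToyPQIndex idx;
    idx.num_chunks = 2;
    idx.full_vectors = {{0.0f, 0.0f}, {1.0f, 0.0f}, {0.0f, 1.0f}};
    idx.neighbors = {{1, 2}, {0}, {0, 1}};
    idx.pq_codes = {10, 11, 20, 21, 30, 31};
    return idx;
}
```

A conversion tool could call something like `export_all` once and then serialize each `PointRecord` into whatever format the target application expects. The real index reads from disk, so a practical version would likely batch the reads rather than fetch one node at a time.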
Any other comments?