Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors

This is the code for the current state-of-the-art billion-scale nearest neighbor search system presented in the paper:

Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors,
Dmitry Baranchuk, Artem Babenko, Yury Malkov

The code is built on top of the FAISS library.

Build

Currently we provide only a C++ implementation for the CPU, which requires a BLAS library.

The code requires a C++ compiler that understands:

- the Intel intrinsics for SSE instructions
- the GCC intrinsic for the popcount instruction
- basic OpenMP
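A quick way to check the three requirements above is to compile a tiny translation unit that touches each of them. The following is a minimal sketch, not part of the repository: the function names add4, popcount64 and parallel_sum are hypothetical, and the preprocessor guards are an assumption added so that the file still compiles (with scalar fallbacks) on toolchains that lack a feature.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>  // Intel SSE intrinsics
#endif

// Element-wise sum of two 4-float arrays, using SSE when the compiler enables it.
inline void add4(const float* x, const float* y, float* out) {
    assert(x != nullptr && y != nullptr && out != nullptr);
#if defined(__SSE__)
    __m128 a = _mm_loadu_ps(x);           // unaligned load of 4 floats
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(a, b)); // 4 additions in one instruction
#else
    for (int i = 0; i < 4; ++i) out[i] = x[i] + y[i];  // scalar fallback
#endif
}

// Number of set bits in a 64-bit word, using the GCC/Clang builtin when available.
inline int popcount64(std::uint64_t v) {
#if defined(__GNUC__)
    return __builtin_popcountll(v);
#else
    int n = 0;
    for (; v != 0; v &= v - 1) ++n;       // clear the lowest set bit each step
    return n;
#endif
}

// Sum of a vector; the pragma is ignored when OpenMP is not enabled.
inline long long parallel_sum(const std::vector<int>& v) {
    long long total = 0;
    const long long n = static_cast<long long>(v.size());
    #pragma omp parallel for reduction(+:total)
    for (long long i = 0; i < n; ++i) total += v[i];
    return total;
}
```

With GCC, compiling this file with -fopenmp enables the OpenMP pragma; if the file builds and the three functions return the expected values, the compiler most likely meets the requirements listed above.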

Installation instructions

Clone repository

git clone https://github.com/dbaranchuk/ivf-hnsw --recursive

Configure FAISS

There are several example makefile.inc files in the faiss/example_makefiles/ subdirectory. Copy the one that matches your system to faiss/ and adjust it to your needs. In particular, for the ivf-hnsw project you need to set the correct BLAS library paths. The troubleshooting section of the FAISS wiki also has hints for specific configurations.

Replace FAISS CMakeLists.txt

Replace faiss/CMakeLists.txt with CMakeLists.txt.faiss to disable building the unnecessary tests and the GPU version.

mv CMakeLists.txt.faiss faiss/CMakeLists.txt

Build project

cmake . && make

Data

The proposed methods are tested on two datasets of one billion vectors each: SIFT1B and DEEP1B. To use the provided examples, all data files must be in data/SIFT1B and data/DEEP1B.

Data files:

SIFT1B:
- dataset, Datasets for approximate nearest neighbor search
cd data/SIFT1B && bash load_sift1b.sh
- learned 993127 centroids, GoogleDrive
- precomputed indices of assigned base points, GoogleDrive
DEEP1B:
- dataset, YandexDrive
cd data/DEEP1B && python load_deep1b.py
- learned 999973 centroids, GoogleDrive
- precomputed indices of assigned base points, GoogleDrive
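To sanity-check a downloaded file before launching a long run, it can help to read the first few vectors. The sketch below assumes the widely used .bvecs/.fvecs layout (each record is a 4-byte little-endian int32 dimension followed by that many uint8 or float components) and a little-endian host. That the files listed above use this layout is an assumption, and read_vecs is a hypothetical helper, not a function from this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Read up to max_vectors records from a stream in the assumed *vecs layout.
// Use T = uint8_t for .bvecs and T = float for .fvecs.
template <typename T>
std::vector<std::vector<T>> read_vecs(std::istream& in, std::size_t max_vectors) {
    std::vector<std::vector<T>> out;
    std::int32_t d = 0;
    // Each record starts with its own dimension header.
    while (out.size() < max_vectors &&
           in.read(reinterpret_cast<char*>(&d), sizeof(d))) {
        if (d <= 0) throw std::runtime_error("bad dimension header");
        std::vector<T> v(static_cast<std::size_t>(d));
        if (!in.read(reinterpret_cast<char*>(v.data()),
                     static_cast<std::streamsize>(sizeof(T) * v.size())))
            throw std::runtime_error("truncated record");
        out.push_back(std::move(v));
    }
    return out;
}
```

For a file on disk, open it with std::ifstream in std::ios::binary mode and pass the stream to read_vecs; checking that every returned vector has the expected dimension (128 for SIFT descriptors) is a cheap way to catch a truncated or mis-typed download.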

Note: the precomputed indices are optional. They only let you skip the assignment step, which takes about 2-3 days for 2^20 centroids.
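For intuition about what the assignment step computes, here is a brute-force sketch: every base point is mapped to the index of its nearest centroid under squared L2 distance. This is an illustration only, not the repository's implementation; assign_to_nearest and l2_sqr are hypothetical names. Done this naively, 10^9 points against 2^20 centroids would need roughly 10^15 distance computations, which gives a sense of why the step is expensive and why precomputed indices are offered.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance between two d-dimensional vectors.
inline float l2_sqr(const float* a, const float* b, std::size_t d) {
    float s = 0.0f;
    for (std::size_t j = 0; j < d; ++j) {
        const float diff = a[j] - b[j];
        s += diff * diff;
    }
    return s;
}

// points: n*d floats (row-major), centroids: k*d floats (row-major).
// Returns, for each point, the index of its nearest centroid.
std::vector<std::size_t> assign_to_nearest(const std::vector<float>& points,
                                           const std::vector<float>& centroids,
                                           std::size_t d) {
    assert(d > 0 && points.size() % d == 0 && centroids.size() % d == 0);
    const std::size_t n = points.size() / d;
    const std::size_t k = centroids.size() / d;
    assert(k > 0);
    std::vector<std::size_t> assignment(n);
    const long long n_signed = static_cast<long long>(n);
    // Points are independent, so the outer loop parallelizes trivially.
    #pragma omp parallel for
    for (long long i = 0; i < n_signed; ++i) {
        float best = std::numeric_limits<float>::max();
        std::size_t best_id = 0;
        for (std::size_t c = 0; c < k; ++c) {
            const float dist = l2_sqr(&points[i * d], &centroids[c * d], d);
            if (dist < best) { best = dist; best_id = c; }
        }
        assignment[i] = best_id;
    }
    return assignment;
}
```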

Run

tests/ provides two tests for each dataset:

- IVFADC
- IVFADC + Grouping (+ Pruning)
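As background for these tests, the "ADC" in IVFADC stands for asymmetric distance computation as it is usually described in the product-quantization literature: database vectors are stored as short codes, the query stays uncompressed, and distances are approximated by summing entries of a per-query lookup table. The sketch below illustrates that idea under stated assumptions; the PQ struct, build_table and adc_distance are hypothetical names, and the repository's own classes may organize this differently. In an inverted-index setting the vector passed as query would typically be the residual of the query with respect to the centroid of the probed list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A product quantizer with M sub-spaces, K centroids per sub-space,
// and dsub dimensions per sub-space. Layout: codebooks[(m*K + k)*dsub + j].
struct PQ {
    std::size_t M, K, dsub;
    std::vector<float> codebooks;
};

// Precompute squared distances from each query sub-vector to every
// sub-centroid. The result has M*K entries: table[m*K + k].
std::vector<float> build_table(const PQ& pq, const float* query) {
    assert(pq.codebooks.size() == pq.M * pq.K * pq.dsub);
    std::vector<float> table(pq.M * pq.K, 0.0f);
    for (std::size_t m = 0; m < pq.M; ++m) {
        for (std::size_t k = 0; k < pq.K; ++k) {
            const float* c = &pq.codebooks[(m * pq.K + k) * pq.dsub];
            float s = 0.0f;
            for (std::size_t j = 0; j < pq.dsub; ++j) {
                const float diff = query[m * pq.dsub + j] - c[j];
                s += diff * diff;
            }
            table[m * pq.K + k] = s;
        }
    }
    return table;
}

// Approximate squared distance between the query and one encoded vector:
// one table lookup per sub-space, with no decoding of the code.
float adc_distance(const PQ& pq, const std::vector<float>& table,
                   const std::uint8_t* code) {
    assert(pq.K <= 256 && table.size() == pq.M * pq.K);
    float s = 0.0f;
    for (std::size_t m = 0; m < pq.M; ++m) s += table[m * pq.K + code[m]];
    return s;
}
```

The table is built once per query (or once per probed list when residuals are used) and then reused for every code in the list, so the per-vector cost is M lookups and additions regardless of the original dimension.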

Each test takes many options, so we provide bash scripts in examples/ that run these tests. The scripts are commented, and the Parser class provides a short description of each option.

Make sure that:

models/SIFT1B/ and models/DEEP1B/ exist

mkdir -p models/SIFT1B models/DEEP1B

the data is placed in data/SIFT1B/ and data/DEEP1B/ respectively (or symlinked there)

Then run, for example:

bash examples/run_deep1b_grouping.sh

Documentation

The Doxygen documentation gives per-class information.

