Add XTR Support (#27)
This PR adds preliminary support for running XTR models in LintDB.

- The tokenizer runs SentencePiece and appends an EOS token for XTR.
- Inverted lists now index codes per token for XTR, so XTR makes one fewer database call when scoring.
- The ProductEncoder handles quantization, with the help of new inverted-list scanners and distance tables.

This touches a lot of code in order to find better abstractions.
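The codes-per-token change can be illustrated with a small sketch (the class and field names below are hypothetical, not LintDB's actual types): each posting stores the token's quantization code next to the document id, so a single inverted-list scan returns everything needed to score, instead of a second lookup to fetch codes.

```python
from collections import defaultdict

# Hypothetical sketch of a codes-per-token inverted list. Each posting
# carries the token's quantization code alongside the doc id, so scoring
# needs one database call rather than two.
class InvertedList:
    def __init__(self):
        # centroid id -> list of (doc_id, token_id, code) postings
        self.postings = defaultdict(list)

    def add(self, centroid_id, doc_id, token_id, code):
        self.postings[centroid_id].append((doc_id, token_id, code))

    def scan(self, centroid_id):
        # One scan returns both membership and the codes needed to score.
        return self.postings[centroid_id]

ivl = InvertedList()
ivl.add(centroid_id=7, doc_id=0, token_id=3, code=b"\x12\x34")
ivl.add(centroid_id=7, doc_id=1, token_id=0, code=b"\x56\x78")
postings = ivl.scan(7)
```

The trade-off is larger posting entries in exchange for fewer round trips at query time.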
mtbarta committed Jun 11, 2024
1 parent 6bc71d3 commit f9b3364
Showing 92 changed files with 3,283 additions and 1,133 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@
 debug/
 target/
 assets/
+cmake-build-debug/
 
 .DS_Store
 # Remove Cargo.lock from gitignore if creating an executable, leave it for libraries
8 changes: 8 additions & 0 deletions .idea/.gitignore
1 change: 1 addition & 0 deletions .idea/.name
2 changes: 2 additions & 0 deletions .idea/LintDB.iml
7 changes: 7 additions & 0 deletions .idea/codeStyles/Project.xml
5 changes: 5 additions & 0 deletions .idea/codeStyles/codeStyleConfig.xml
6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml
7 changes: 7 additions & 0 deletions .idea/misc.xml
8 changes: 8 additions & 0 deletions .idea/modules.xml
10 changes: 10 additions & 0 deletions .idea/vcs.xml

15 changes: 9 additions & 6 deletions Makefile
@@ -21,7 +21,7 @@ build-python-mac:
 	cd builds/python/lintdb/python && python setup.py build
 
 test:
-	cd builds/debug && cmake -E env GLOG_logtostderr=1 MKL_THREADING_LAYER=GNU ctest --output-on-failure
+	cd builds/debug && cmake -E env GLOG_v=5 GLOG_logtostderr=1 MKL_THREADING_LAYER=GNU ctest --output-on-failure
 
 test-python: build-python
 	# had to fix up conda to make this work--
@@ -44,10 +44,10 @@ format:
 
 valgrind:
 	# we need valgrind-3.20 to process dwarf5
-	valgrind -s --trace-children=yes --track-origins=yes --keep-stacktraces=alloc-and-free --suppressions=debug/valgrind-python.supp env PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" python benchmarks/bench_lintdb.py --index-path=experiments/py_index_bench_colbert-lifestyle-2024-04-03
+	valgrind -s --trace-children=yes --track-origins=yes --keep-stacktraces=alloc-and-free --suppressions=debug/valgrind-python.supp env PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" python benchmarks/bench_lintdb.py --index-path=experiments/py_index_bench_test-collection-xtr
 
-callgrind: build-conda
-	OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=6 OMP_NUM_THREADS=6 PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" valgrind --tool=callgrind --suppressions=debug/valgrind-python.supp --instr-atstart=yes --dump-instr=yes --collect-jumps=yes python ./benchmarks/bench_lintdb.py
+callgrind:
+	OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=6 OMP_NUM_THREADS=6 PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" valgrind --tool=callgrind --suppressions=debug/valgrind-python.supp --instr-atstart=yes --dump-instr=yes --collect-jumps=yes python ./benchmarks/bench_lintdb.py single-search
 
 callgrind-colbert: build-conda
 	PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" valgrind --tool=callgrind --suppressions=debug/valgrind-python.supp --instr-atstart=no --dump-instr=yes --collect-jumps=yes python ./benchmarks/run_colbert.py
@@ -73,6 +73,9 @@ build-conda:
 	-DBUILD_TESTING=OFF \
 	-DCMAKE_BUILD_TYPE=Release \
 	-DBLA_VENDOR=Intel10_64lp \
+	-DOpenMP_CXX_FLAGS=-fopenmp=libiomp5 \
+	-DOpenMP_CXX_LIB_NAMES=libiomp5 \
+	-DOpenMP_libiomp5_LIBRARY=${ROOT_DIR}/_build_python_/vcpkg_installed/x64-linux/lib/intel64/libiomp5.so \
 	.
 
 	cmake --build _build_python_${PY_VER} --target pylintdb -j12
@@ -92,7 +95,7 @@ build-benchmarks:
 	.
 	CC=gcc CXX=g++ CMAKE_C_COMPILER=gcc CMAKE_CXX_COMPILER=g++ cmake --build build_benchmarks --target=bench_lintdb -j12
 
-run-perf: build-conda
+run-perf:
 	# make sure your system allows perf to run. ex: sudo sysctl -w kernel.perf_event_paranoid=1
-	OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=12 OMP_NUM_THREADS=6 PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" perf record -g -- /home/matt/miniconda3/envs/lintdb-benchmark/bin/python -X perf benchmarks/bench_lintdb.py
+	OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=12 OMP_NUM_THREADS=6 PYTHONPATH="builds/python/lintdb/python/build/lib/lintdb" perf record -g -- /home/matt/miniconda3/envs/lintdb-benchmark/bin/python -X perf benchmarks/run_lintdb.py
 	perf script | ./debug/stackcollapse-perf.pl | ./debug/flamegraph.pl > perf.data.svg
4 changes: 2 additions & 2 deletions benchmarks/bench_lintdb.cpp
@@ -7,8 +7,8 @@
 #include "lintdb/index_builder/Tokenizer.h"
 
 static void BM_lintdb_search(benchmark::State& state) {
-    // std::string path = "/mnt/data/py_index_bench_colbert-lifestyle-2024-04-16-pq";
-    std::string path = "experiments/py_index_bench_colbert-lifestyle-2024-04-03";
+    std::string path = "experiments/py_index_bench_test-collection-xtr";
+    // std::string path = "experiments/py_index_bench_colbert-lifestyle-2024-04-03";
     lintdb::IndexIVF index(path);
     for (auto _ : state) {
         state.PauseTiming();
12 changes: 6 additions & 6 deletions benchmarks/bench_lintdb.py
@@ -25,7 +25,7 @@ def callgrind_dump_stats(path:str):
 app = typer.Typer()
 
 @app.command()
-def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkpoint:str='colbert-ir/colbertv2.0', index_path:str='experiments/py_index_bench_colbert-lifestyle-2024-04-03'):
+def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkpoint:str='colbert-ir/colbertv2.0', index_path:str='experiments/py_index_bench_test-collection-xtr'):
     latencies = []
     memory = []
 
@@ -38,8 +38,8 @@ def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkp
         converted = embeddings
 
         start = time.perf_counter()
-        # if profile:
-        #     callgrind_start_instrumentation()
+        if profile:
+            callgrind_start_instrumentation()
         opts = ldb.SearchOptions()
         results = index.search(
             0,
@@ -49,9 +49,9 @@
             opts
         )
         latencies.append((time.perf_counter() - start)*1000)
-        # if profile:
-        #     callgrind_stop_instrumentation()
-        #     callgrind_dump_stats("callgrind.out.single_search")
+        if profile:
+            callgrind_stop_instrumentation()
+            callgrind_dump_stats("callgrind.out.single_search")
         memory.append(get_memory_usage())
         rankings[id] = [x.id for x in results]
         count+=1
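The measure-then-profile loop this diff enables in bench_lintdb.py follows a common pattern: time each query, and optionally toggle callgrind instrumentation around the timed region. Here is a self-contained sketch of that pattern with the callgrind hooks stubbed out (the real hooks only take effect when running under valgrind):

```python
import time

# Stubs standing in for the callgrind hooks used in bench_lintdb.py;
# under valgrind these would start/stop instruction-level instrumentation.
def callgrind_start_instrumentation(): pass
def callgrind_stop_instrumentation(): pass
def callgrind_dump_stats(path): pass

def timed_search(search_fn, queries, profile=False):
    latencies = []
    for q in queries:
        start = time.perf_counter()
        if profile:
            callgrind_start_instrumentation()
        search_fn(q)
        # Record latency in milliseconds, as the benchmark does.
        latencies.append((time.perf_counter() - start) * 1000)
        if profile:
            callgrind_stop_instrumentation()
            callgrind_dump_stats("callgrind.out.single_search")
    return latencies

# Toy stand-in for index.search: any callable works here.
lat = timed_search(lambda q: sum(q), [[1, 2], [3, 4]], profile=True)
```

Keeping the instrumentation toggles inside the timed loop means the profile covers only search work, not setup or result handling.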
22 changes: 15 additions & 7 deletions benchmarks/lotte/common.py
@@ -61,23 +61,24 @@ def colbert_indexing(experiment: str, exp_path: str, dataset: LoTTeDataset, nbit
def lintdb_search(
experiment: str,
exp_path: str,
dataset:LoTTeDataset,
k,
nbits=2,
dataset:LoTTeDataset,
checkpoint: str = "colbert-ir/colbertv2.0",
reuse_centroids=True,
use_compression=False,
failures={}):
failures={},
use_xtr: bool = False,
):
# let's get the same model.
config = ColBERTConfig.load_from_checkpoint(checkpoint)
config.kmeans_niters=4
config.ncells = 2
config.ndocs=1024
config.centroid_score_threshold=.45

from colbert.modeling.checkpoint import Checkpoint
from colbert import Searcher
checkpoint = Checkpoint(checkpoint, config)
if not use_xtr:
from colbert.modeling.checkpoint import Checkpoint
from colbert import Searcher
checkpoint = Checkpoint(checkpoint, config)

index_path = f"{exp_path}/py_index_bench_{experiment}"
if not os.path.exists(index_path):
@@ -96,6 +97,7 @@ def lintdb_search(
failure_ids=set()
if failures:
failure_ids = set(failures.keys())
count=0
for id, query in zip(dataset.qids, dataset.queries):
if failures and id not in failure_ids:
continue
@@ -126,12 +128,18 @@
opts
)
else:
opts = ldb.SearchOptions()
opts.k_top_centroids = 1000
results = index.search(
0,
converted,
64, # nprobe
100, # k to return
opts
)
count+=1
# if count == 2:
# return
for rank, result in enumerate(results):
# qid, pid, rank
f.write(f"{id}\t{result.id}\t{rank+1}\t{result.score}\n")
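The distance tables mentioned in the PR description are the standard product-quantization scoring trick. The sketch below is illustrative only (function names are hypothetical, not LintDB's ProductEncoder API): split the query into sub-vectors, precompute each sub-vector's distance to every codebook centroid, then score any stored code with table lookups and additions.

```python
# Illustrative asymmetric-distance computation (ADC) with a distance table.
# codebooks[m][c] is centroid c of sub-space m; a stored code picks one
# centroid per sub-space.
def build_distance_table(query, codebooks):
    table = []
    sub = len(query) // len(codebooks)
    for m, centroids in enumerate(codebooks):
        q_sub = query[m * sub:(m + 1) * sub]
        # Squared L2 distance from the query sub-vector to each centroid.
        table.append([
            sum((a - b) ** 2 for a, b in zip(q_sub, centroid))
            for centroid in centroids
        ])
    return table

def adc_distance(table, code):
    # Scoring a code is just one lookup-and-add per sub-space.
    return sum(table[m][c] for m, c in enumerate(code))

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # sub-space 0: two centroids
    [[0.0, 1.0], [1.0, 0.0]],  # sub-space 1: two centroids
]
table = build_distance_table([1.0, 1.0, 0.0, 1.0], codebooks)
d = adc_distance(table, [1, 0])  # code = (centroid 1, centroid 0) -> 0.0
```

Because the table is built once per query, scanning many codes per inverted list stays cheap, which is what makes the codes-per-token layout practical.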