Add XTR Support #27

Merged
merged 6 commits on Jun 11, 2024

1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@
debug/
target/
assets/
cmake-build-debug/

.DS_Store
# Remove Cargo.lock from gitignore if creating an executable, leave it for libraries
8 changes: 8 additions & 0 deletions .idea/.gitignore
(generated file; not rendered by default)

1 change: 1 addition & 0 deletions .idea/.name
(generated file; not rendered by default)

2 changes: 2 additions & 0 deletions .idea/LintDB.iml
(generated file; not rendered by default)

7 changes: 7 additions & 0 deletions .idea/codeStyles/Project.xml
(generated file; not rendered by default)

5 changes: 5 additions & 0 deletions .idea/codeStyles/codeStyleConfig.xml
(generated file; not rendered by default)

6 changes: 6 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml
(generated file; not rendered by default)

7 changes: 7 additions & 0 deletions .idea/misc.xml
(generated file; not rendered by default)

8 changes: 8 additions & 0 deletions .idea/modules.xml
(generated file; not rendered by default)

10 changes: 10 additions & 0 deletions .idea/vcs.xml
(generated file; not rendered by default)

15 changes: 9 additions & 6 deletions Makefile
@@ -21,7 +21,7 @@ build-python-mac:
cd builds/python/lintdb/python && python setup.py build

test:
cd builds/debug && cmake -E env GLOG_logtostderr=1 MKL_THREADING_LAYER=GNU ctest --output-on-failure
cd builds/debug && cmake -E env GLOG_v=5 GLOG_logtostderr=1 MKL_THREADING_LAYER=GNU ctest --output-on-failure

test-python: build-python
# had to fix up conda to make this work--
@@ -44,10 +44,10 @@ format:

valgrind:
# we need valgrind-3.20 to process dwarf5
valgrind -s --trace-children=yes --track-origins=yes --keep-stacktraces=alloc-and-free --suppressions=debug/valgrind-python.supp env PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" python benchmarks/bench_lintdb.py --index-path=experiments/py_index_bench_colbert-lifestyle-2024-04-03
valgrind -s --trace-children=yes --track-origins=yes --keep-stacktraces=alloc-and-free --suppressions=debug/valgrind-python.supp env PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" python benchmarks/bench_lintdb.py --index-path=experiments/py_index_bench_test-collection-xtr

callgrind: build-conda
OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=6 OMP_NUM_THREADS=6 PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" valgrind --tool=callgrind --suppressions=debug/valgrind-python.supp --instr-atstart=yes --dump-instr=yes --collect-jumps=yes python ./benchmarks/bench_lintdb.py
callgrind:
OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=6 OMP_NUM_THREADS=6 PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" valgrind --tool=callgrind --suppressions=debug/valgrind-python.supp --instr-atstart=yes --dump-instr=yes --collect-jumps=yes python ./benchmarks/bench_lintdb.py single-search

callgrind-colbert: build-conda
PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" valgrind --tool=callgrind --suppressions=debug/valgrind-python.supp --instr-atstart=no --dump-instr=yes --collect-jumps=yes python ./benchmarks/run_colbert.py
@@ -73,6 +73,9 @@ build-conda:
-DBUILD_TESTING=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DBLA_VENDOR=Intel10_64lp \
-DOpenMP_CXX_FLAGS=-fopenmp=libiomp5 \
-DOpenMP_CXX_LIB_NAMES=libiomp5 \
-DOpenMP_libiomp5_LIBRARY=${ROOT_DIR}/_build_python_/vcpkg_installed/x64-linux/lib/intel64/libiomp5.so \
.

cmake --build _build_python_${PY_VER} --target pylintdb -j12
@@ -92,7 +95,7 @@ build-benchmarks:
.
CC=gcc CXX=g++ CMAKE_C_COMPILER=gcc CMAKE_CXX_COMPILER=g++ cmake --build build_benchmarks --target=bench_lintdb -j12

run-perf: build-conda
run-perf:
# make sure your system allows perf to run. ex: sudo sysctl -w kernel.perf_event_paranoid=1
OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=12 OMP_NUM_THREADS=6 PYTHONPATH="_build_python_/lintdb/python/build/lib/lintdb" perf record -g -- /home/matt/miniconda3/envs/lintdb-benchmark/bin/python -X perf benchmarks/bench_lintdb.py
OMP_MAX_ACTIVE_LEVELS=2 OMP_THREAD_LIMIT=12 OMP_NUM_THREADS=6 PYTHONPATH="builds/python/lintdb/python/build/lib/lintdb" perf record -g -- /home/matt/miniconda3/envs/lintdb-benchmark/bin/python -X perf benchmarks/run_lintdb.py
perf script | ./debug/stackcollapse-perf.pl | ./debug/flamegraph.pl > perf.data.svg
4 changes: 2 additions & 2 deletions benchmarks/bench_lintdb.cpp
@@ -7,8 +7,8 @@
#include "lintdb/index_builder/Tokenizer.h"

static void BM_lintdb_search(benchmark::State& state) {
// std::string path = "/mnt/data/py_index_bench_colbert-lifestyle-2024-04-16-pq";
std::string path = "experiments/py_index_bench_colbert-lifestyle-2024-04-03";
std::string path = "experiments/py_index_bench_test-collection-xtr";
// std::string path = "experiments/py_index_bench_colbert-lifestyle-2024-04-03";
lintdb::IndexIVF index(path);
for (auto _ : state) {
state.PauseTiming();
12 changes: 6 additions & 6 deletions benchmarks/bench_lintdb.py
@@ -25,7 +25,7 @@ def callgrind_dump_stats(path:str):
app = typer.Typer()

@app.command()
def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkpoint:str='colbert-ir/colbertv2.0', index_path:str='experiments/py_index_bench_colbert-lifestyle-2024-04-03'):
def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkpoint:str='colbert-ir/colbertv2.0', index_path:str='experiments/py_index_bench_test-collection-xtr'):
latencies = []
memory = []

@@ -38,8 +38,8 @@ def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkp
converted = embeddings

start = time.perf_counter()
# if profile:
# callgrind_start_instrumentation()
if profile:
callgrind_start_instrumentation()
opts = ldb.SearchOptions()
results = index.search(
0,
@@ -49,9 +49,9 @@ def single_search(dataset:str='lifestyle', split:str='dev',profile=False, checkp
opts
)
latencies.append((time.perf_counter() - start)*1000)
# if profile:
# callgrind_stop_instrumentation()
# callgrind_dump_stats("callgrind.out.single_search")
if profile:
callgrind_stop_instrumentation()
callgrind_dump_stats("callgrind.out.single_search")
memory.append(get_memory_usage())
rankings[id] = [x.id for x in results]
count+=1
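Note: the callgrind_start_instrumentation, callgrind_stop_instrumentation, and callgrind_dump_stats helpers re-enabled above are defined near the top of bench_lintdb.py and are not shown in this diff. A minimal sketch of such helpers, assuming they simply shell out to valgrind's callgrind_control tool (an illustration under that assumption, not necessarily the file's actual implementation):

    import subprocess

    def callgrind_start_instrumentation():
        # Turn instrumentation on for the callgrind session profiling this process.
        subprocess.run(["callgrind_control", "--instr=on"], check=False)

    def callgrind_stop_instrumentation():
        # Turn instrumentation back off so post-search work is not profiled.
        subprocess.run(["callgrind_control", "--instr=off"], check=False)

    def callgrind_dump_stats(path: str):
        # Ask callgrind to dump the counters collected so far, tagged with a
        # reason string such as "callgrind.out.single_search".
        subprocess.run(["callgrind_control", f"--dump={path}"], check=False)

Whatever their exact implementation, these calls only take effect when the benchmark runs under valgrind --tool=callgrind, as in the Makefile's callgrind target above.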
22 changes: 15 additions & 7 deletions benchmarks/lotte/common.py
@@ -61,23 +61,24 @@ def colbert_indexing(experiment: str, exp_path: str, dataset: LoTTeDataset, nbit
def lintdb_search(
experiment: str,
exp_path: str,
dataset:LoTTeDataset,
k,
nbits=2,
dataset:LoTTeDataset,
checkpoint: str = "colbert-ir/colbertv2.0",
reuse_centroids=True,
use_compression=False,
failures={}):
failures={},
use_xtr: bool = False,
):
# let's get the same model.
config = ColBERTConfig.load_from_checkpoint(checkpoint)
config.kmeans_niters=4
config.ncells = 2
config.ndocs=1024
config.centroid_score_threshold=.45

from colbert.modeling.checkpoint import Checkpoint
from colbert import Searcher
checkpoint = Checkpoint(checkpoint, config)
if not use_xtr:
from colbert.modeling.checkpoint import Checkpoint
from colbert import Searcher
checkpoint = Checkpoint(checkpoint, config)

index_path = f"{exp_path}/py_index_bench_{experiment}"
if not os.path.exists(index_path):
@@ -96,6 +97,7 @@ def lintdb_search(
failure_ids=set()
if failures:
failure_ids = set(failures.keys())
count=0
for id, query in zip(dataset.qids, dataset.queries):
if failures and id not in failure_ids:
continue
@@ -126,12 +128,18 @@ def lintdb_search(
opts
)
else:
opts = ldb.SearchOptions()
opts.k_top_centroids = 1000
results = index.search(
0,
converted,
64, # nprobe
100, # k to return
opts
)
count+=1
# if count == 2:
# return
for rank, result in enumerate(results):
# qid, pid, rank
f.write(f"{id}\t{result.id}\t{rank+1}\t{result.score}\n")
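For reference, the XTR search path exercised by the else branch above reduces to roughly the following sequence. The ldb.SearchOptions, k_top_centroids, and index.search calls are taken from this diff; the IndexIVF constructor name follows the C++ benchmark, while the import alias and the placeholder embeddings are assumptions made for the sketch:

    import numpy as np
    import lintdb as ldb  # assumed import; the benchmark scripts refer to the bindings as `ldb`

    # Placeholder for the per-token query embeddings that the benchmarks compute
    # upstream (the `converted` variable in single_search / lintdb_search).
    converted = np.random.rand(32, 128).astype("float32")

    index = ldb.IndexIVF("experiments/py_index_bench_test-collection-xtr")

    opts = ldb.SearchOptions()
    opts.k_top_centroids = 1000  # widen the per-token centroid candidate pool

    results = index.search(
        0,          # first argument, always 0 in these benchmarks
        converted,
        64,         # nprobe
        100,        # k to return
        opts,
    )
    ranking = [(r.id, r.score) for r in results]

Raising k_top_centroids presumably trades latency for better token-level recall, which matters for XTR-style scoring that only sees the tokens actually retrieved.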