This is the GitHub repository for our paper *Quam: Adaptive Retrieval through Query Affinity Modelling*, accepted at the 18th ACM International Conference on Web Search and Data Mining (WSDM 2025), Hannover, Germany, 10-14 March 2025.
All dependencies are listed in the requirements.txt file. The main ones can be installed as follows:
```bash
pip install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
pip install --upgrade git+https://github.com/terrierteam/pyterrier_adaptive.git
pip install pyterrier_pisa==0.0.6
```
```
├── data/
│   ├── laff_train_data/
├── laff_model/
├── saved_pyterrier_runs/
│   ├── gbm25/
│   ├── gtcthnp/
├── base_models.py
├── dataset_utils.py
├── gar_aff.py
├── laff.py
├── pyterrier_quam.py
├── quam_main_fig.jpg
├── requirements.txt
├── run.py
├── train_laff.py
└── README.md
```
We use the same corpus graph as the GAR paper, and we release our affinity graph based on laff scores. For instance, the BM25-based corpus graph and affinity graph can be downloaded using:
```python
import pyterrier_alpha as pta

corpus_graph = pta.Artifact.from_hf('macavaney/msmarco-passage.corpusgraph.bm25.128')
laff_graph = pta.Artifact.from_hf('macavaney/msmarco-passage.corpusgraph.bm25.128.laff')
```
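Both artifacts are corpus graphs in the pyterrier_adaptive format. Here is a minimal sketch of how they can be inspected, assuming the pyterrier_adaptive `CorpusGraph` API (`to_limit_k` and `neighbours`; exact signatures may vary by version):

```python
# Sketch, assuming the pyterrier_adaptive CorpusGraph API:
# trim the graph to the 16 strongest edges per document, then look up
# the neighbours of one passage by its docno.
graph16 = laff_graph.to_limit_k(16)
print(graph16.neighbours('0'))  # docnos of the neighbours of passage '0'
```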
If you want to generate your own training dataset and train the affinity model, the training data can be created using the laff.py file. Alternatively, we release the training dataset for the learnt affinity (laff) model on Hugging Face. The dataset consists of the following files:
- data-00000-of-00001.arrow
- dataset_info.json
- state.json
Please download all three files into the data/laff_train_data/ folder. The dataset can then be loaded as follows:
```python
import datasets

ds = datasets.load_from_disk("data/laff_train_data")
```
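To verify the download, the dataset can be inspected directly (a minimal sketch; the column names depend on the released files):

```python
# Quick sanity check of the loaded dataset.
print(ds)     # number of rows and column names
print(ds[0])  # first training example
```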
If you want to use the learnt affinity model for document-document similarity, you can train it using the train_laff.py file. Alternatively, we have released the model weights, which can be downloaded from Hugging Face. The corresponding file is:
- bert-base-laff.pth
It can also be downloaded with the following code:
```python
from huggingface_hub import hf_hub_download

file_path = hf_hub_download(repo_id="mandeep-rathee/laff-model", filename="bert-base-laff.pth")
```
The model can be loaded as follows:
```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from base_models import BinaryClassificationBertModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

base_model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(base_model_name)
base_model = BertForSequenceClassification.from_pretrained(base_model_name, num_labels=1, torch_dtype=torch.float16)

model = BinaryClassificationBertModel(base_model)
model.load_state_dict(torch.load(file_path, map_location=device))  # or the path to your saved model
model.to(device)
```
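Once loaded, the model scores document pairs. The following is a hypothetical usage sketch: the exact forward signature is defined by `BinaryClassificationBertModel` in base_models.py, and here we assume it accepts the usual BERT inputs and returns a single affinity logit per pair. The two passages are made-up examples.

```python
# Hypothetical sketch: score the affinity between two passages.
doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "A fast auburn fox leapt over a sleepy hound."

model.eval()
with torch.no_grad():
    # Encode the pair as a single [CLS] a [SEP] b [SEP] sequence.
    enc = tokenizer(doc_a, doc_b, truncation=True, return_tensors="pt").to(device)
    # Assumes the wrapper forwards the standard BERT keyword arguments.
    score = model(**enc)
print(score)
```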
Our results can be reproduced using the run.py file. Additionally, we provide the saved runs in the saved_pyterrier_runs/ folder.
We use the following combinations of the budget c and |S| = s:

| budget (c) | s |
|---|---|
| 50 | 10 |
| 100 | 30 |
| 1000 | 300 |
To reproduce the results for the BM25 retriever and corpus graph on TREC DL'19, run:

```bash
python3 run.py --budget 50 --s 10 --verbose --dl_type 19
```
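To sweep all (budget, s) combinations from the table above, a small driver script can wrap the same command (a hypothetical convenience sketch; run.py and its flags are from this repository):

```python
import subprocess

# Reproduce all (budget, s) combinations from the table for TREC DL'19.
for budget, s in [(50, 10), (100, 30), (1000, 300)]:
    subprocess.run(
        ["python3", "run.py", "--budget", str(budget), "--s", str(s),
         "--verbose", "--dl_type", "19"],
        check=True,  # stop if any run fails
    )
```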
If you use our work, please cite our paper:

```bibtex
@article{rathee2024quam,
  title={Quam: Adaptive Retrieval through Query Affinity Modelling},
  author={Rathee, Mandeep and MacAvaney, Sean and Anand, Avishek},
  journal={arXiv preprint arXiv:2410.20286},
  year={2024}
}
```