An easy-to-use python toolkit for flexibly adapting various neural ranking models to any target domain.

jingtaozhan/disentangled-retriever

Disentangled Neural Ranking

This is the official repo for our paper Disentangled Modeling of Domain and Relevance for Adaptable Dense Retrieval. Disentangled Neural Ranking is a novel paradigm that supports effective and flexible domain adaptation for neural ranking models including Dense Retrieval, uniCOIL, SPLADE, ColBERT, and BERT re-ranker.

Features

  • One command for unsupervised and effective domain adaption.
  • One command for effective few-shot domain adaption.
  • Various ranking architectures, including Dense Retrieval, uniCOIL, SPLADE, ColBERT, and BERT re-ranker.
  • Two source-domain finetuning methods, contrastive finetuning and distillation.
  • Huggingface-style training and inference, supporting multi-gpus, mixed precision, etc.

Quick Tour

Neural Ranking models are vulnerable to domain shift: the trained models may even perform worse than traditional retrieval methods like BM25 in out-of-domain scenarios.

In this work, we propose Disentangled Neural Ranking (DNR) to support effective and flexible domain adaptation. DNR consists of a Relevance Estimation Module (REM) for modeling domain-invariant matching patterns and several Domain Adaption Modules (DAMs) for modeling domain-specific features of multiple target corpora. DNR enables a flexible training paradigm in which REM is trained with supervision once and DAMs are trained with unsupervised data.

[Figure: Neural Ranking vs. Disentangled Neural Ranking]

The idea of DNR dates back to classic retrieval models of the pre-neural-ranking era. BM25 uses the same formula to estimate relevance scores across domains but measures word importance with corpus-specific IDF values. This separation does not exist in vanilla neural ranking models, where the abilities of relevance estimation and domain modeling are jointly learned during training and entangled within the model parameters.
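The BM25 analogy can be made concrete with a small sketch. This is not code from this repository: it is a minimal, self-contained BM25 variant (using a Lucene-style IDF, which is an assumption) on two made-up toy corpora. The scoring function plays the role of the domain-independent part, while the IDF table is the only corpus-specific part.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Corpus-specific part: inverse document frequency of every term."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc.split()))
    return {
        term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for term, df in doc_freq.items()
    }


def avg_doc_len(corpus):
    """Average document length in tokens (also a corpus statistic)."""
    return sum(len(doc.split()) for doc in corpus) / len(corpus)


def bm25_score(query, doc, idf, avgdl, k1=1.2, b=0.75):
    """Domain-independent part: the same formula is used for every corpus."""
    tf = Counter(doc.split())
    doc_len = len(doc.split())
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        norm_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avgdl))
        score += idf.get(term, 0.0) * norm_tf
    return score


# Two toy corpora standing in for two domains (made-up data).
medical = ["virus infection spreads", "virus vaccine trial", "virus mutation rate"]
news = ["stock market falls", "virus outbreak reported", "election results announced"]

idf_medical = build_idf(medical)
idf_news = build_idf(news)

doc = "virus outbreak reported"
score_with_medical_idf = bm25_score("virus", doc, idf_medical, avg_doc_len(medical))
score_with_news_idf = bm25_score("virus", doc, idf_news, avg_doc_len(news))
print(score_with_medical_idf, score_with_news_idf)
```

The word "virus" appears in every document of the toy medical corpus, so it carries little weight there, while it is rare and therefore informative in the toy news corpus. Swapping the IDF table adapts the ranker to a new corpus without touching the scoring formula, which is the property DNR aims to recover for neural rankers.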

Here are two examples of applying disentangled modeling to domain adaption. In each figure, the y-axis shows the relative improvement over BM25 and the x-axis shows the different out-of-domain test sets. The ranking performance of Dense Retrieval (DR) and Disentangled Dense Retrieval (DDR) is shown below.

[Figure panels: NDCG@10 and Recall@1000]

The ranking performance of ColBERT and Disentangled ColBERT (D-ColBERT) is shown below.

[Figure panels: NDCG@10 and Recall@1000]

Disentangled modeling brings substantial out-of-domain performance gains. More details are available in our paper.

Installation

This repo is developed with PyTorch and Faiss. Both should be installed manually because they require platform-specific configuration. In our development, we run the following commands for installation.

# XX.X is a placeholder for cudatoolkit version. It should be specified according to your environment
conda install pytorch torchvision torchaudio cudatoolkit=XX.X -c pytorch 
conda install -c conda-forge faiss-gpu

After that, you can install this package from source:

git clone https://github.com/jingtaozhan/disentangled-retriever
cd disentangled-retriever
pip install .

For development, use

pip install --editable .

Released Models

We release about 50 models to facilitate reproducibility and reuse. You do not have to download them manually; they are downloaded automatically at runtime.

  • Relevance Estimation Modules for Dense Retrieval
  • Relevance Estimation Modules for uniCOIL
  • Relevance Estimation Modules for SPLADE
  • Relevance Estimation Modules for ColBERT
  • Relevance Estimation Modules for BERT re-ranker
  • Domain Adaption Modules for various datasets

Besides Disentangled Neural Ranking models, we also release the vanilla/traditional neural ranking models, which are baselines in our paper.

Vanilla Neural Ranking Checkpoints:

  • Vanilla Dense Retrieval
  • Vanilla uniCOIL
  • Vanilla SPLADE
  • Vanilla ColBERT
  • Vanilla BERT re-ranker

*Note: Our code also supports training and evaluating vanilla neural ranking models!*

Example usage:

Here is an example of using disentangled dense retrieval for ranking. The REM is generic, while the DAM is trained for a specific domain to mitigate the domain shift. The two modules are assembled during inference.

from transformers import AutoConfig, AutoTokenizer
from disentangled_retriever.dense.modeling import AutoDenseModel

# This is the Relevance Estimation Module (REM) contrastively trained on MS MARCO
# It can be used in various English domains.
REM_URL = "https://huggingface.co/jingtao/REM-bert_base-dense-contrast-msmarco/resolve/main/lora192-pa4.zip"
## For example, we will apply the model to TREC-Covid dataset. 
# Here is the Domain Adaption Module for this dataset.
DAM_NAME = "jingtao/DAM-bert_base-mlm-msmarco-trec_covid"

## Load the modules
config = AutoConfig.from_pretrained(DAM_NAME)
config.similarity_metric, config.pooling = "ip", "average"
tokenizer = AutoTokenizer.from_pretrained(DAM_NAME, config=config)
model = AutoDenseModel.from_pretrained(DAM_NAME, config=config)
adapter_name = model.load_adapter(REM_URL)
model.set_active_adapters(adapter_name)
model.merge_lora(adapter_name)

## Let's try to compute the similarities
queries  = ["When will the COVID-19 pandemic end?", "What are the impacts of COVID-19 pandemic to society?"]
passages = ["It will end soon.", "It makes us care for each other."]
query_embeds = model(**tokenizer(queries, return_tensors="pt", padding=True, truncation=True, max_length=512))
passage_embeds = model(**tokenizer(passages, return_tensors="pt", padding=True, truncation=True, max_length=512))

print(query_embeds @ passage_embeds.T)

Results are:

tensor([[107.6821, 101.4270],
        [103.7373, 105.0448]], grad_fn=<MmBackward0>)
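
To read the result, note that rows correspond to queries and columns to passages, and that a higher inner product means a higher estimated relevance. The following minimal sketch uses plain Python and the values printed above (copied by hand rather than recomputed) to turn the score matrix into a ranking per query.

```python
# Similarity matrix copied from the output above:
# rows are queries, columns are passages.
scores = [
    [107.6821, 101.4270],
    [103.7373, 105.0448],
]
queries = [
    "When will the COVID-19 pandemic end?",
    "What are the impacts of COVID-19 pandemic to society?",
]
passages = ["It will end soon.", "It makes us care for each other."]

# For each query, sort passage indices by descending score.
rankings = [
    sorted(range(len(row)), key=lambda pi: row[pi], reverse=True)
    for row in scores
]

for query, ranking in zip(queries, rankings):
    print(query, "->", passages[ranking[0]])
```

Each query ranks its own matching passage first: the first query prefers the first passage and the second query prefers the second.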

Preparing datasets

We will use various datasets to show how disentangled modeling facilitates flexible domain adaption. Before the demonstration, please download and preprocess the corresponding datasets. Here we provide detailed instructions:

Zero-shot Domain Adaption

Suppose you already have a REM module (trained by yourself or provided by us) and you need to adapt the model to an unseen domain. To do this, just train a Domain Adaption Module (DAM) to mitigate the domain shift.

The training process is completely unsupervised and only requires the target-domain corpus. Each line of the corpus file should be formatted as `id doc`, with the two fields separated by a tab. Then you can train a DAM with only one command.

python -m torch.distributed.launch --nproc_per_node 4 \
    -m disentangled_retriever.adapt.run_adapt_with_mlm \
    --corpus_path ... ... ...
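
The corpus file expected by the command above can be produced with a few lines of Python. The sketch below is only an illustration of the tab-separated layout described earlier: the helper name `write_corpus_tsv`, the document ids, and the texts are hypothetical, and the whitespace cleanup is a precaution we assume is safe rather than a documented requirement of the toolkit.

```python
import os
import tempfile


def write_corpus_tsv(docs, path):
    """Write {doc_id: text} as one tab-separated line per document (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in docs.items():
            # Tabs and newlines inside the text would break the
            # one-line-per-document layout, so collapse all whitespace.
            clean = " ".join(text.split())
            f.write(f"{doc_id}\t{clean}\n")


# Made-up documents; one contains a tab and one contains a newline.
docs = {
    "d1": "First target-domain\tdocument.",
    "d2": "Second document,\nspanning two lines.",
}
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.tsv")
write_corpus_tsv(docs, corpus_path)

with open(corpus_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting path is what you would pass to `--corpus_path`.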

The trained DAM can be combined with any REM to form a well-performing neural ranking model, e.g., a Dense Retrieval model or a ColBERT model. We provide many REMs (see the Released Models section) that correspond to different ranking methods or are trained with different losses. For example, if you want to acquire a dense retrieval model, use the following command for inference:

python -m torch.distributed.launch --nproc_per_node 4 \
    -m disentangled_retriever.dense.evaluate.run_eval \
    --backbone_name_or_path [path-to-the-trained-DAM] \
    --adapter_name_or_path [path-to-the-dense-retrieval-rem] \
    --corpus_path ... --query_path ... ... ...

If you want to acquire a ColBERT model, use the following command for inference:

python -m torch.distributed.launch --nproc_per_node 4 \
    -m disentangled_retriever.colbert.evaluate.run_eval \
    --backbone_name_or_path [path-to-the-trained-DAM] \
    --adapter_name_or_path [path-to-the-colbert-rem] \
    --corpus_path ... --query_path ... ... ...

We give two adaption examples. They train a separate DAM in the target domain and re-use our released REMs.

Please try these examples before using our methods on your own datasets.

Few-shot Domain Adaption

Coming soon.

Learning Generic Relevance Estimation Ability

We have already released a number of Relevance Estimation Modules (REMs) for various kinds of ranking methods. You can directly adopt these public checkpoints. If you have private labeled data and want to train a Relevance Estimation Module (REM) on it, here are instructions on how to do so.

To use this codebase directly for training, convert your data to the following format:

  • corpus.tsv: corpus file. Each line is `docid doc`, separated by a tab.
  • query.train: training queries. Each line is `qid query`, separated by a tab.
  • qrels.train: annotations. Each line is `qid 0 docid rel_level`, separated by tabs.
  • [Optional] hard negative file for contrastive training: each line is `qid neg_docid1 neg_docid2 ...`. The qid and the neg_docids are separated by a tab; the neg_docids are separated from each other by spaces.
  • [Optional] soft labels for knowledge distillation: a pickle file containing a dict: {qid: {docid: score}}. It should contain the soft labels of positive pairs and of several negative pairs.
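
The formats listed above can be sketched as follows. This is a minimal illustration with made-up data, not the conversion script used for MS MARCO. The file names `corpus.tsv`, `query.train`, and `qrels.train` come from the list; `hard_negatives.tsv` and `soft_labels.pkl` are hypothetical names, and using string ids in the pickle is an assumption, since the expected id type is not stated here.

```python
import os
import pickle
import tempfile

out_dir = tempfile.mkdtemp()

# Made-up toy data.
corpus = {"d1": "relevant passage", "d2": "hard negative one", "d3": "hard negative two"}
queries = {"q1": "example query"}
qrels = [("q1", "d1", 1)]                      # (qid, docid, rel_level)
hard_negatives = {"q1": ["d2", "d3"]}          # qid -> negative docids
soft_labels = {"q1": {"d1": 9.3, "d2": 2.1, "d3": 0.4}}  # qid -> {docid: score}


def write_lines(name, lines):
    """Write the given lines to a file inside out_dir and return its path."""
    path = os.path.join(out_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return path


# corpus.tsv: docid <TAB> doc
write_lines("corpus.tsv", [f"{docid}\t{doc}" for docid, doc in corpus.items()])
# query.train: qid <TAB> query
write_lines("query.train", [f"{qid}\t{query}" for qid, query in queries.items()])
# qrels.train: qid <TAB> 0 <TAB> docid <TAB> rel_level
qrels_path = write_lines("qrels.train", [f"{q}\t0\t{d}\t{rel}" for q, d, rel in qrels])
# Hard negatives: qid <TAB> space-separated negative docids (hypothetical file name)
negs_path = write_lines(
    "hard_negatives.tsv",
    [f"{qid}\t{' '.join(negs)}" for qid, negs in hard_negatives.items()],
)
# Soft labels: pickled dict {qid: {docid: score}} (hypothetical file name)
soft_path = os.path.join(out_dir, "soft_labels.pkl")
with open(soft_path, "wb") as f:
    pickle.dump(soft_labels, f)

print(sorted(os.listdir(out_dir)))
```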

If you still have questions about the data formatting, you can check how we convert MS MARCO.

With the supervised data formatted, you can now train a REM. We use a disentangled finetuning trick: first train a DAM to capture domain-specific features, and then train the REM to learn domain-invariant matching patterns.

Here we provide instructions for training REMs for different ranking methods.

  • Train REM for Dense Retrieval: on English MS MARCO | on Chinese Dureader
  • Train REM for uniCOIL: [on English] [on Chinese] (coming soon)
  • Train REM for SPLADE: [on English] [on Chinese] (coming soon)
  • Train REM for ColBERT: [on English] [on Chinese] (coming soon)
  • Train REM for BERT re-ranker: [on English] [on Chinese] (coming soon)

Reproducing Results with Released Checkpoints

We provide commands for reproducing the various results in our paper.

Training Vanilla Neural Ranking Models

This codebase supports not only Disentangled Neural Ranking but also vanilla neural ranking models. You can reproduce state-of-the-art Dense Retrieval, uniCOIL, SPLADE, ColBERT, and BERT re-rankers with it. The instructions are provided below.

Citation

If you find our work useful, please consider citing us :)

@article{zhan2022disentangled,
  title={Disentangled Modeling of Domain and Relevance for Adaptable Dense Retrieval},
  author={Zhan, Jingtao and Ai, Qingyao and Liu, Yiqun and Mao, Jiaxin and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  journal={arXiv preprint arXiv:2208.05753},
  year={2022}
}
