Releases: beir-cellar/beir
v2.0.0: We are back with bugfixes and improving BEIR after a long break
After a long, stale year with no changes, I've merged many pull requests and made changes to the BEIR code. You can find the latest changes below:
1. Heap Queue for keeping track of top-k documents when evaluating with dense retrieval.
Thanks to @kwang2049, starting from v2.0.0, we include a heap queue for keeping track of the top-k documents when using the DenseRetrievalExactSearch class. This considerably reduces RAM consumption, especially when evaluating large corpora such as MS MARCO or BioASQ.
The logic remains the same for keeping track of elements during the chunking of the collection.
- If your heapq is less than k size, push the item, i.e. the document, into the heap.
- If your heapq is at max k size and the item is larger than the smallest item in the heap, push the item onto the heap and then pop the smallest element.
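The two rules above can be sketched with Python's standard heapq module. This is a minimal illustration rather than the code used inside DenseRetrievalExactSearch; the function names update_top_k and top_k_over_chunks, and the representation of a chunk as a dict of doc_id to score, are assumptions made for the example.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def update_top_k(heap: List[Tuple[float, str]], score: float, doc_id: str, k: int) -> None:
    """Keep at most k (score, doc_id) pairs in a min-heap (hypothetical helper)."""
    if len(heap) < k:
        # Heap is not full yet: push the document.
        heapq.heappush(heap, (score, doc_id))
    elif score > heap[0][0]:
        # Heap is full and the new score beats the smallest one:
        # push the new item, then pop the smallest element.
        heapq.heappushpop(heap, (score, doc_id))


def top_k_over_chunks(chunks: Iterable[Dict[str, float]], k: int) -> Dict[str, float]:
    """Scan score chunks one at a time while holding only k items in memory."""
    heap: List[Tuple[float, str]] = []
    for chunk in chunks:
        for doc_id, score in chunk.items():
            update_top_k(heap, score, doc_id, k)
    # Return the survivors ordered from highest to lowest score.
    return {doc_id: score for score, doc_id in sorted(heap, reverse=True)}


chunks = [{"d1": 0.2, "d2": 0.9}, {"d3": 0.5, "d4": 0.1}]
print(top_k_over_chunks(chunks, k=2))
```

Because the heap never holds more than k entries per query, memory stays bounded no matter how many corpus chunks are scanned.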
2. Removed all major typing errors from the BEIR code.
We removed all typing errors from the BEIR code by implementing an abstract base class for search. The base class function takes in the corpus, queries, and a top-k value. It returns the results, where each query_id maps to its corresponding doc_id and score.
from abc import ABC, abstractmethod
from typing import Dict

class BaseSearch(ABC):
    @abstractmethod
    def search(self,
               corpus: Dict[str, Dict[str, str]],
               queries: Dict[str, str],
               top_k: int,
               **kwargs) -> Dict[str, Dict[str, float]]:
        pass
Example: evaluate_sbert_multi_gpu.py
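To illustrate the interface, here is a minimal sketch of a subclass. TokenOverlapSearch is a hypothetical toy retriever that is not part of BEIR; it only shows the expected input and output shapes, and the base class is repeated so the snippet runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseSearch(ABC):
    @abstractmethod
    def search(self,
               corpus: Dict[str, Dict[str, str]],
               queries: Dict[str, str],
               top_k: int,
               **kwargs) -> Dict[str, Dict[str, float]]:
        pass


class TokenOverlapSearch(BaseSearch):
    """Hypothetical toy retriever: score = number of tokens shared with the query."""

    def search(self, corpus, queries, top_k, **kwargs):
        results: Dict[str, Dict[str, float]] = {}
        for query_id, query in queries.items():
            query_tokens = set(query.lower().split())
            scores: Dict[str, float] = {}
            for doc_id, doc in corpus.items():
                text = (doc.get("title", "") + " " + doc.get("text", "")).lower()
                overlap = len(query_tokens & set(text.split()))
                if overlap > 0:
                    scores[doc_id] = float(overlap)
            # Keep only the top_k highest-scoring documents for this query.
            ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
            results[query_id] = dict(ranked)
        return results


corpus = {
    "d1": {"title": "", "text": "dense retrieval with faiss"},
    "d2": {"title": "", "text": "cooking pasta"},
}
queries = {"q1": "dense retrieval"}
print(TokenOverlapSearch().search(corpus, queries, top_k=1))
```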
3. Updated Faiss Code to include GPU options.
I added a GPU option to the FaissSearch base class. Using the GPU can reduce latency immensely. However, it sometimes takes time to transfer the faiss index from CPU to GPU. Pass the use_gpu=True parameter to the DenseRetrievalFaissSearch class to use the GPU for faiss inference with PQ, PCA, or FlatIP search.
4. New publication -- Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard.
We have a new publication, where we describe our official leaderboard hosted on eval.ai and provide reproducible reference models on BEIR using the Pyserini Repository (https://github.com/castorini/pyserini).
Link to the arxiv version: https://arxiv.org/abs/2306.07471
If you use numbers from our leaderboard, please cite the following paper:
@misc{kamalloo2023resources,
  title={Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard},
  author={Ehsan Kamalloo and Nandan Thakur and Carlos Lassance and Xueguang Ma and Jheng-Hong Yang and Jimmy Lin},
  year={2023},
  eprint={2306.07471},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
v1.0.1: Multi-GPU, HF dataloaders, MonoT5 rerankers and a brand new Wiki page
There have been multiple changes to the repository since the last version. You can find the latest changes below:
1. Brand New Wiki page for BEIR
Starting from v1.0.1, we have created a new Wiki page for the BEIR benchmark. We will keep it updated with the latest datasets available, examples of how you can evaluate your models on BEIR, the leaderboard, etc. Correspondingly, we have shortened our README.md to display only the necessary information.
For a full overview, you can view the BEIR Wiki here: https://github.com/beir-cellar/beir/wiki.
2. Multi GPU evaluation with SBERT dense retrievers using Distributed Evaluation
Thanks to @NouamaneTazi, we now support multi-GPU evaluation for SBERT models across all datasets in BEIR. This benefits evaluation on large datasets such as BioASQ, where encoding takes at least 1 day to complete on a single GPU. With access to multiple GPUs, one can evaluate large datasets quickly compared to the old single-GPU evaluation. The only caveat: running on multiple GPUs requires the evaluate library to be installed, which requires Python >= 3.7.
Example: evaluate_sbert_multi_gpu.py
3. Hugging Face Data loader for BEIR dataset. Uploaded all datasets on HF.
We added Hugging Face data loaders for all the public BEIR datasets. One can use them to easily work with the BEIR datasets available on Hugging Face. We also made available the corpus and queries (e.g. BeIR/fiqa) and the qrels (e.g. BeIR/fiqa-qrels) for all public BEIR datasets on Hugging Face. This means one no longer needs to download the datasets and keep them locally in RAM. Again, thanks to @NouamaneTazi.
You can find all datasets here: https://huggingface.co/BeIR
Example: evaluate_sbert_hf_loader.py
4. Added support for the T5 reranking model: monoT5 reranker
We added support for the monoT5 reranking model within BEIR. These are stronger (but more complex) rerankers that can be used to attain the best reranking performance currently on the BEIR benchmark.
Example: evaluate_bm25_monot5_reranking.py
5. Fix: Add ignore_identical_ids with BEIR evaluation
Thanks to @kwang2049, we added a check to ignore identical ids within the evaluation script. Identical ids cause issues with the ArguAna and Quora datasets in particular, since a document and a query there can be alike (with the same id). By default, we remove these ids and evaluate the dataset accordingly. With this fix, one can evaluate Quora and ArguAna and obtain accurate and reproducible nDCG@10 scores.
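The idea behind the check can be sketched in a few lines. This is an illustration of the filtering step only, not the evaluation script itself; the helper name drop_identical_ids is an assumption made for the example.

```python
from typing import Dict


def drop_identical_ids(results: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Return a copy of results without documents whose id equals the query id."""
    return {
        query_id: {
            doc_id: score
            for doc_id, score in doc_scores.items()
            if doc_id != query_id  # a query must not retrieve itself
        }
        for query_id, doc_scores in results.items()
    }


results = {"q1": {"q1": 1.0, "d2": 0.7}, "q2": {"d3": 0.4}}
print(drop_identical_ids(results))
```

Without this step, a query that also exists as a document would trivially retrieve itself at the top rank and distort the metric.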
6. Added HNSWSQ method in faiss retrieval methods
We added support for the HNSWSQ faiss index method as a memory compression-based technique to evaluate across the BEIR datasets.
7. Added dependency of datasets library within setup.py
In order to support HF data loaders, we added the dependency on the datasets library within our setup.py.
v1.0.0: BEIR is back with a brand new organization of its own moving forward, New sparse model releases, ColBERT evaluation and fixing breaking changes
This is a major release since the last version v0.2.3.
1. New BEIR Organization and moving forward will be part of a collaboration
The BEIR benchmark has been shifted from UKPLab to beir-cellar. Moving forward, the BEIR benchmark will be actively maintained and developed with the help of @UKPLab, @castorini, and @huggingface.
2. ColBERT model evaluation on the BEIR benchmark code released
The ColBERT model evaluation on the BEIR benchmark has been released. This code repository uses the original ColBERT repository for evaluation and training with a few tweaks.
Here is the repository for more details: https://github.com/NThakur20/beir-ColBERT
3. New Passage Expansion Model added: TILDE
Since DocT5query is compute-intensive and time-consuming to generate, we added a faster passage expansion model: TILDE (https://arxiv.org/abs/2108.08513), which expands documents with relevant keywords present within the BERT vocabulary. An easy example using TILDE can be found here: passage_expansion_tilde.py
4. Upcoming New work for Easy evaluation of Neural Sparse Retrieval Models
We are currently developing a new repository for easy evaluation of neural sparse models, including an inverted index implementation. This will enable a unified evaluation of diverse neural sparse retrieval models such as uniCOIL, SPLADE, SPARTA and DeepImpact.
An initial repository for this work and more details can be found here: https://github.com/NThakur20/sparse-retrieval.
5. Fixed breaking changes and reproducibility in Elasticsearch
#58 showed issues in ES lexical search reproducibility and downloading Elasticsearch client.
- Added a sleep_for parameter in the ES code with a default value of 2 seconds. This forces a pause after index deletion and after indexing documents.
- During bulk indexing (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html), there is a refresh parameter, which I have set to wait_for instead of leaving it at the default of false. For more details, refer here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html.
- Froze the ES version in beir to elasticsearch==7.9.1, which will help us avoid the latest issues occurring with ES policies.
6. Temporary Packages: Tensorflow
Tensorflow installation was causing issues while pip installing beir. Only USE models were evaluated using TF; however, they are currently not the most popular choice of models. Hence, we decided to make ['tensorflow>=2.2.0', 'tensorflow-text', 'tensorflow-hub'] optional packages, which can be installed separately in case a user wishes to evaluate the USE model or use TF for their own use case: pip install beir[tf].
7. Fixed breaking changes in sparse search in SparseRetrieval
As notified in #62, we have fixed a bug found in the sparse retrieval code for evaluating SPARTA on the BEIR benchmark.
v0.2.3: NeurIPS acceptance, Multilingual datasets and Top-k accuracy metric fixed
This is a small release update!
1. BEIR Benchmark paper accepted at NeurIPS 2021 (Datasets and Benchmark Track)
I'm quite thrilled to share that BEIR has been accepted at the NeurIPS 2021 conference. All reviewers gave positive reviews and found the benchmark useful for the community. More information can be found here: https://openreview.net/forum?id=wCu6T5xFjeJ.
2. New Multilingual datasets added within BEIR
New multilingual datasets have been added to the BEIR Benchmark. Now BEIR supports 10+ languages. We included mMARCO, the translated MSMARCO dataset in 8 languages (https://github.com/unicamp-dl/mMARCO), and Mr.TyDi, which contains train, development, and test data across 10 languages (https://github.com/castorini/mr.tydi). We hope to provide good and robust dense multilingual retrievers in the future.
3. Breaking change in Top-k accuracy now fixed
The top-k accuracy metric was mistakenly sorting by retrieved keys instead of by retriever model scores, which would have led to incorrect scores. This mistake was identified in #45, and the fix has now been merged.
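A minimal sketch of the corrected behaviour is shown below. It assumes the DPR-style definition in which a query counts as a hit when at least one relevant document appears among its k highest-scoring results; the function is illustrative and is not the code merged for #45.

```python
from typing import Dict


def top_k_accuracy(qrels: Dict[str, Dict[str, int]],
                   results: Dict[str, Dict[str, float]],
                   k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not results:
        return 0.0
    hits = 0
    for query_id, doc_scores in results.items():
        # Sort by retriever score (descending), not by document id.
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        top_ids = [doc_id for doc_id, _ in ranked[:k]]
        relevant = {doc_id for doc_id, rel in qrels.get(query_id, {}).items() if rel > 0}
        if any(doc_id in relevant for doc_id in top_ids):
            hits += 1
    return hits / len(results)


qrels = {"q1": {"a": 1}, "q2": {"z": 1}}
results = {"q1": {"a": 0.1, "b": 0.9}, "q2": {"y": 0.2, "z": 0.8}}
print(top_k_accuracy(qrels, results, k=1))  # 0.5
print(top_k_accuracy(qrels, results, k=2))  # 1.0
```

For q1 the relevant document "a" has the lower score, so it only counts as a hit once k reaches 2. Sorting by key instead of by score would rank "a" first and report a hit at k=1, which is the kind of error described above.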
4. Yannic Kilcher recognized BEIR as a helpful ML library
Yannic Kilcher recently mentioned the BEIR repository as a helpful library for benchmarking and evaluating diverse IR models and architectures. You can find more details in his latest ML News video on YouTube: https://www.youtube.com/watch?v=K3cmxn5znyU&t=1290s&ab_channel=YannicKilcher
v0.2.2: Margin-MSE loss for training dense retrievers and Open-NLP Meetup
This is a small release update! We made the following changes in the release of the beir package:
1. Now train dense retrievers (SBERT) models using Margin-MSE loss
We have added a new loss, Margin-MSE, which learns a pairwise score using hard negatives and positives. Thanks to @kwang2049 for the motivation, we have now added the loss to the beir repo. The loss is most effective with a Knowledge distillation setup using a powerful teacher model. For more details, we would suggest you refer to the paper by Hofstätter et al., https://arxiv.org/abs/2010.02666.
Margin-MSE Loss function: https://github.com/UKPLab/beir/blob/main/beir/losses/margin_mse_loss.py
Train (SOTA) SBERT model using Margin-MSE: https://github.com/UKPLab/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py
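As a rough illustration of what the loss computes, the sketch below follows the formulation in Hofstätter et al.: the mean squared error between the student's score margin (positive minus negative) and the teacher's score margin. It is written in plain Python for readability and is not the margin_mse_loss.py implementation; the function name and the list-based inputs are assumptions.

```python
from typing import Sequence


def margin_mse(student_pos: Sequence[float],
               student_neg: Sequence[float],
               teacher_pos: Sequence[float],
               teacher_neg: Sequence[float]) -> float:
    """Mean squared error between student and teacher score margins."""
    sizes = {len(student_pos), len(student_neg), len(teacher_pos), len(teacher_neg)}
    if len(sizes) != 1 or 0 in sizes:
        raise ValueError("all score lists must be non-empty and of equal length")
    squared_errors = [
        ((s_pos - s_neg) - (t_pos - t_neg)) ** 2
        for s_pos, s_neg, t_pos, t_neg in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(squared_errors) / len(squared_errors)


# Student margins are 2 and 0, teacher margins are 4 and 2.
print(margin_mse([3.0, 1.0], [1.0, 1.0], [5.0, 2.0], [1.0, 0.0]))  # 4.0
```

The student is only asked to match the teacher's margin between the positive and the hard negative, not the teacher's absolute scores.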
2. Spoke about Neural Search and BEIR in the OpenNLP Meetup on 12.08.2021
I had fun speaking about BEIR and Neural Search in a recent OpenNLP event on benchmarking search using BEIR.
If you are interested, the talk was recorded and is available below:
YouTube: https://www.youtube.com/watch?v=e9nNr4ugNAo
Slides: https://drive.google.com/file/d/1gghRVv6nWWmMZRqkYvuCO0HWOTEVnPNz/view?usp=sharing
3. Added Splits for each dataset in the datasets table present in README
I plan to add the new, large msmarco-v2 version of the passage collection soon; it contains 138,364,198 passages (13.5 GB). The dataset contains two dev splits (dev1, dev2). Adding splits is useful for incorporating datasets that don't follow the traditional convention of single train, dev and test splits.
v0.2.1: Bugs and Datasets Fixed and Minor Updates
1. New script to utilize docT5query in parallel with multiple GPUs!
- Thanks to @joshdevins, we have a new script that uses multiple GPUs in parallel to generate multiple queries per passage faster with a question generation model. Check it out [here].
- You can now pass your custom GPU device if CUDA recognizable devices are not present for question generation.
2. PQ Hashing with OPQ Rotation and Scalar Quantizer from Faiss!
- Now you can utilize OPQ rotation before using PQ hashing, and a Scalar Quantizer for fp16 faiss search instead of the original fp32.
3. Top-k Accuracy Metric which is commonly used in the DPR repository by facebook!
- The DPR repository evaluates retrieval models using top-k retriever accuracy. You can now evaluate top-k accuracy using the BEIR repository!
top_k_accuracy = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="top_k_accuracy")
4. Sorting of corpus documents by text length before encoding using a dense retriever!
- We now sort the corpus documents by length, longest first. This has two advantages:
  - Why sort? Texts of similar length are now encoded within a single batch, which helps speed up the corpus encoding process.
  - Why sort longest to shortest? The maximum GPU memory required is known at the beginning, so if an OOM occurs, it occurs at the beginning.
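The sorting step can be sketched as follows. This is a simplified illustration that measures length in characters of title plus text; the helper names sort_corpus_by_length and batches are assumptions, not functions from the BEIR codebase.

```python
from typing import Dict, Iterator, List


def sort_corpus_by_length(corpus: Dict[str, Dict[str, str]]) -> List[str]:
    """Return doc ids ordered from the longest to the shortest document."""
    return sorted(
        corpus,
        key=lambda doc_id: len(corpus[doc_id].get("title", "") + corpus[doc_id].get("text", "")),
        reverse=True,
    )


def batches(doc_ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches, so each batch holds texts of similar length."""
    for start in range(0, len(doc_ids), batch_size):
        yield doc_ids[start:start + batch_size]


corpus = {
    "a": {"title": "", "text": "xx"},
    "b": {"text": "xxxxx"},
    "c": {"title": "t", "text": "xxx"},
}
ordered = sort_corpus_by_length(corpus)
print(ordered)                    # ['b', 'c', 'a']
print(list(batches(ordered, 2)))  # [['b', 'c'], ['a']]
```

The first batch contains the longest texts, so it is the most memory-hungry one; if it fits on the GPU, the remaining batches should fit as well.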
5. FEVER dataset training qrels, problems with doc-ids with special characters now fixed!
- There were issues with training qrels in the FEVER dataset. The doc-ids with special characters, e.g. Zlatan_Ibrahimović or Beyoncé, had the wrong special characters present in qrels/train.tsv. I fixed these manually, and there are no more similar issues present in the corpus.
- New md5hash for the fever.zip dataset: 5a818580227bfb4b35bb6fa46d9b6c03.
v0.2.0: New Features Integrated with BEIR
FAISS Indexes and Search Integration
- FAISS Indexes can be created and used for evaluation using the BEIR repository. We have added support for Flat-IP, HNSW, PQ, PCAMatrix, and BinaryFlat Indexes.
- Faiss indexes use various compression algorithms useful for reducing Index memory sizes or improving retrieval speed.
- You can also save your corpus embeddings as a faiss index, which wasn't possible with the exact search originally.
- Check out how to evaluate dense retrieval using a faiss index [here] and dimension reduction using PCA [here].
Multilingual Datasets and Evaluation
- Thanks to @julian-risch, we have added our first multilingual dataset to the BEIR repository - GermanQuAD (German SQuAD dataset).
- We have changed Elasticsearch now to allow evaluation on languages apart from English, check it out [here].
- We have also added a DPR model class which lets you load DPR models from the Hugging Face repo; you can use this class for evaluation of, for example, the GermanDPR model [link].
DeepCT evaluation
- We have adapted the original DeepCT code to work with tensorflow (tf) >v2.0 and now host the latest repo [here].
- Using the hosted code, we are now able to use DeepCT for evaluation in BEIR using Anserini Retrieval, check [here].
Training Latest MSMARCO v3 Models
- From the SentenceTransformers repository, we have integrated the latest training code for MSMARCO on custom manually provided hard negatives. This provides the state-of-the-art SBERT models trained on MSMARCO, check [here].
Using Multiple-GPU for question-generation
- A big challenge was using multiple GPUs to generate questions much faster. We have included process pools to generate questions much faster, now also using multiple GPUs in parallel, check [here].
Integration of Binary Passage Retrievers (BPR)
- BPR (ACL'21, link) is now integrated within the BEIR benchmark. Now you can easily train a state-of-the-art BPR model on MSMARCO using the loss function described in the original paper, check [here].
- You can now also easily evaluate BPR in a zero-shot fashion, check [here].
- We will soon open-source the BPR public models trained on MSMARCO.