Releases: beir-cellar/beir

v2.0.0: We are back with bugfixes and improving BEIR after a long break

03 Aug 21:42

After a long, stale year with no changes, I've merged many pull requests and made changes to the BEIR code. You can find the latest changes below:

1. Heap Queue for keeping track of top-k documents when evaluating with dense retrieval.

Thanks to @kwang2049, starting from v2.0.0, we include a heap queue for keeping track of top-k documents when using the DenseRetrievalExactSearch class. This considerably reduces the RAM consumed, especially during the evaluation of large corpora such as MS MARCO or BioASQ.

The logic remains the same for keeping track of elements during the chunking of the collection.

  • If the heap holds fewer than k items, push the item (i.e., the document) onto the heap.
  • If the heap already holds k items and the new item is larger than the smallest item in the heap, push the new item onto the heap and then pop the smallest element.
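The two rules above can be sketched with Python's standard heapq module. This is a minimal illustration, not the code in DenseRetrievalExactSearch; the helper names update_top_k and top_k_over_chunks, and the chunk format (a dict of doc_id to score), are assumptions made for the example.

```python
import heapq
from typing import Dict, Iterable, List, Tuple


def update_top_k(heap: List[Tuple[float, str]], score: float, doc_id: str, k: int) -> None:
    """Keep only the k highest-scoring (score, doc_id) pairs in a min-heap."""
    if len(heap) < k:
        # The heap is not full yet: push unconditionally.
        heapq.heappush(heap, (score, doc_id))
    elif score > heap[0][0]:
        # The heap is full and the new score beats the current minimum:
        # push the new item and pop the smallest one in a single step.
        heapq.heappushpop(heap, (score, doc_id))


def top_k_over_chunks(chunks: Iterable[Dict[str, float]], k: int) -> Dict[str, float]:
    """Scan the collection chunk by chunk while holding at most k items in memory."""
    heap: List[Tuple[float, str]] = []
    for chunk in chunks:
        for doc_id, score in chunk.items():
            update_top_k(heap, score, doc_id, k)
    # Return the survivors ordered from highest to lowest score.
    return {doc_id: score for score, doc_id in sorted(heap, reverse=True)}


# Hypothetical scores for one query, split across three corpus chunks.
chunks = [{"d1": 0.2, "d2": 0.9}, {"d3": 0.5, "d4": 0.1}, {"d5": 0.7}]
top_2 = top_k_over_chunks(chunks, k=2)
```

Because the heap never grows beyond k entries per query, memory stays proportional to k rather than to the corpus size.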

2. Removed all major typing errors from the BEIR code.

We removed all major typing errors from the BEIR code by implementing an abstract base class for search. The base class's search function takes the corpus, the queries, and a top-k value, and returns the results: for each query_id, the corresponding doc_ids and their scores.

from abc import ABC, abstractmethod
from typing import Dict

class BaseSearch(ABC):

    @abstractmethod
    def search(self,
               corpus: Dict[str, Dict[str, str]],
               queries: Dict[str, str],
               top_k: int,
               **kwargs) -> Dict[str, Dict[str, float]]:
        pass

Example: evaluate_sbert_multi_gpu.py
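To illustrate the interface, here is a small sketch of a custom retriever that subclasses BaseSearch. The BaseSearch definition is repeated so the snippet runs on its own. ToyOverlapSearch and its token-overlap scoring are hypothetical and are not part of BEIR; the sketch also assumes each corpus entry may carry "title" and "text" fields.

```python
from abc import ABC, abstractmethod
from typing import Dict


class BaseSearch(ABC):
    @abstractmethod
    def search(self,
               corpus: Dict[str, Dict[str, str]],
               queries: Dict[str, str],
               top_k: int,
               **kwargs) -> Dict[str, Dict[str, float]]:
        pass


class ToyOverlapSearch(BaseSearch):
    """Hypothetical retriever: scores a document by its number of shared lowercase tokens."""

    def search(self, corpus, queries, top_k, **kwargs):
        results: Dict[str, Dict[str, float]] = {}
        for query_id, query in queries.items():
            query_tokens = set(query.lower().split())
            scores: Dict[str, float] = {}
            for doc_id, doc in corpus.items():
                text = (doc.get("title", "") + " " + doc.get("text", "")).lower()
                overlap = len(query_tokens & set(text.split()))
                if overlap > 0:
                    scores[doc_id] = float(overlap)
            # Keep the top_k highest-scoring documents for this query.
            ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
            results[query_id] = dict(ranked)
        return results


corpus = {
    "d1": {"title": "Heap", "text": "a heap keeps the smallest item on top"},
    "d2": {"title": "Faiss", "text": "faiss builds vector indexes"},
}
queries = {"q1": "heap item"}
results = ToyOverlapSearch().search(corpus, queries, top_k=10)
```

The returned dictionary maps each query_id to a dictionary of doc_id and score, which is the shape described above.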

3. Updated Faiss Code to include GPU options.

I added a GPU option to the FaissSearch base class. Using the GPU can reduce latency considerably, although transferring the faiss index from CPU to GPU can take some time. Pass use_gpu=True to the DenseRetrievalFaissSearch class to use the GPU for faiss inference with PQ, PCA, or FlatIP search.

4. New publication -- Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard.

We have a new publication, where we describe our official leaderboard hosted on eval.ai and provide reproducible reference models on BEIR using the Pyserini Repository (https://github.com/castorini/pyserini).

Link to the arxiv version: https://arxiv.org/abs/2306.07471

If you use numbers from our leaderboard, please cite the following paper:

@misc{kamalloo2023resources,
      title={Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard}, 
      author={Ehsan Kamalloo and Nandan Thakur and Carlos Lassance and Xueguang Ma and Jheng-Hong Yang and Jimmy Lin},
      year={2023},
      eprint={2306.07471},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

v1.0.1: Multi-GPU, HF dataloaders, MonoT5 rerankers and a brand new Wiki page

30 Jun 23:25

The repository has changed in multiple ways since the last version. You can find the latest changes below:

1. Brand New Wiki page for BEIR

Starting from v1.0.1, we have created a new Wiki page for the BEIR benchmark. We will keep it updated with the latest available datasets, examples of how you can evaluate your models on BEIR, the leaderboard, etc. Correspondingly, we have shortened our README.md to display only the necessary information. For a full overview, one can view the BEIR Wiki.

You can view the BEIR Wiki here: https://github.com/beir-cellar/beir/wiki.

2. Multi GPU evaluation with SBERT dense retrievers using Distributed Evaluation

Thanks to @NouamaneTazi, we now support multi-GPU evaluation for SBERT models across all datasets in BEIR. This benefits evaluation on large datasets such as BioASQ, where encoding takes at least 1 day to complete on a single GPU. With access to multiple GPUs, one can now evaluate large datasets much faster than with the old single-GPU evaluation. The only caveat: running on multiple GPUs requires the evaluate library to be installed, which requires Python >= 3.7.

Example: evaluate_sbert_multi_gpu.py

3. Hugging Face Data loader for BEIR dataset. Uploaded all datasets on HF.

We added Hugging Face data loaders for all the public BEIR datasets, which make it easy to work with the BEIR datasets available on Hugging Face. For every public BEIR dataset, we also made the corpus and queries (e.g., BeIR/fiqa) and the qrels (e.g., BeIR/fiqa-qrels) available on Hugging Face. This means one no longer needs to download the datasets and keep them locally in RAM. Thanks again to @NouamaneTazi.

You can find all datasets here: https://huggingface.co/BeIR
Example: evaluate_sbert_hf_loader.py

4. Added support for the T5 reranking model: monoT5 reranker

We added support for the monoT5 reranking model within BEIR. These are stronger (but more complex) rerankers that can be used to attain the best reranking performance currently on the BEIR benchmark.

Example: evaluate_bm25_monot5_reranking.py

5. Fix: Add ignore_identical_ids with BEIR evaluation

Thanks to @kwang2049, we added a check to ignore identical ids within the evaluation script. Identical ids cause issues with the ArguAna and Quora datasets in particular, as a document and a query there can be alike (with the same id). By default, we remove these ids and evaluate the dataset accordingly. With this fix, one can evaluate Quora and ArguAna and obtain accurate and reproducible nDCG@10 scores.
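The idea behind the check can be sketched as a filter over the retrieval results. This is an illustration under assumptions, not the evaluation script itself: the function name drop_identical_ids is hypothetical, and results are assumed to map each query_id to a dictionary of doc_id and score.

```python
from typing import Dict


def drop_identical_ids(results: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Return a copy of results without entries whose doc_id equals the query_id."""
    return {
        query_id: {
            doc_id: score
            for doc_id, score in doc_scores.items()
            if doc_id != query_id
        }
        for query_id, doc_scores in results.items()
    }


# Hypothetical ArguAna-style results where the query itself is also a document.
raw_results = {"arg-1": {"arg-1": 1.0, "arg-7": 0.8}, "arg-2": {"arg-9": 0.6}}
filtered = drop_identical_ids(raw_results)
```

Without the filter, a query that is also present in the corpus would typically retrieve itself at rank 1 and distort the ranking metrics.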

6. Added HNSWSQ method in faiss retrieval methods

We added support for the HNSWSQ faiss index method as a memory compression-based technique to evaluate across the BEIR datasets.

7. Added dependency of datasets library within setup.py

In order to support HF data loaders, we added the dependency of the datasets library within our setup.py.

v1.0.0: BEIR is back with a brand new organization of its own moving forward, New sparse model releases, ColBERT evaluation and fixing breaking changes

21 Mar 17:42

This is a major release since the last version v0.2.3.

1. New BEIR Organization and moving forward will be part of a collaboration

The BEIR benchmark has been shifted from UKPLab to beir-cellar. Moving forward, the BEIR benchmark will be actively maintained and developed with the help of @UKPLab, @castorini, and @huggingface.

2. ColBERT model evaluation on the BEIR benchmark code released

The ColBERT model evaluation on the BEIR benchmark has been released. This code repository uses the original ColBERT repository for evaluation and training with a few tweaks.

Here is the repository for more details: https://github.com/NThakur20/beir-ColBERT

3. New Passage Expansion Model added: TILDE

Since DocT5query is compute-intensive and time-consuming to generate, we added a faster passage expansion model, TILDE (https://arxiv.org/abs/2108.08513), which expands documents with relevant keywords from the BERT vocabulary. A simple example using TILDE can be found here: passage_expansion_tilde.py

4. Upcoming New work for Easy evaluation of Neural Sparse Retrieval Models

We are currently developing a new repository for easy evaluation of neural sparse models, including an inverted index implementation. This will enable a unified evaluation of diverse neural sparse retrieval models such as uniCOIL, SPLADE, SPARTA, and DeepImpact.

An initial repository for this work and more details can be found here: https://github.com/NThakur20/sparse-retrieval.

5. Fixed breaking changes and reproducibility in Elasticsearch

#58 reported problems with the reproducibility of ES lexical search and with downloading the Elasticsearch client.

  1. Added a sleep_for parameter in the ES code with a default value of 2 seconds. This makes the code sleep after index deletion and after indexing documents.
  2. During bulk indexing (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html), the refresh parameter is now set to wait_for instead of its default value of false. For more details, refer here: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html.
  3. Pinned the ES client version in beir to elasticsearch==7.9.1, which helps us avoid the recent issues caused by changes in ES policies.

6. Temporary Packages: Tensorflow

Installing Tensorflow was causing issues when pip installing beir. Only the USE models were evaluated using TF, and they are currently not the most popular choice of model. Hence, we made ['tensorflow>=2.2.0', 'tensorflow-text', 'tensorflow-hub'] optional packages, which can be installed separately if a user wishes to evaluate the USE model or use TF for their own use case: pip install beir[tf].

7. Fixed breaking changes in sparse search in SparseRetrieval

As reported in #62, we have fixed a bug in the sparse retrieval code for evaluating SPARTA on the BEIR benchmark.

v0.2.3: NeurIPS acceptance, Multilingual datasets and Top-k accuracy metric fixed

22 Oct 21:19

This is a small release update!

1. BEIR Benchmark paper accepted at NeurIPS 2021 (Datasets and Benchmark Track)

I'm thrilled to share that BEIR has been accepted at the NeurIPS 2021 conference. All reviewers gave positive reviews and found the benchmark useful for the community. More information can be found here: https://openreview.net/forum?id=wCu6T5xFjeJ.

2. New Multilingual datasets added within BEIR

New multilingual datasets have been added to the BEIR benchmark. BEIR now supports more than 10 languages. We included mMARCO, the translated MSMARCO dataset in 8 languages (https://github.com/unicamp-dl/mMARCO), and Mr.TyDi, which contains train, development, and test data across 10 languages (https://github.com/castorini/mr.tydi). We hope to provide good and robust dense multilingual retrievers in the future.

3. Breaking change in Top-k accuracy now fixed

The top-k accuracy metric was mistakenly sorting the retrieved keys instead of the retriever model scores, which would have led to incorrect scores. This mistake was identified in #45, and the fix has now been merged.
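For reference, here is a minimal sketch of a DPR-style top-k accuracy computation that ranks documents by score rather than by key. It is an illustration under assumptions, not the BEIR implementation: the function name, the "Accuracy@k" labels, and the convention that a relevance value above 0 marks a relevant document are choices made for the example.

```python
from typing import Dict, List


def top_k_accuracy(qrels: Dict[str, Dict[str, int]],
                   results: Dict[str, Dict[str, float]],
                   k_values: List[int]) -> Dict[str, float]:
    """Fraction of queries with at least one relevant document among the top-k by score."""
    hits = {k: 0 for k in k_values}
    for query_id, doc_scores in results.items():
        # Rank doc_ids by their retrieval score, highest first (not by the ids themselves).
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
        relevant = {doc_id for doc_id, rel in qrels.get(query_id, {}).items() if rel > 0}
        for k in k_values:
            if relevant & set(ranked[:k]):
                hits[k] += 1
    num_queries = max(len(results), 1)  # avoid division by zero on empty input
    return {f"Accuracy@{k}": hits[k] / num_queries for k in k_values}


# Hypothetical judgments and retrieval scores for two queries.
qrels = {"q1": {"d3": 1}, "q2": {"d1": 1}}
results = {
    "q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1},
    "q2": {"d1": 0.2, "d2": 0.7},
}
accuracy = top_k_accuracy(qrels, results, k_values=[1, 2])
```

In this toy input, sorting the keys in descending order instead would place d3 first for q1 and wrongly count it as a hit, which is the kind of error described above.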

4. Yannic Kilcher recognized BEIR as a helpful ML library

Yannic Kilcher recently mentioned the BEIR repository as a helpful library for benchmarking and evaluating diverse IR models and architectures. You can find more details in his latest ML News video on YouTube: https://www.youtube.com/watch?v=K3cmxn5znyU&t=1290s&ab_channel=YannicKilcher

v0.2.2: Margin-MSE loss for training dense retrievers and Open-NLP Meetup

17 Aug 19:00

This is a small release update! We made the following changes in the release of the beir package:

1. Now train dense retrievers (SBERT) models using Margin-MSE loss

We have added a new loss, Margin-MSE, which learns a pairwise score using hard negatives and positives. Thanks to @kwang2049 for the motivation. The loss is most effective in a knowledge distillation setup using a powerful teacher model. For more details, we suggest you refer to the paper by Hofstätter et al., https://arxiv.org/abs/2010.02666.

Margin-MSE Loss function: https://github.com/UKPLab/beir/blob/main/beir/losses/margin_mse_loss.py
Train (SOTA) SBERT model using Margin-MSE: https://github.com/UKPLab/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py
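As a rough illustration of what the loss computes, the sketch below follows the formulation in Hofstätter et al.: the mean squared error between the student's score margin (positive minus negative) and the teacher's score margin. It uses plain Python lists instead of tensors and is not the code in margin_mse_loss.py; the function name and the example scores are assumptions.

```python
from typing import Sequence


def margin_mse(student_pos: Sequence[float],
               student_neg: Sequence[float],
               teacher_pos: Sequence[float],
               teacher_neg: Sequence[float]) -> float:
    """Mean squared error between student and teacher score margins."""
    squared_errors = [
        ((s_pos - s_neg) - (t_pos - t_neg)) ** 2
        for s_pos, s_neg, t_pos, t_neg in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for two (query, positive, hard negative) triplets.
loss = margin_mse(
    student_pos=[5.0, 4.0],
    student_neg=[3.0, 3.0],
    teacher_pos=[6.0, 2.0],
    teacher_neg=[3.0, 1.0],
)
```

The student margins here are 2.0 and 1.0 and the teacher margins are 3.0 and 1.0, so only the first triplet contributes to the loss. The student is trained to reproduce the teacher's margins, not its absolute scores.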

2. Spoke about Neural Search and BEIR at the OpenNLP Meetup on 12.08.2021

I had fun speaking about BEIR and Neural Search in a recent OpenNLP event on benchmarking search using BEIR.
If you are interested, the talk was recorded and is available below:

YouTube: https://www.youtube.com/watch?v=e9nNr4ugNAo
Slides: https://drive.google.com/file/d/1gghRVv6nWWmMZRqkYvuCO0HWOTEVnPNz/view?usp=sharing

3. Added Splits for each dataset in the datasets table present in README

I plan to add the new, large msmarco-v2 version of the passage collection soon; it contains 138,364,198 passages (13.5 GB). The dataset contains two dev splits (dev1, dev2). Listing splits is useful for datasets whose splits don't follow the traditional convention of a single train, dev, and test split.

v0.2.1: Bugs and Datasets Fixed and Minor Updates

19 Jul 16:07

1. New script to utilize docT5query in parallel with multiple GPUs!

  • Thanks to @joshdevins, we have a new script to utilize multiple GPUs in parallel to generate multiple queries for passages using a question generation model faster. Check it out [here].
  • You can now pass your custom GPU device if no CUDA-recognizable devices are present for question generation.

2. PQ Hashing with OPQ Rotation and Scalar Quantizer from Faiss!

  • Now you can utilize OPQ rotation before using PQ hashing and Scalar Quantizer for fp16 faiss search instead of original fp32.

3. Top-k Accuracy Metric which is commonly used in the DPR repository by facebook!

  • The DPR repository evaluates retrieval models using top-k retriever accuracy. You can now compute top-k accuracy with the BEIR repository:
top_k_accuracy = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="top_k_accuracy")

4. Sorting of corpus documents by text length before encoding using a dense retriever!

  • We now sort the corpus documents by length, longest first. This has two advantages:
    1. Why sort? Texts of similar length are now encoded within a single batch, which helps speed up corpus encoding.
    2. Why sort from longest to shortest? The maximum GPU memory required is known at the beginning, so if an OOM error occurs, it occurs at the beginning.
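The sorting step can be sketched as follows. This is a simplified illustration, not the code in the dense retrieval module: the helper names are hypothetical, and length is measured here as the character count of the title plus the text, which is an assumption.

```python
from typing import Dict, Iterator, List


def sort_corpus_by_length(corpus: Dict[str, Dict[str, str]]) -> List[str]:
    """Return doc_ids ordered from the longest to the shortest document."""
    return sorted(
        corpus,
        key=lambda doc_id: len(corpus[doc_id].get("title", "") + corpus[doc_id].get("text", "")),
        reverse=True,
    )


def batches(doc_ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches, so each batch holds documents of similar length."""
    for start in range(0, len(doc_ids), batch_size):
        yield doc_ids[start:start + batch_size]


# Hypothetical corpus with documents of very different lengths.
corpus = {
    "short": {"title": "", "text": "tiny"},
    "long": {"title": "A title", "text": "a much longer passage " * 20},
    "medium": {"title": "", "text": "a medium sized passage"},
}
ordered_ids = sort_corpus_by_length(corpus)
first_batch = next(batches(ordered_ids, batch_size=2))
```

The first batch contains the longest documents, so the peak memory requirement shows up on the first encoding step.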

5. FEVER dataset training qrels, problems with doc-ids with special characters now fixed!

  • There were issues with the training qrels in the FEVER dataset. Doc-ids with special characters, e.g., Zlatan_Ibrahimović or Beyoncé, had the wrong special characters in qrels/train.tsv. I fixed these manually, and there are no more such issues present in the corpus.
  • New md5hash for the fever.zip dataset: 5a818580227bfb4b35bb6fa46d9b6c03.

v0.2.0: New Features Integrated with BEIR

06 Jul 16:38

FAISS Indexes and Search Integration

  • FAISS Indexes can be created and used for evaluation using the BEIR repository. We have added support to Flat-IP, HNSW, PQ, PCAMatrix, and BinaryFlat Indexes.
  • Faiss indexes use various compression algorithms useful for reducing Index memory sizes or improving retrieval speed.
  • You can also save your corpus embeddings as a faiss index, which wasn't possible with the exact search originally.
  • Check out how to evaluate dense retrieval using a faiss index [here] and dimension reduction using PCA [here].

Multilingual Datasets and Evaluation

  • Thanks to @julian-risch, we have added our first multilingual dataset to the BEIR repository - GermanQuAD (German SQuAD dataset).
  • We have changed Elasticsearch now to allow evaluation on languages apart from English, check it out [here].
  • We have also added a DPR model class that lets you load DPR models from the Hugging Face repo. You can now use this class for evaluation of, for example, the GermanDPR model [link].

DeepCT evaluation

  • We have adapted the original DeepCT code to work with tensorflow (tf) > v2.0 and now host the latest repo [here].
  • Using the hosted code, we are now able to use DeepCT for evaluation in BEIR using Anserini Retrieval, check [here].

Training Latest MSMARCO v3 Models

  • From the SentenceTransformers repository, we have integrated the latest training code for MSMARCO on custom manually provided hard negatives. This provides the state-of-the-art SBERT models trained on MSMARCO, check [here].

Using Multiple-GPU for question-generation

  • A big challenge was using multiple GPUs to generate questions much faster. We have added process pools, which now also use multiple GPUs in parallel, to speed up question generation, check [here].

Integration of Binary Passage Retrievers (BPR)

  • BPR (ACL'21, link) is now integrated within the BEIR benchmark. Now you can easily train a state-of-the-art BPR model on MSMARCO using the loss function described in the original paper, check [here].
  • You can also now easily evaluate BPR in a zero-shot fashion, check [here].
  • We will soon open-source the public BPR models trained on MSMARCO.