Pyserini: A Conceptual Framework for Retrieval

This guide presents a conceptual framework for a representational approach to information retrieval that integrates dense and sparse representations into the same underlying (bi-encoder) architecture.

If you're a Waterloo student traversing the onboarding path, make sure you've first done all the exercises leading up to this guide, starting here. The previous step in the onboarding path is to reproduce BM25 baselines for the MS MARCO passage ranking task in Pyserini. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.

Learning outcomes for this guide, building on previous steps in the onboarding path:

  • Understand how sparse and dense representations can be viewed as variations in a bi-encoder architecture.
  • Be able to identify correspondences between Lucene indexing and retrieval operations within the above conceptual framework.
  • Be able to extract the BM25 vector representation of a document from a Lucene index and compute its inner product with respect to a query.
  • Understand the difference between dense and sparse representations, and between supervised (learned) and unsupervised (heuristic) representations.

Bi-Encoders

As a recap from here, this is the "core retrieval" problem that we're trying to solve:

Given an information need expressed as a query q, the text retrieval task is to return a ranked list of k texts {d1, d2 ... dk} from an arbitrarily large but finite collection of texts C = {di} that maximizes a metric of interest, for example, nDCG, AP, etc.

How might we tackle the challenge? One approach, known as a bi-encoder (or dual-encoder) architecture, is presented below:

The approach is conceptually quite simple: Let's say we have two "encoders":

  • The document encoder takes a document and generates a representation of the document.
  • The query encoder takes a query and generates a representation of the query.

In addition, we have a comparison function that takes two representations (one from a document, the other from a query) and produces an estimate of relevance, i.e., the degree to which the document is relevant to the query. In other words, the comparison function produces a relevance score; alternatively, we call this a query-document score.

Let's assume that the encoders generate representations in the form of vectors, and that the score is computed in terms of the inner product (or dot product) between the document and query vectors.

Let's further assume that the encoders have been designed or built in such a way that the larger the score (i.e., inner product), the more relevant the document is to the query. (Exactly how, we'll see later.)
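As a tiny concrete example (with completely made-up two-dimensional representations, just to fix ideas), the comparison function under these assumptions reduces to a single dot product:

import numpy as np

# Made-up two-dimensional representations, purely for illustration.
doc_vector = np.array([0.8, 0.1])
query_vector = np.array([0.7, 0.3])

# The query-document score is just the inner product.
score = np.dot(query_vector, doc_vector)  # 0.7*0.8 + 0.3*0.1 = 0.59
print(score)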

Given this setup, how would we build a retrieval system? Well, here's one obvious way:

Step (1). Let's take the document collection (or corpus), i.e., C = {di}, and encode each document. So we have a bunch of vectors now, each corresponding to one document.

Step (2). When a query arrives, we need to encode it, i.e., generate its vector representation.

Step (3). In the final step, we need to find the k document vectors that have the highest query-document scores in terms of the inner product of their vector representations. We say k because in nearly all settings, k is specified externally, i.e., the user says, give me the top 10 hits. Hence, top-k retrieval.

Step (1) and step (2) are relatively straightforward given the document and query encoders. Encoding the document collection is embarrassingly parallel and encoding the query happens at search time.

Step (3) has a very naive implementation: take the query vector, compute its score with the vector of the first document, compute its score with the vector of the second document, compute its score with the vector of the third document, etc. Repeat until we iterate through all documents, keeping track of the top-k scores along the way (e.g., in a heap). In other words, compute all inner products in a brute-force manner.
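Here's what that naive implementation might look like, as a minimal sketch: the document vectors and query vector are assumed to be numpy arrays produced by whatever encoders we settled on above, and a min-heap keeps track of the top-k scores as we scan the collection.

import heapq
import numpy as np

def brute_force_topk(query_vec, doc_vecs, k=10):
    # doc_vecs is a list of (docid, vector) pairs; query_vec is a numpy array.
    heap = []  # min-heap of (score, docid); the weakest of the current top-k sits at the root
    for docid, doc_vec in doc_vecs:
        score = float(np.dot(query_vec, doc_vec))
        if len(heap) < k:
            heapq.heappush(heap, (score, docid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, docid))
    # Return the top-k hits sorted by decreasing score.
    return sorted(heap, reverse=True)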

Don't laugh, this isn't as ridiculous as it sounds! (For later, this is in fact what's happening with a FlatIP index in Faiss.) However, researchers have developed more efficient data structures and top-k retrieval algorithms for vectors of different types. As a preview: for sparse vectors, we use inverted indexes, and for dense vectors, we use HNSW.
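Here's a hedged sketch of what that looks like with a FlatIP index in Faiss, using random vectors as stand-ins for the output of a dense encoder (the dimension and collection size are arbitrary choices for illustration):

import faiss
import numpy as np

dim = 768                                                   # e.g., the output dimension of a dense encoder
doc_vectors = np.random.rand(1000, dim).astype('float32')   # stand-ins for encoded documents
query_vector = np.random.rand(1, dim).astype('float32')     # stand-in for an encoded query

index = faiss.IndexFlatIP(dim)   # "flat" = brute force, "IP" = inner product
index.add(doc_vectors)           # step (1): store every document vector

scores, ids = index.search(query_vector, 10)   # step (3): exhaustive top-10 search
print(ids[0], scores[0])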

BM25 as a Bi-Encoder

Now, okay, what does that have to do with retrieval using BM25?

Well, BM25 is simply an "instantiation" of the above bi-encoder framework: the document encoder generates BM25 vectors and the query encoder generates so-called multi-hot vectors, like this:

Wait, seriously?

Yup! Let's consider docid 7187158, the answer to the query about Paula Deen's brother:

Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ€¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba'sâ€

(The mangled characters at the end are encoding artifacts present in the MS MARCO passage itself; keep them in mind when you look at the terms in the vector below.)

This is the BM25 vector representation for that document:

{
    "be": 2.637899875640869,
    "brother": 4.09124231338501,
    "bubba": 7.102361679077148,
    "bubba's\u00e2": 11.091651916503906,
    "deen": 7.4197235107421875,
    "earl": 5.663764953613281,
    "former": 3.8262834548950195,
    "gener": 2.2932770252227783,
    "her": 2.7393782138824463,
    "hier": 8.24051284790039,
    "manag": 2.832794189453125,
    "paula": 6.438521862030029,
    "su": 5.404428005218506,
    "uncl": 5.362298488616943,
    "w": 3.9339818954467773
}

This requires a bit of explanation. As previously mentioned, BM25 is a so-called "bag-of-words" model, which means that the representation of the document is a sparse vector, where each dimension corresponds to a term in the vocabulary space. Usually, we say "terms" (or sometimes "tokens") instead of "words" because there's some amount of processing to tokenize the input document. Typically, words are converted into lower case, "stopwords" are discarded, and some form of stemming is applied (e.g., to remove suffixes).

In a bag-of-words representation, the dimension of the sparse vector is the size of the vocabulary space, i.e., the number of unique terms in the entire collection (i.e., all documents). It's called a "bag of words" because we're only keeping track of the individual terms in a document, neglecting other important aspects of "meaning" such as the order of the terms, linguistic relationships, etc. Sometimes, these are called sparse lexical vectors, to emphasize that their dimensions correspond to lexical items (i.e., terms in the vocabulary).

Since most of the entries in a sparse vector (representing a document) are zero, it can be conveniently represented in JSON, with the keys denoting the terms that have non-zero weights (or scores). This is what we've done above. The weights (or scores) of those terms are determined by the BM25 scoring function, which is a function of various term statistics such as the term frequency (the number of times a term appears in a document), the term's document frequency (the number of documents in the collection that contain the term), the document length, etc.

However, the high-level idea is that "important" terms get high weights, and "unimportant" terms get low weights. BM25 is just one (of many) scoring functions that attempt to capture this intuition.
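To make that intuition a bit more concrete, here's a sketch of a BM25 term weight in the style of Lucene's implementation; treat it as illustrative rather than bit-exact, since Lucene's actual computation differs in details (for example, document lengths are stored in a lossily compressed form). The parameter values k1=0.9 and b=0.4 are the defaults Pyserini uses for MS MARCO.

import math

def bm25_term_weight(tf, df, doclen, avg_doclen, num_docs, k1=0.9, b=0.4):
    # tf: term frequency in the document; df: number of documents containing the term;
    # doclen: document length; avg_doclen: average document length in the collection;
    # num_docs: total number of documents in the collection; k1, b: free parameters.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_component = tf / (tf + k1 * (1 - b + b * doclen / avg_doclen))
    return idf * tf_component

Terms that are rare in the collection (low document frequency) and frequent within the document end up with large weights, which is exactly the "important terms get high weights" intuition.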

Who came up with the BM25 scoring function? Its origins date back to the 1970s, but you can find a good overview from 2009 here.

So, the indexing phase with Lucene, as in the previous exercise you've done, corresponds to the green box below:

Conceptually, you've computed the BM25 document vector of every document in the collection and stored them in a data structure called an inverted index. (In reality, the inverted index only stores component statistics that allow you to reconstruct the BM25 document vectors.)

With the IndexReader class in Pyserini, you can materialize (i.e., reconstruct) the BM25 document vector for a particular document:

from pyserini.index.lucene import IndexReader
import json

index_reader = IndexReader('indexes/lucene-index-msmarco-passage')
tf = index_reader.get_document_vector('7187158')
bm25_weights = \
    {term: index_reader.compute_bm25_term_weight('7187158', term, analyzer=None) \
     for term in tf.keys()}

print(json.dumps(bm25_weights, indent=4, sort_keys=True))

The above snippet of code will generate exactly the same JSON shown above.
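If you're curious what those "component statistics" look like, IndexReader also exposes them directly. Continuing with the index_reader from the snippet above (the exact set of statistics reported may vary across Pyserini versions):

# Collection-level statistics: number of documents, unique terms, etc.
print(index_reader.stats())

# Per-term statistics for the (already analyzed) term 'deen':
# document frequency and collection frequency.
df, cf = index_reader.get_term_counts('deen', analyzer=None)
print(df, cf)

# Per-document statistics: term frequencies within docid 7187158.
print(index_reader.get_document_vector('7187158'))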

Once you've built an index, the retrieval stage corresponds to the purple box below:

The below snippet of code generates the query representation:

from pyserini.analysis import Analyzer, get_lucene_analyzer

analyzer = Analyzer(get_lucene_analyzer())
query_tokens = analyzer.analyze('what is paula deen\'s brother')
multihot_query_weights = {k: 1 for k in query_tokens}

The query tokens (query_tokens) are:

['what', 'paula', 'deen', 'brother']

The query representation is simply a sparse vector where all the query tokens get a score (weight) of one. This is also known as a "multi-hot vector". We represent the query vector as a Python dictionary in multihot_query_weights.

As described above, the top-k retrieval problem is to find the k documents from the collection that have the highest inner product (between a document vector and the query vector). Without getting into details, the inverted index allows this top-k to be computed efficiently.
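To give just a flavor of why (the real algorithms are more sophisticated), here's a toy sketch of an inverted index built over a tiny made-up collection: instead of storing one vector per document, we store, for each term, a postings list of the documents in which that term has a non-zero weight. At query time we only touch the postings lists of the query terms, so documents sharing no terms with the query are never scored at all.

from collections import defaultdict

# Toy collection: docid -> sparse document vector (term -> weight), made up for illustration.
doc_vectors = {
    'd1': {'paula': 6.4, 'deen': 7.4, 'brother': 4.1},
    'd2': {'paula': 5.9, 'cook': 3.2},
    'd3': {'weather': 2.5, 'report': 1.8},
}

# Indexing: invert the document vectors into term -> [(docid, weight), ...] postings lists.
inverted_index = defaultdict(list)
for docid, vec in doc_vectors.items():
    for term, weight in vec.items():
        inverted_index[term].append((docid, weight))

# Retrieval: accumulate scores only for documents that appear in some query term's postings.
query = {'paula': 1, 'deen': 1, 'brother': 1}
scores = defaultdict(float)
for term, query_weight in query.items():
    for docid, doc_weight in inverted_index.get(term, []):
        scores[docid] += query_weight * doc_weight

# 'd1' scores highest (~17.9), 'd2' follows (~5.9), and 'd3' is never even looked at.
print(sorted(scores.items(), key=lambda x: x[1], reverse=True))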

Here, we can manually compute the inner product between the query vector and the document vector:

import numpy as np

# Gather up the dimensions (i.e., the combined dictionary).
terms = set.union(set(bm25_weights.keys()), set(multihot_query_weights.keys()))

bm25_vec = np.array([ bm25_weights.get(t, 0) for t in terms ])
multihot_qvec = np.array([ multihot_query_weights.get(t, 0) for t in terms ])

np.dot(multihot_qvec, bm25_vec)

The dot product is 17.949487686157227.

In the code snippet above, we're first creating numpy vectors for the document (bm25_vec) and the query (multihot_qvec). The dimensions of the vectors are the (unique) union of the terms from the query and the document. Then we use numpy's dot product method to compute the query-document score.

In the approach above, we perform the operations explicitly, but it's a bit roundabout. Alternatively, we can compute the dot product as follows:

sum({term: bm25_weights[term] \
     for term in bm25_weights.keys() & \
     multihot_query_weights.keys()}.values())

Here, we are using a dictionary comprehension to generate a new dictionary that only contains the keys (terms) that appear in both the query and the document. Then we sum up the values (i.e., the term weights) to arrive at the query-document score. This is exactly the inner product of the multihot_qvec and bm25_vec vectors since multihot_qvec is a vector of zeros and ones. You should get exactly the same query-document score.

Let's try searching with the same query using Lucene:

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/lucene-index-msmarco-passage')
hits = searcher.search('what is paula deen\'s brother')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

You'll see that hit 1, docid 7187158, has a score of 17.94950, which is the same as the score we computed by hand (modulo rounding).

Taking stock of where we're at: This section explains how retrieval with BM25 (e.g., with Lucene) is an instantiation of the bi-encoder architecture. Repeat the same exercise above for a few other queries (make them up) to convince yourself.

Transformers in Bi-Encoders

Let's go back and look at the bi-encoder architecture again.

With BM25, the document encoder generates sparse lexical (or bag-of-words) vectors. We can imagine encoders generating other types of vector representations.

The two major axes of variation are:

  • The basis of the representation vectors (sparse vs. dense). They could be sparse vectors (as in the case of BM25) or they could be dense vectors. In the case of dense vectors, the dimensions of the vectors putatively capture some "latent semantic space". Thus, we often contrast sparse lexical vectors with dense semantic vectors.
  • Whether the representations are learned (supervised vs. unsupervised). In the case of BM25, there is no "learning" (modulo minor parameter tuning): the BM25 scoring function specifies the weights that each term should receive. BM25 was derived from a probabilistic model of retrieval and has evolved over many decades (see here), but it is not the product of a (supervised) machine-learning approach. Alternatively, document and query representations can be learned from large amounts of data. So, we have a contrast between unsupervised (or heuristic) representations and supervised (or learned) representations.

Indeed, dense retrieval models usually refer to encoders that learn to produce dense semantic representations using, you guessed it, transformer models! Where do the training data come from? Again, you got it: datasets such as MS MARCO.

So the picture for dense retrieval looks like this:

When people say "vector search" or "semantic search" these days, they're referring to the picture above. Often, the outputs of the encoders are called embedding vectors, or just embeddings for short.
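As a preview (you'll work with a real dense retrieval model in the next step), here's a hedged sketch of the picture above using the sentence-transformers library; the model name is just an illustrative choice, not necessarily the encoder you'll use later in the onboarding path.

from sentence_transformers import SentenceTransformer
import numpy as np

# An illustrative off-the-shelf bi-encoder; any dense encoder fits the same picture.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

doc_embedding = model.encode('Paula Deen and her brother Earl W. Bubba Hiers are being sued ...')
query_embedding = model.encode("what is paula deen's brother")

# Same comparison function as before: the inner product of the two representations.
print(np.dot(query_embedding, doc_embedding))

Note that the only thing that has changed relative to the BM25 picture is the encoder: the comparison function is still an inner product, and top-k retrieval over these dense vectors is typically handled by a data structure like HNSW instead of an inverted index.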

Again, to summarize: We often say that dense retrieval models generate learned dense representations. In contrast, BM25 generates unsupervised (or heuristic) sparse representations. To complete the possibilities, yes, there are learned sparse representations and unsupervised dense representations as well.

We'll save a more complete exploration of this design space for some other time, but they're sketched out in this article (on which this guide is based) if you want to learn more:

Jimmy Lin. A Proposed Conceptual Framework for a Representational Approach to Information Retrieval. arXiv:2110.01529, October 2021.

Okay, that's it for this lesson. Next, you're going to play with an actual dense retrieval model. Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.

Reproduction Log*