diff --git a/README.md b/README.md
index e9ccf9c..546ba84 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ of document IDs. This allows us to quickly find documents containing a
 given word. More specifically, for each term we save a postings list
 as follows:
 
-$$\text{n}\;|\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$
+$$\text{n}\\;|\\;(\text{id}_i, f_i, [p_0, \dots, p_m]), \dots$$
 
 Where $n$ is the number of documents the term appears in, id is the
 doc id, $f$ is the frequency, and $p_j$ are the positions where
@@ -34,8 +34,8 @@ The vocabulary is written on disk using prefix compression. The idea
 is to sort terms and then write them as "matching prefix length", and
 suffix. Here is an example with three words:
 
-$$\text{watermelon}\;\text{waterfall}\;\text{waterfront}$$
-$$0\;\text{watermelon}\;5\;\text{fall}\;6\;\text{ront}$$
+$$\text{watermelon}\\;\text{waterfall}\\;\text{waterfront}$$
+$$0\\;\text{watermelon}\\;5\\;\text{fall}\\;6\\;\text{ront}$$
 
 Spelling correction is used before answering queries. Given a word
 $w$, we use a trigram index to find terms in the vocabulary
@@ -44,8 +44,8 @@ We then select the one with the lowest [Levenshtein Distance](https://en.wikiped
 
 $$
 \text{lev}(a, b) = \begin{cases}
- |a| & \text{if}\;|b| = 0, \\
- |b| & \text{if}\;|a| = 0, \\
+ |a| & \text{if}\\;|b| = 0, \\
+ |b| & \text{if}\\;|a| = 0, \\
  1 + \text{min} \begin{cases}
  \text{lev}(\text{tail}(a), b) \\
  \text{lev}(a, \text{tail}(b)) \\
@@ -57,11 +57,11 @@ $$
 ### Query processing
 You can query the index with boolean or free test queries. In the first case you can use the usual boolean operators to compose a query, such as:
 
-$$\text{gun}\;\text{AND}\;\text{control}$$
+$$\text{gun}\\;\text{AND}\\;\text{control}$$
 
 In the second case, you just enter a phrase and receive a ranked collection of documents matching the query, ordered by [BM25 score](https://en.wikipedia.org/wiki/Okapi_BM25).
 
-$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$
+$$\text{BM25}(D, Q) = \sum_{i = 1}^{n} \\; \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \Big (1 - b + b \cdot \frac{|D|}{\text{avgdl}} \Big )}$$
 
 $$\text{IDF}(q_i) = \ln \Bigg ( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \Bigg )$$
 
diff --git a/misc/web.png b/misc/web.png
index 962fd2a..1eb1796 100644
Binary files a/misc/web.png and b/misc/web.png differ
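
Note on the prefix-compression hunk (`@@ -34,8 +34,8 @@`): the "matching prefix length" plus suffix scheme that the README describes can be sketched as below. This is a minimal illustration, not the project's on-disk format; the function names `common_prefix_len`, `front_encode`, and `front_decode` are hypothetical, and the terms are encoded in the order given so the output matches the README's three-word example (a real vocabulary would be sorted first, as the README says).

```python
from typing import List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest prefix shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(terms: List[str]) -> List[Tuple[int, str]]:
    """Encode each term as (prefix length shared with the previous term, suffix)."""
    encoded = []
    prev = ""  # the first term shares nothing, so it is stored whole
    for term in terms:
        k = common_prefix_len(prev, term)
        encoded.append((k, term[k:]))
        prev = term
    return encoded


def front_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the terms by reusing the first k characters of the previous term."""
    terms = []
    prev = ""
    for k, suffix in encoded:
        term = prev[:k] + suffix
        terms.append(term)
        prev = term
    return terms


words = ["watermelon", "waterfall", "waterfront"]
pairs = front_encode(words)
print(pairs)  # [(0, 'watermelon'), (5, 'fall'), (6, 'ront')]
```

Decoding is sequential, since each term depends on the one before it; `front_decode(pairs)` returns the original three words.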