Retrieving speed for large set of documents #25

StalVars · 2022-07-13T09:33:40Z

I found retrieval very slow for ~20 million documents (Wikipedia). Is that expected?

nashid · 2022-08-25T03:13:30Z

@StalVars what latency did you get for ~400K documents? Also, what was the memory usage?

ramsey-coding · 2022-08-27T07:55:11Z

@StalVars @Witiko I found the library to be super slow during retrieval from ~350K samples. Please help 🙏 🙏 🙏

ramsey-coding · 2022-08-27T07:59:39Z

@dorianbrown the library is slow when retrieving from ~350K samples. Can you please advise what to do here?

Witiko · 2022-08-27T08:30:09Z

You may want to use a library such as Gensim, which builds a dictionary mapping from words to ids and then indexes the documents using a sparse matrix. This makes indexing slower, but retrieval is much faster than rank-bm25, because it can use fast matrix operations: piskvorky/gensim#3304

Alternatively, use industry-strength packages such as pyserini or elasticsearch. Rank-bm25 is not built for speed.
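To illustrate why a precomputed index speeds up retrieval, here is a minimal pure-Python sketch of the same idea using an inverted index (postings lists standing in for the columns of a sparse term-document matrix). This is not Gensim's or rank-bm25's code: the class name `InvertedIndexBM25`, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class InvertedIndexBM25:
    """Toy BM25 index (hypothetical): postings lists act like sparse matrix columns."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        # corpus: list of tokenized documents (each a list of strings)
        self.k1, self.b = k1, b
        self.n_docs = len(corpus)
        self.doc_len = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_len) / self.n_docs
        # term -> [(doc_id, term_frequency), ...], built once at indexing time
        self.postings = defaultdict(list)
        for doc_id, doc in enumerate(corpus):
            for term, tf in Counter(doc).items():
                self.postings[term].append((doc_id, tf))
        # Assumption: IDF with +1 inside the log so it is never negative.
        # rank_bm25 handles negative IDF values differently.
        self.idf = {
            term: math.log(1 + (self.n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
            for term, plist in self.postings.items()
        }

    def get_scores(self, query):
        scores = [0.0] * self.n_docs
        for term in query:
            # Only documents that actually contain the term are visited.
            for doc_id, tf in self.postings.get(term, ()):
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl)
                scores[doc_id] += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return scores


corpus = [["hello", "world"], ["fast", "retrieval"], ["hello", "fast", "index"]]
index = InvertedIndexBM25(corpus)
scores = index.get_scores(["retrieval"])
```

Indexing does more work up front, but each query then touches only the documents that contain a query term, instead of looping over the whole corpus.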

ramsey-coding · 2022-08-27T08:52:54Z

thanks @Witiko. What's the downside of using pyserini?

Also what's the use case of Rank-bm25? Ease of use?

Witiko · 2022-08-27T09:02:17Z

Pyserini is a Python binding for the Anserini Java library. You therefore need to have Java installed, which makes Pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.

Witiko · 2022-08-27T09:03:44Z

Rank-bm25 is a simple solution for use cases where speed is not a concern.

ramsey-coding · 2022-08-27T09:28:25Z

@Witiko thanks a lot for the feedback, really appreciate it 🙏

So there is no Python solution that is fast?

ramsey-coding · 2022-08-27T09:29:59Z

Sounds like I need to try Gensim. But if loading the documents is slow with Gensim, it's not a good fit for me either.

I need a response time under ~250 ms during retrieval.
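One way to check a budget like this is to time individual queries. The sketch below is library-agnostic: `search` is any callable you supply (for example, a wrapper around your index's scoring method), and the helper names `query_latencies_ms` and `within_budget` are hypothetical.

```python
import time


def query_latencies_ms(search, queries):
    """Time each call to search(query); return the latencies in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def within_budget(latencies, budget_ms=250.0):
    # Compare the worst observed latency against the budget.
    return max(latencies) <= budget_ms
```

Timing several representative queries and looking at the slowest one, rather than the average, gives a more conservative answer for a hard response-time limit.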

Witiko · 2022-08-27T09:36:08Z

Gensim is a pure Python solution that uses accelerated Python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:

pip install git+https://github.com/witiko/gensim.git@feature/bm25

See piskvorky/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.

ramsey-coding · 2022-08-27T09:59:27Z

Great. Would I get the same retrieval accuracy with Gensim as with rank_bm25?

ramsey-coding · 2022-08-27T10:19:00Z

@Witiko how would Gensim perform when loading the 500K documents? Would it be competitive with rank_bm25?

Witiko · 2022-08-27T13:36:15Z

The algorithm is exactly the same as in rank-bm25, so accuracy should also be the same. Loading may be slightly slower than in rank-bm25, because we need to build a dictionary, but retrieval should be significantly faster. Try it out and let me know.

ramsey-coding · 2022-08-27T21:32:15Z

@Witiko

> Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:
> pip install git+https://github.com/witiko/gensim.git@feature/bm25
> See RaRe-Technologies/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.

I ran the command pip install git+https://github.com/witiko/gensim.git@feature/bm25.

But it does not install and fails with the following error message:

      building 'gensim.models.nmf_pgd' extension
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/models/nmf_pgd.c -o build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -shared -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o -o build/lib.linux-x86_64-cpython-39/gensim/models/nmf_pgd.cpython-39-x86_64-linux-gnu.so
      building 'gensim.similarities.fastss' extension
      creating build/temp.linux-x86_64-cpython-39/gensim/similarities
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/similarities/fastss.c -o build/temp.linux-x86_64-cpython-39/gensim/similarities/fastss.o
      gensim/similarities/fastss.c: In function ‘ceditdist’:
      gensim/similarities/fastss.c:725:9: error: ‘for’ loop initial declarations are only allowed in C99 mode
               for (WIDTH tmpi = 0; tmpi <= len_s1; tmpi++) row2[tmpi] = tmpi;
               ^
      gensim/similarities/fastss.c:725:9: note: use option -std=c99 or -std=gnu99 to compile your code
      gensim/similarities/fastss.c:727:9: error: ‘for’ loop initial declarations are only allowed in C99 mode
               for (WIDTH i2 = 0; i2 < len_s2; i2++) {
               ^
      gensim/similarities/fastss.c:738:13: error: ‘for’ loop initial declarations are only allowed in C99 mode
                   for (WIDTH i1 = 0; i1 < len_s1; i1++) {
                   ^
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> gensim

Witiko · 2022-08-27T21:45:57Z

I am not sure what the issue is. It works fine in the python:3.7 Docker image.

ramsey-coding · 2022-08-27T21:49:01Z

@Witiko when I install with pip install --upgrade gensim, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.

Thanks for the feedback. I will try with python:3.7.

Witiko · 2022-08-27T21:52:00Z

> @Witiko when I install like pip install --upgrade gensim, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.

It's not an issue with the branch. The release version of gensim has a precompiled wheel, which circumvents your compiler.

jankovicsandras · 2024-10-07T11:49:54Z

A little tangential, but I found another interesting speed issue. I made a refactored, simplified version of BM25Okapi from rank_bm25: https://github.com/jankovicsandras/plpgsql_bm25/blob/main/mybm25okapi.py
It works without numpy (!), using just math, and precalculates values in __init__ to simplify the query function. Still, it's 2-4x faster than rank_bm25 in this quick-and-dirty test, so it might be possible to speed up rank_bm25 a lot.

https://github.com/jankovicsandras/plpgsql_bm25/blob/main/plpgsql_bm25_comparison_with_paradedb_pg_search.ipynb

E.g., it's possible to compute half of lines 119-120 in rank_bm25.py beforehand:

    score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                       (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
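A minimal sketch of that precomputation, written as standalone functions rather than as a patch to rank_bm25: the term `k1 * (1 - b + b * doc_len / avgdl)` depends only on the document, so it can be computed once per document instead of once per query term. The function names, the defaults `k1=1.5` and `b=0.75`, and the IDF values below are illustrative assumptions, not taken from the linked code.

```python
from collections import Counter


def naive_scores(query, doc_freqs, doc_lens, idf, k1=1.5, b=0.75):
    """Recomputes the length normalization for every (term, document) pair."""
    avgdl = sum(doc_lens) / len(doc_lens)
    scores = []
    for freqs, doc_len in zip(doc_freqs, doc_lens):
        score = 0.0
        for q in query:
            q_freq = freqs.get(q, 0)
            score += (idf.get(q) or 0) * (q_freq * (k1 + 1) /
                                          (q_freq + k1 * (1 - b + b * doc_len / avgdl)))
        scores.append(score)
    return scores


def precompute_norms(doc_lens, k1=1.5, b=0.75):
    """Per-document part of the denominator, computed once at indexing time."""
    avgdl = sum(doc_lens) / len(doc_lens)
    return [k1 * (1 - b + b * doc_len / avgdl) for doc_len in doc_lens]


def precomputed_scores(query, doc_freqs, norms, idf, k1=1.5):
    """Same scores as naive_scores, reusing the stored normalization terms."""
    scores = []
    for freqs, norm in zip(doc_freqs, norms):
        score = 0.0
        for q in query:
            q_freq = freqs.get(q, 0)
            score += (idf.get(q) or 0) * (q_freq * (k1 + 1) / (q_freq + norm))
        scores.append(score)
    return scores


docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c", "d"]]
doc_freqs = [Counter(doc) for doc in docs]
doc_lens = [len(doc) for doc in docs]
idf = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 1.2}  # made-up values for the example
norms = precompute_norms(doc_lens)
```

Both functions evaluate the same formula, so the scores should agree; only the amount of repeated arithmetic per query changes.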

dorianbrown · 2024-10-07T20:19:30Z

Sounds interesting, I haven't had a close look at the changes yet, but basically you're just precalculating all the static terms from the scoring function?

And I guess since the initial stuff was done with the math functions, vectorizing it with numpy would result in larger speed gains? Sounds feasible and promising, I'll try and double check the change when I've got some time, and run some tests. But feel free to create a PR if you're interested!

jankovicsandras · 2024-10-09T06:10:50Z

Thanks for your answer! 😊 I opened a PR: #46

jankovicsandras · 2024-10-24T09:45:11Z

I made an optimized rewrite with all 3 algorithms: https://github.com/jankovicsandras/bm25opt

Comparative testing shows it runs approximately 30-40x faster than rank_bm25 while producing exactly the same scores.

Witiko mentioned this issue Aug 29, 2022

Implement Okapi BM25 variants in Gensim piskvorky/gensim#3304

Merged
