add `retrieval` example #6193

mscheong01 · 2024-03-21T08:10:47Z

resolves #5692

added retrieval example
added three llama parameters:
- --context-files: multiple files that are to-be embedded
- --chunk-size: minimum size of each text chunk to be embedded
- --chunk-separator: STRING to divide chunks by. newline by default
  I don't think I put enough thought into naming these parameters, so proposals are welcome

the new retrieval example can be tested as below

make -j && ./retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator . -ub 2048

which chunks & embeds all given files and starts a loop requesting query inputs:

Enter query:

and on query input, top k chunks are shown along with file name, chunk position within file and original text

Enter query: describe the mit license
batch_decode: n_tokens = 6, n_seq = 1
Top 3 similar chunks:
filename: README.md
filepos: 119
similarity: 0.762334
textdata:
png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)

[Roadmap](https://github.
--------------------
filename: License
filepos: 0
similarity: 0.725146
textdata:
MIT License

Copyright (c) 2023 Georgi Gerganov

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
--------------------
filename: README.md
filepos: 9178
similarity: 0.621722
textdata:
com/cztomsik/ava) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal)
- [pythops/tenere](https://github.
--------------------

+)
I'd like opinion on two subjects regarding the chunking functionality

Should '\n'(newline) always work as a chunk separator?
Should we support multiple custom chunk separators? (ex: --chunk-separator . 。 chunks input by either . or 。)

+) (help wanted)
I have an ongoing problem rn where the below assertion fails when -ub 2048 option is not given. It seems related to the recent parallelism support which I haven't had the chance to look deeply into, so any advice about how to solve this is very welcome.

GGML_ASSERT: llama.cpp:9031: (cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens"

ngxson

For the question about batch and u_batch, let's ask @ggerganov

common/common.cpp

examples/retrieval/retrieval.cpp

ngxson · 2024-03-21T11:14:29Z

examples/retrieval/retrieval.cpp

+    for (auto & chunk : chunks) {
+        auto inp = ::llama_tokenize(ctx, chunk.textdata, true, false);
+        if (inp.size() > n_batch) {
+            inp.resize(n_batch);


I'm not quite sure about this line. Will it drops tokens if inp.size() > n_batch? If that's the case, that means we drop information from embedding, which will cause the output embedding to become inaccurate.

I guess we'll have to restrict the chunk size to fit the n_batch parameter somehow. This presents the identical issue to setting a configurable maximum chunk length: How should we handle situations where separators fail to appear within the maximum length?

Or maybe reversed, we can force the user to give sufficient n_batch (and n_ubatch) before running the app. AFAIK that's because with embeddings, we use non-causal models the requires prompt to be processed in one batch.

My plan is maybe after tokenize all chunks, you can see if any chunk have tokens.size() > n_batch, then raise an error and exit the program.

you can see if any chunk have tokens.size() > n_batch, then raise an error and exit the program.

This seems like the best option for now 👍

Doesn't it make more sense to determine params.n_batch based on the largest chunk after tokenization? It should not be a user-provided parameter in this example - just set it to the largest chunk size

Yeah that can be a solution. I'm just not sure if that requires re-creating a new llama_context, because the prior llama_init_from_gpt_params call already used params.n_batch and params.n_ubatch

ngxson · 2024-03-21T11:20:31Z

Also maybe you can look at the implementation of langchain's CharacterTextSplitter, which also support overlap chunks.

We can use single separator for now, assuming that the input dataset is single-language.

However the overlap chunk is quite important I think, as it yield a better result in RAG.

mscheong01 · 2024-03-21T16:42:16Z

Hey @ngxson, thanks for the feedback! 😄 Just pushed some quick updates before hitting the hay. I'll get to the rest tomorrow.

Also maybe you can look at the implementation of langchain's CharacterTextSplitter, which also support overlap chunks.
We can use single separator for now, assuming that the input dataset is single-language.
However the overlap chunk is quite important I think, as I yield a better result in RAG.

Totally agree with aiming for a comprehensive RAG example down the line. But for now, based on @ggerganov's insights in #5692, it seemed best to keep things simple.

ngxson · 2024-03-21T17:48:41Z

I mentioned CharacterTextSplitter because the split behavior that we're having now is very similar to that. I agree that having overlap is not very important at this stage, but I still consider it to be a simple thing but yet effective, so maybe we'll add this in another PR.

I'm also currently working on a retrieval example on wllama. During implementation, I noticed that some models need prefix for document embedding and query embedding. Maybe not really important to care about at this stage, but I just list it here so we don't forget.

For example, with bge models, you need to add Represent this sentence for searching relevant passages: to the query (no need for document). nomic model requires search_query: for query and search_document: for document. There may also be differences in BOS/EOS/CLS tokens placement, but I haven't got time to look into.

Also here is my test dataset if you need. It's just wiki page of Albert Einstein, but the data is quite varied, so that will be a very real-life test case.

phymbert · 2024-03-23T11:34:26Z

For the question about batch and u_batch, let's ask @ggerganov

For embeddings, --ubatch-size must be greater than bert.context_length and --batch-size equals to --ubatch-size, see server embbedings feature.

mscheong01 · 2024-03-23T12:36:55Z

For the question about batch and u_batch, let's ask @ggerganov

For embeddings, --ubatch-size must be greater than bert.context_length and --batch-size equals to --ubatch-size, see server embbedings feature.

Thanks for the information 🙇‍♂️ , since batch_size defaults to 2048, which is larger than ubatch_size at 512, I'll set ubatch_size equal to batch_size right from the start so that we don't have explicitly provide the option every time.

+) IIUC, we should apply the same approach to the embedding example, right? 🤔

ngxson · 2024-03-23T13:07:38Z

Yes, the embedding.cpp example is a bit out-dated I think, for now let's set n_ubatch == n_batch for simplification.

mscheong01 · 2024-03-23T13:19:01Z

I'll get the embedding example fixed likewise after this is merged 😃

common/common.h

examples/retrieval/retrieval.cpp

examples/retrieval/README.md

examples/retrieval/retrieval.cpp

…e files

* add `retrieval` example * add README * minor fixes * cast filepos on print * remove use of variable sized array * store similarities in separate vector * print error on insufficient batch size * fix error message printing * assign n_batch value to n_ubatch * fix param definitions * define retrieval-only parameters in retrieval.cpp * fix `--context-file` option to be provided multiple times for multiple files * use vector for `query_emb` * add usage description in README * fix merge conflict * fix usage printing * remove seed setting * fix lint * increase file read buffer size * retrieval : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

add retrieval example

eb760a9

mscheong01 force-pushed the add-`retrieval`-example branch from 3f7be5c to eb760a9 Compare March 21, 2024 08:15

add README

50f9967

ngxson requested changes Mar 21, 2024

View reviewed changes

minor fixes

751787e

mscheong01 added 4 commits March 22, 2024 13:08

cast filepos on print

c1e5575

remove use of variable sized array

d33a015

store similarities in separate vector

208e1f0

print error on insufficient batch size

789a194

mscheong01 requested a review from ngxson March 23, 2024 05:33

fix error message printing

7d819d0

mscheong01 added 2 commits March 23, 2024 21:41

assign n_batch value to n_ubatch

e16279e

fix param definitions

f9dc033

ngxson reviewed Mar 23, 2024

View reviewed changes

common/common.h Outdated Show resolved Hide resolved

This was referenced Mar 23, 2024

server: docs: --threads and --threads, --ubatch-size, --log-disable #6254

Merged

server: exit failure if --embedding is set with an incoherent --ubatch-size #6263

Open

define retrieval-only parameters in retrieval.cpp

56b7db9