
add retrieval example #6193

Merged: 21 commits into ggerganov:master on Mar 25, 2024

Conversation
Conversation

@mscheong01 (Collaborator) commented Mar 21, 2024

resolves #5692

  • added retrieval example
  • added three llama parameters:
    • --context-files: multiple files to be embedded
    • --chunk-size: minimum size of each text chunk to be embedded
    • --chunk-separator: STRING to divide chunks by (newline by default)
      I don't think I put enough thought into naming these parameters, so proposals are welcome

The new retrieval example can be tested as below:

make -j && ./retrieval --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator . -ub 2048

which chunks & embeds all given files and starts a loop requesting query inputs:

Enter query: 

On query input, the top-k chunks are shown along with the file name, chunk position within the file, and the original text:

Enter query: describe the mit license
batch_decode: n_tokens = 6, n_seq = 1
Top 3 similar chunks:
filename: README.md
filepos: 119
similarity: 0.762334
textdata:
png)

[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)

[Roadmap](https://github.
--------------------
filename: License
filepos: 0
similarity: 0.725146
textdata:
MIT License

Copyright (c) 2023 Georgi Gerganov

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
--------------------
filename: README.md
filepos: 9178
similarity: 0.621722
textdata:
com/cztomsik/ava) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal)
- [pythops/tenere](https://github.
--------------------
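
For reference, here is a minimal sketch in plain C++ of how the top-k ranking shown above can be computed from the chunk and query embeddings. The struct and function names are illustrative, not the ones used in retrieval.cpp, and the embedding field is assumed to have been filled by the model beforehand.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// illustrative data layout: one embedded chunk per file segment
struct text_chunk {
    std::string        filename;
    size_t             filepos;
    std::string        textdata;
    std::vector<float> embedding;
};

// plain cosine similarity between two embedding vectors of equal length
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i]*b[i];
        na  += a[i]*a[i];
        nb  += b[i]*b[i];
    }
    return dot / (std::sqrt(na)*std::sqrt(nb) + 1e-6f);
}

// score every chunk against the query embedding and print the k best matches
static void print_top_k(const std::vector<text_chunk> & chunks, const std::vector<float> & query_emb, int top_k) {
    std::vector<std::pair<float, size_t>> scored;
    for (size_t i = 0; i < chunks.size(); ++i) {
        scored.emplace_back(cosine_similarity(chunks[i].embedding, query_emb), i);
    }
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<float, size_t> & a, const std::pair<float, size_t> & b) { return a.first > b.first; });
    printf("Top %d similar chunks:\n", top_k);
    for (int i = 0; i < top_k && i < (int) scored.size(); ++i) {
        const text_chunk & c = chunks[scored[i].second];
        printf("filename: %s\nfilepos: %zu\nsimilarity: %f\ntextdata:\n%s\n--------------------\n",
               c.filename.c_str(), c.filepos, scored[i].first, c.textdata.c_str());
    }
}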

+) I'd like opinions on two subjects regarding the chunking functionality (the current single-separator behavior is sketched after this list):

  • Should '\n' (newline) always work as a chunk separator?
  • Should we support multiple custom chunk separators? (e.g. --chunk-separator . 。 chunks the input by either . or 。)
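
A minimal sketch of the current single-separator chunking behavior, to make the questions above concrete. The function name and exact boundary handling are illustrative, not the exact retrieval.cpp implementation; a single non-empty separator string is assumed.

#include <string>
#include <vector>

// accumulate separator-delimited pieces until a chunk reaches at least
// chunk_size characters, then start a new chunk; the trailing remainder
// becomes the last chunk
static std::vector<std::string> chunk_by_separator(const std::string & text, const std::string & separator, size_t chunk_size) {
    std::vector<std::string> chunks;
    if (separator.empty()) {
        chunks.push_back(text); // no separator: treat the whole text as one chunk
        return chunks;
    }
    std::string current;
    size_t start = 0;
    while (start < text.size()) {
        const size_t pos  = text.find(separator, start);
        const size_t stop = (pos == std::string::npos) ? text.size() : pos + separator.size();
        current += text.substr(start, stop - start);
        if (current.size() >= chunk_size) {
            chunks.push_back(current);
            current.clear();
        }
        start = stop;
    }
    if (!current.empty()) {
        chunks.push_back(current);
    }
    return chunks;
}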

+) (help wanted)
I have an ongoing problem right now where the assertion below fails when the -ub 2048 option is not given. It seems related to the recent parallelism support, which I haven't had the chance to look into deeply, so any advice on how to solve this is very welcome.

GGML_ASSERT: llama.cpp:9031: (cparams.causal_attn || cparams.n_ubatch >= n_tokens_all) && "non-causal attention requires n_ubatch >= n_tokens"

@ngxson (Collaborator) left a comment

For the question about batch and u_batch, let's ask @ggerganov

common/common.cpp: 3 review threads (outdated, resolved)
examples/retrieval/retrieval.cpp: 5 review threads (outdated, resolved)

for (auto & chunk : chunks) {
    auto inp = ::llama_tokenize(ctx, chunk.textdata, true, false);
    if (inp.size() > n_batch) {
        inp.resize(n_batch);
Collaborator

I'm not quite sure about this line. Will it drop tokens if inp.size() > n_batch? If so, we would be dropping information from the embedding input, which would make the output embedding inaccurate.

Collaborator Author

I guess we'll have to restrict the chunk size to fit the n_batch parameter somehow. This is the same issue as setting a configurable maximum chunk length: how should we handle situations where no separator appears within the maximum length?

@ngxson (Collaborator) commented Mar 22, 2024

Or maybe the reverse: we can require the user to provide a sufficient n_batch (and n_ubatch) before running the app. AFAIK that's because, for embeddings, we use non-causal models that require the prompt to be processed in one batch.

My plan: maybe after tokenizing all chunks, you can check whether any chunk has tokens.size() > n_batch, then raise an error and exit the program.
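
A minimal sketch of that check, assuming the chunks, ctx and n_batch variables that already exist in retrieval.cpp; the tokens field used to keep the result is hypothetical.

// tokenize every chunk up front and refuse to continue if any of them does
// not fit into a single batch
for (auto & chunk : chunks) {
    auto inp = ::llama_tokenize(ctx, chunk.textdata, true, false);
    if ((int) inp.size() > n_batch) {
        fprintf(stderr, "error: chunk of %d tokens exceeds batch size %d; increase --batch-size or reduce --chunk-size\n",
                (int) inp.size(), n_batch);
        exit(1);
    }
    chunk.tokens = inp; // hypothetical field holding the tokenized chunk for later embedding
}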

Collaborator Author

you can check whether any chunk has tokens.size() > n_batch, then raise an error and exit the program.

This seems like the best option for now 👍

Owner

Doesn't it make more sense to determine params.n_batch based on the largest chunk after tokenization? It should not be a user-provided parameter in this example - just set it to the largest chunk size
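
A sketch of that suggestion, assuming the chunks can be tokenized before the final context is created (which is exactly the concern raised in the reply below):

// size the batch from the data instead of taking it from the user:
// track the largest tokenized chunk and use it for both n_batch and n_ubatch
int n_max_tokens = 0;
for (const auto & chunk : chunks) {
    const auto inp = ::llama_tokenize(ctx, chunk.textdata, true, false);
    if ((int) inp.size() > n_max_tokens) {
        n_max_tokens = (int) inp.size();
    }
}
params.n_batch  = n_max_tokens;
params.n_ubatch = n_max_tokens; // non-causal embedding models need each chunk in a single ubatch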

Collaborator

Yeah, that could be a solution. I'm just not sure whether it requires re-creating the llama_context, because the prior llama_init_from_gpt_params call has already used params.n_batch and params.n_ubatch.

@ngxson (Collaborator) commented Mar 21, 2024

Also, maybe you can look at the implementation of langchain's CharacterTextSplitter, which also supports overlapping chunks.

We can use a single separator for now, assuming that the input dataset is single-language.

However, I think overlapping chunks are quite important, as they yield better results in RAG.
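
Not part of this PR, but for reference, a character-level sketch of overlapping chunks in the spirit of langchain's CharacterTextSplitter; the function name and parameters are illustrative.

#include <string>
#include <vector>

// each chunk starts (chunk_size - overlap) characters after the previous one,
// so neighbouring chunks share `overlap` characters of context
static std::vector<std::string> chunk_with_overlap(const std::string & text, size_t chunk_size, size_t overlap) {
    std::vector<std::string> chunks;
    if (chunk_size == 0 || overlap >= chunk_size) {
        return chunks; // invalid configuration
    }
    const size_t step = chunk_size - overlap;
    for (size_t start = 0; start < text.size(); start += step) {
        chunks.push_back(text.substr(start, chunk_size));
        if (start + chunk_size >= text.size()) {
            break; // the last chunk already reaches the end of the text
        }
    }
    return chunks;
}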

@mscheong01 (Collaborator Author)

Hey @ngxson, thanks for the feedback! 😄 Just pushed some quick updates before hitting the hay. I'll get to the rest tomorrow.

Also, maybe you can look at the implementation of langchain's CharacterTextSplitter, which also supports overlapping chunks.
We can use a single separator for now, assuming that the input dataset is single-language.
However, I think overlapping chunks are quite important, as they yield better results in RAG.

Totally agree with aiming for a comprehensive RAG example down the line. But for now, based on @ggerganov's insights in #5692, it seemed best to keep things simple.

@ngxson (Collaborator) commented Mar 21, 2024

I mentioned CharacterTextSplitter because the split behavior we have now is very similar to it. I agree that having overlap is not very important at this stage, but I still consider it simple yet effective, so maybe we'll add it in another PR.

I'm also currently working on a retrieval example in wllama. During implementation, I noticed that some models need a prefix for document embedding and query embedding. It may not be very important at this stage, but I'm listing it here so we don't forget.

For example, with bge models, you need to add Represent this sentence for searching relevant passages: to the query (no prefix is needed for documents). The nomic model requires search_query: for queries and search_document: for documents. There may also be differences in BOS/EOS/CLS token placement, but I haven't had time to look into it.
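
A rough sketch of what that prefix handling could look like, using the prefixes quoted above; the model_type switch is purely illustrative and would in practice have to come from a user-provided option.

#include <string>

// prepend a model-specific instruction before embedding
static std::string with_query_prefix(const std::string & query, const std::string & model_type) {
    if (model_type == "bge")   { return "Represent this sentence for searching relevant passages: " + query; }
    if (model_type == "nomic") { return "search_query: " + query; }
    return query;
}

static std::string with_document_prefix(const std::string & doc, const std::string & model_type) {
    if (model_type == "nomic") { return "search_document: " + doc; }
    return doc; // bge documents need no prefix, per the note above
}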

Also, here is my test dataset if you need it. It's just the wiki page of Albert Einstein, but the data is quite varied, so it will be a very realistic test case.

@mscheong01 requested a review from ngxson on March 23, 2024 05:33
@phymbert (Collaborator)

For the question about batch and u_batch, let's ask @ggerganov

For embeddings, --ubatch-size must be greater than bert.context_length, and --batch-size must equal --ubatch-size; see the server embeddings feature.

@mscheong01 (Collaborator Author) commented Mar 23, 2024

For the question about batch and u_batch, let's ask @ggerganov

For embeddings, --ubatch-size must be greater than bert.context_length, and --batch-size must equal --ubatch-size; see the server embeddings feature.

Thanks for the information 🙇‍♂️. Since batch_size defaults to 2048, which is larger than the ubatch_size default of 512, I'll set ubatch_size equal to batch_size right from the start so that we don't have to explicitly provide the option every time.
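
In code, the change described above is roughly the following (a sketch, not the exact diff):

// after parsing the command-line arguments into `params`:
// make the physical batch size match the logical batch size so the non-causal
// embedding model can process each chunk in a single ubatch
params.n_ubatch = params.n_batch; // n_batch defaults to 2048, n_ubatch to 512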

+) IIUC, we should apply the same approach to the embedding example, right? 🤔

@ngxson (Collaborator) commented Mar 23, 2024

Yes, the embedding.cpp example is a bit outdated I think; for now let's set n_ubatch == n_batch for simplification.

@mscheong01 (Collaborator Author)

I'll get the embedding example fixed likewise after this is merged 😃

common/common.h: 1 review thread (outdated, resolved)
@ggerganov merged commit 64e7b47 into ggerganov:master on Mar 25, 2024
50 of 55 checks passed
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* add `retrieval` example

* add README

* minor fixes

* cast filepos on print

* remove use of variable sized array

* store similarities in separate vector

* print error on insufficient batch size

* fix error message printing

* assign n_batch value to n_ubatch

* fix param definitions

* define retrieval-only parameters in retrieval.cpp

* fix `--context-file` option to be provided multiple times for multiple files

* use vector for `query_emb`

* add usage description in README

* fix merge conflict

* fix usage printing

* remove seed setting

* fix lint

* increase file read buffer size

* retrieval : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024