In this repository, I explore the idea of concept search with Large Language Models.
Instead of an exact word match or an intelligent full-text search, I search the text for verses that represent a concept or idea.
In this experiment, the Bible is searched for the concept of seeking discomfort.
For this, I go through every verse and let a Large Language Model rate how well that verse represents the concept of seeking discomfort.
With this technique, I can also discover verses like Matthew 7:13-14:
Enter through the narrow gate. For wide is the gate and broad is the road that leads to destruction, and many enter through it. But small is the gate and narrow the road that leads to life, and only a few find it.
I think this verse represents the idea of seeking discomfort well, but no conventional intelligent search would be able to find it.
However, the main purpose of this repository is to experiment with Large Language Models and different APIs.
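To make the overall approach concrete, here is a minimal sketch of the rating loop. The names (`rate_verse`, `verses.txt`, `concept_search`) are hypothetical placeholders; the actual scripts in this repository use one provider-specific client per API.

```python
# Minimal sketch of the concept-search loop (hypothetical helper names).
# rate_verse() stands in for any of the API-specific clients described below.

def rate_verse(verse: str) -> int:
    """Ask an LLM how strongly `verse` represents 'seeking discomfort' (0-10)."""
    raise NotImplementedError  # replaced by an OpenAI / Vertex AI / Mistral / Ollama call

def concept_search(verses: list[str], top_k: int = 20) -> list[tuple[int, str]]:
    """Rate every verse and return the top_k highest-rated ones."""
    ratings = [(rate_verse(v), v) for v in verses]
    ratings.sort(key=lambda pair: pair[0], reverse=True)
    return ratings[:top_k]

if __name__ == "__main__":
    with open("verses.txt", encoding="utf-8") as f:  # hypothetical input file, one verse per line
        verses = [line.strip() for line in f if line.strip()]
    for score, verse in concept_search(verses):
        print(score, verse)
```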
Tested LLMs:
- Google Vertex AI:
  - PaLM 2
  - Gemini Pro
- OpenAI:
  - GPT-3.5
  - GPT-4
- Mistral.ai:
  - mistral-tiny (Mistral 7B)
  - mistral-small (Mixtral 8x7B)
  - mistral-medium
- local (with Ollama; example call below):
  - Mistral 7B
  - Llama 7B
  - Llama 13B
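For the local models, the Ollama REST API can be queried directly. A minimal sketch, assuming Ollama is running on its default port 11434 and the model has already been pulled:

```python
import requests

def rate_verse_ollama(verse: str, model: str = "mistral") -> str:
    """Ask a local Ollama model for a 0-10 rating of one verse (sketch)."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": (
                "Rate from 0 to 10 how much this Bible verse is about seeking discomfort. "
                "Answer with a single number only.\n\n" + verse
            ),
            "stream": False,
            "options": {"num_predict": 2},  # limit the answer to two tokens
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"].strip()
```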
Here is a comparison of the costs for the tested models:
The Google Cloud API gave the fastest answers.
For the local models run with Ollama, Llama 13B took the longest. The reason for this nonlinear time increase is that the 13B-parameter model is too big to fit into the 8 GB of VRAM on my GTX 1070, so some of the inference is delegated to RAM and CPU, which is much slower. The mistral-tiny result looks like an outlier compared to the other APIs Mistral.ai offers; subsequent tests show that the mistral-tiny API is usually much faster.
Below is a table of the top ratings sorted by mistral-medium, because based on the public stats it should be the most powerful model. In this view it looks like Llama 7B is the closest match to mistral-medium, but this is not true, as Llama 7B gives a rating of 8, 7, or 6 to every verse. Another interesting observation is that the ratings of mistral-tiny and Ollama Mistral 7B differ, even though they should be the same underlying model.
id | mistral medium | mistral small | mistral tiny | ollama llama13B | ollama llama7B | ollama mistral7B | gpt 3.5 | gemini pro | text bison |
---|---|---|---|---|---|---|---|---|---|
551 | 9 | 9 | 1 | 8 | 8 | 10 | 8 | 8 | 8 |
133 | 8 | 2 | 1 | 5 | 8 | 10 | 5 | 8 | 5 |
548 | 8 | 0 | 1 | -1 | 8 | 10 | 5 | 2 | 2 |
119 | 8 | 1 | 8 | 5 | 7 | 0 | 5 | 7 | 5 |
220 | 8 | 9 | 1 | -1 | 7 | 10 | 5 | 0 | 0 |
666 | 8 | 0 | 0 | -1 | 8 | 0 | 0 | 0 | 0 |
310 | 8 | 9 | 8 | 5 | 8 | 5 | 5 | 7 | 5 |
311 | 8 | 9 | 5 | 5 | 8 | 5 | 5 | 7 | 5 |
130 | 8 | 8 | 0 | 5 | 7 | 0 | -1 | 3 | 5 |
552 | 8 | 9 | 8 | 5 | 7 | 5 | 5 | 5 | 0 |
946 | 8 | 2 | 8 | 5 | 7 | 5 | 2 | 0 | 0 |
1010 | 8 | 0 | 1 | 5 | 7 | 10 | 0 | 0 | 0 |
777 | 8 | 2 | 1 | 5 | 7 | 10 | 2 | 0 | 0 |
384 | 8 | 1 | 0 | -1 | 7 | 0 | 2 | 0 | 0 |
659 | 8 | 2 | 8 | -1 | 7 | 2 | 2 | 0 | 0 |
807 | 8 | 2 | 1 | -1 | 7 | 10 | 2 | 2 | 5 |
646 | 8 | 8 | 1 | 8 | 8 | 5 | -1 | 2 | 0 |
638 | 8 | 9 | 8 | -1 | 8 | 8 | 8 | 2 | 3 |
118 | 8 | 2 | 1 | 5 | 8 | 10 | 5 | 2 | 5 |
440 | 8 | 8 | 8 | 5 | 8 | 0 | 8 | 0 | 0 |
Where the table shows -1 or less, the model failed to output a number within its first two tokens.
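One way such a fallback could be implemented (a sketch; the helper name `parse_rating` is hypothetical): try to interpret the start of the reply as an integer and fall back to -1 when that fails.

```python
def parse_rating(reply: str) -> int:
    """Parse the model's reply into a rating; return -1 if no leading number is found."""
    token = reply.strip().split()[0] if reply.strip() else ""
    try:
        return int(token)
    except ValueError:
        return -1  # the model did not answer with a number
```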
Here are a few things I have done to improve speed and reduce cost:
- Limit the output to just 2 tokens and tell the LLM to output only a number. This results in a zero-shot reasoning task.
- Use a system message to give the instructions to the LLM. The theory behind this is that the LLM can precompute part of the attention matrix (the KV cache) because the system message does not change. This only works if the LLM backend is set up to cache the precomputed attention matrix.
I use the following system prompt:
"""You are an expert in the understanding and interpretation of the English language. You will get a single verse of the bible as an input and respond only with a single number from 0 to 10, representing how much the given bible verse says about the concept of seeking discomfort. It is irrelevant if the verse suggests that seeking discomfort is good or bad. For every prompt you will answer with just one number (0 to 10). 0 represents the lowest correlation to the concept of seeking comfort or seeking discomfort and 10 represents the highest correlation to the concept of seeking comfort or seeking discomfort. Only answer with a single number. Do NOT give any explanation or context. """
I prime the model with "You are an expert in the understanding and interpretation of the English language" to make sure it focuses just on the interpretation of the verse and not on its prior knowledge of the Bible or Christianity.
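Put together, each request is just this fixed system prompt plus one verse as the user message, with the output capped at two tokens. A sketch of how such a request could be assembled (the variable names here are mine, not from the repository):

```python
SYSTEM_PROMPT = (
    "You are an expert in the understanding and interpretation of the English language. "
    "You will get a single verse of the bible as an input and respond only with a single "
    "number from 0 to 10, ..."  # full system prompt as quoted above
)

MAX_OUTPUT_TOKENS = 2  # the reply should only ever be a number from 0 to 10

def build_messages(verse: str) -> list[dict]:
    """Build a chat request: fixed system prompt (cacheable) + the verse as user message."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": verse},
    ]
```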
None of the tested models ever returned a score other than a multiple of 10.
I expected the model to first predict the first digit and then fine-tune its answer with the second token.
This may be because all multiples of 10 consist of just one token.
In fact, with OpenAI's tokenizer, all numbers from 0 to 999 consist of only one token. This may mean that it makes more sense to just score from 0 to 10.
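This is easy to verify with OpenAI's tiktoken library (a quick check, assuming the cl100k_base encoding used by GPT-3.5 and GPT-4):

```python
import tiktoken

# Check how many tokens each number from 0 to 999 needs with OpenAI's tokenizer.
enc = tiktoken.get_encoding("cl100k_base")
multi_token_numbers = [n for n in range(1000) if len(enc.encode(str(n))) > 1]
print(multi_token_numbers)  # [] -> every number from 0 to 999 is a single token
```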
To use Vertex AI, create a service account:
- Go to https://console.cloud.google.com/iam-admin/serviceaccounts?hl=en
- Add the Vertex AI User role
- Open the service account
- Select KEYS
- Create a new key
- The credentials.json file is then downloaded; save it somewhere
- Put the path to that file into the environment variable:

```python
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:/Git_Repos/try-out-vertex-ai-20aaf5516b7c.json'
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/mnt/c/Git_Repos/try-out-vertex-ai-20aaf5516b7c.json'
```
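With the credentials in place, a minimal Vertex AI call could look like this (a sketch using the vertexai Python SDK; the project ID, region, and prompt are placeholders):

```python
import vertexai
from vertexai.language_models import TextGenerationModel

# Assumes GOOGLE_APPLICATION_CREDENTIALS is set as shown above.
vertexai.init(project="your-project-id", location="us-central1")  # placeholders

model = TextGenerationModel.from_pretrained("text-bison")  # PaLM 2
response = model.predict(
    "Rate from 0 to 10 how much this Bible verse is about seeking discomfort. "
    "Answer with a single number only.\n\nEnter through the narrow gate.",
    max_output_tokens=2,
    temperature=0.0,
)
print(response.text)
```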
Get the API key from https://platform.openai.com/api-keys and then add it to your environment:

```python
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
```
Google's safety filters are annoying!
Google safety settings:
https://ai.google.dev/tutorials/python_quickstart#safety_settings
You can remove the blocking by adjusting the safety settings:
https://stackoverflow.com/questions/77723993/gemini-pro-api-blocking-replies
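For example, with the google-generativeai package (the one used in the quickstart linked above, which authenticates with an API key rather than the service account), the blocking can be relaxed roughly like this; which categories you actually want to unblock is up to you:

```python
import os

import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes the key is in the environment

model = genai.GenerativeModel("gemini-pro")
verse = "Enter through the narrow gate."  # example input
response = model.generate_content(
    verse,
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)
print(response.text)
```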
I get rate-limited on PaLM 2 and Gemini Pro:
a maximum of 60 requests per minute
https://cloud.google.com/vertex-ai/docs/generative-ai/quotas-genai?hl=en
I could request a higher quota;
in the meantime, the limit was lifted.
On Gemini Pro requests, I randomly get an annoying gRPC error.
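Both the rate limit and the sporadic gRPC errors can be worked around with a simple retry-and-backoff wrapper (a generic sketch, not specific to any of the SDKs above):

```python
import random
import time

def call_with_retries(request_fn, max_attempts: int = 5):
    """Call request_fn(); on failure, wait with exponential backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as error:  # e.g. quota errors or transient gRPC failures
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt + random.random()  # 1-2 s, 2-3 s, 4-5 s, ...
            print(f"Request failed ({error}); retrying in {delay:.1f} s")
            time.sleep(delay)
```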
Llama 7B was better at using my predefined response template of only responding with the number.
Llama 13B, however, tried to answer in a full sentence, ignoring my request.
The runtime comparison graph shows that, for all the APIs tested, the bigger models are faster.
Gemini Pro is faster than PaLM 2, GPT-4 is faster than GPT-3.5, and mistral-medium is faster than mistral-small. This is counterintuitive because, on the same hardware, bigger models should take longer to compute.
This suggests that the companies put more powerful graphics cards and more compute behind their bigger models.