high score with empty document string #543

jdongca2003 · 2024-03-15T14:56:00Z


When Qdrant is used to index vector embeddings, an empty document in the collection receives a surprisingly high similarity score.

In the example below, the empty document gets a cosine similarity score of 0.797. I assumed that an empty document would be embedded as a zero vector, in which case the cosine similarity score should be 0. Can you help?
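For reference, the expectation above can be written out as a minimal sketch in plain Python. The vectors are made-up three-dimensional values, not real embeddings, and returning 0.0 for a zero-norm vector is a convention assumed for this sketch (the ratio is mathematically undefined in that case).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 when either vector has zero norm. That is a convention
    chosen for this sketch, since the ratio is undefined there.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query = [0.6, 0.8, 0.0]        # made-up vector standing in for the query
zero_doc = [0.0, 0.0, 0.0]     # what the post assumes for an empty document
nonzero_doc = [0.8, 0.6, 0.0]  # any non-zero vector can score well above 0

print(cosine_similarity(query, zero_doc))
print(cosine_similarity(query, nonzero_doc))
```

The score is only 0 if the document vector really is all zeros; any non-zero vector is compared by direction alone, regardless of what text produced it.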

e.g.

from typing import List
import numpy as np
from qdrant_client import QdrantClient

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    "",
]

client = QdrantClient(":memory:")
client.set_model("BAAI/bge-small-en")
metadata = [{"source": "docs"} for _ in documents]
ids = list(range(len(documents)))

client.add(
    collection_name="demo_collection",
    documents=documents,
    metadata=metadata,
    ids=ids,
)

query = "Count the number of active residential customer"
search_result = client.query(
    collection_name="demo_collection",
    query_text=query,
    limit=5,
)
print(search_result)

Results:

[QueryResponse(id=8, embedding=None, metadata={'document': 'total active lines', 'source': 'docs'}, document='total active lines', score=0.8843179222400359),
 QueryResponse(id=4, embedding=None, metadata={'document': 'wireless customer', 'source': 'docs'}, document='wireless customer', score=0.8295176016136243),
 QueryResponse(id=1, embedding=None, metadata={'document': 'email address', 'source': 'docs'}, document='email address', score=0.8228306079924803),
 QueryResponse(id=2, embedding=None, metadata={'document': 'placeholder', 'source': 'docs'}, document='placeholder', score=0.8144248465983718),
 QueryResponse(id=9, embedding=None, metadata={'document': '', 'source': 'docs'}, document='', score=0.7972171966992909)]

joein (Collaborator) · 2024-03-15T15:27:43Z

Hi @jdongca2003

I don't think an empty string should necessarily produce a zero vector; that depends on the model you are using.

Could you please check whether you get a similar result with the original BAAI/bge-small-en model? (What I mean is: take the model from Hugging Face, compute the embeddings for your documents manually, and check whether the situation is the same.)
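One way to see why the vector need not be zero: BERT-style tokenizers, which the bge models are built on, wrap every input in special tokens such as `[CLS]` and `[SEP]`, so even an empty string reaches the model as a non-empty token sequence. The sketch below is a toy illustration only. `toy_tokenize`, `toy_embed`, and the three-dimensional vectors are hypothetical stand-ins, not the real tokenizer, model weights, or pooling strategy.

```python
import math
from typing import Dict, List


def toy_tokenize(text: str) -> List[str]:
    """Toy stand-in for a BERT-style tokenizer: special tokens are added
    around whatever text is given, including the empty string."""
    return ["[CLS]"] + text.lower().split() + ["[SEP]"]


# Made-up 3-dimensional token vectors; a real model learns these values.
TOY_VECTORS: Dict[str, List[float]] = {
    "[CLS]": [0.5, 0.1, -0.2],
    "[SEP]": [0.3, -0.4, 0.1],
}
UNKNOWN = [0.1, 0.2, 0.3]  # shared vector for every other token in this toy


def toy_embed(text: str) -> List[float]:
    """Mean-pool the toy token vectors and scale the result to unit length."""
    tokens = toy_tokenize(text)
    vectors = [TOY_VECTORS.get(tok, UNKNOWN) for tok in tokens]
    dim = len(UNKNOWN)
    pooled = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]


print(toy_tokenize(""))  # the empty string still yields two tokens
print(toy_embed(""))     # and therefore a non-zero, unit-length vector
```

Because the result is scaled to unit length, the empty input ends up as an ordinary point on the unit sphere, and its cosine similarity to a query depends only on where that point happens to lie.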


jdongca2003 (Author) · 2024-03-15T15:54:59Z

Thanks, joein, for the quick response. I checked the embedding vector of the empty document, and it is not a zero vector! Even so, this is still not good behavior.

from typing import List
import numpy as np
from fastembed import TextEmbedding
import json

documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    ""
]

embedding_model = TextEmbedding(model_name="BAAI/bge-small-en", max_length=512)

embeddings: List[np.ndarray] = list(
    embedding_model.passage_embed(documents)
)  # notice that we are casting the generator to a list


#print(embeddings[0].shape, len(embeddings))

query = "Count the number of active residential customer"
query_embedding = list(embedding_model.query_embed(query))[0]

def print_top_k(query_embedding, embeddings, documents, k=5):
    # use numpy to calculate the cosine similarity between the query and the documents
    # (a plain dot product equals cosine similarity only when the embeddings are unit-normalized)
    scores = np.dot(embeddings, query_embedding)
    for score, doc in zip(scores, documents):
        print(f'{doc}|score: {score}')
    # sort the scores in descending order
    sorted_scores = np.argsort(scores)[::-1]
    # print the top 5
    #for i in range(k):
    #    print(f"score: {scores[sorted_scores[i]]} Rank {i+1}: {documents[sorted_scores[i]]}")

print_top_k(query_embedding, embeddings, documents, k=5)

I calculated the cosine similarity scores directly:
|score: 0.7972172498703003
email address|score: 0.8228306174278259
placeholder|score: 0.814424991607666
|score: 0.7972172498703003
wireless customer|score: 0.8295177221298218
He died in 1597 at the age of 57|score: 0.7157479524612427
Maharana Pratap is considered a symbol of Rajput resistance against foreign rule|score: 0.7073748111724854
He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar|score: 0.7003960609436035

embedding vector for empty document:

dim = 384
[-2.53819916e-02 -5.44682052e-03 -5.09282853e-03 -1.49776395e-02
-1.08098146e-02 1.19938692e-02 1.92262717e-02 4.08581644e-02
-9.28279664e-03 1.56196468e-02 1.86153606e-03 -4.88135368e-02
6.96400367e-03 3.49483788e-02 3.50163616e-02 4.01080912e-03
3.18448767e-02 1.36998445e-02 -1.56665053e-02 1.64450370e-02
2.16239858e-02 -1.99406147e-02 1.17815230e-02 -1.80905703e-02
4.76054614e-03 2.72297114e-02 -5.90159511e-03 -8.18434451e-03
-4.85137738e-02 -1.91728160e-01 -3.33202034e-02 -1.37138087e-02
3.19078634e-03 -9.87244491e-03 -1.03822276e-02 -9.70588345e-03
-1.62116215e-02 1.38158510e-02 -1.09591316e-02 4.05766815e-02
2.16749441e-02 1.38471741e-02 -1.54241202e-02 -1.06100161e-02
5.69914840e-03 -2.26438437e-02 -1.67865120e-02 -6.69355411e-03
5.80454506e-02 -6.32909359e-03 2.05236953e-03 1.03720073e-02 ...


joein (Collaborator) · 2024-03-15T16:14:37Z

Could you elaborate on what exactly you mean by "not a good behavior"?


jdongca2003 (Author) · 2024-03-15T16:21:29Z

I mean that good scoring behavior would give an empty document a low score when the query is natural-language text.


joein (Collaborator) · 2024-03-15T17:21:31Z

Unfortunately, we can't do anything about this. Qdrant provides a way to operate on embeddings, but it can't change the embedding values.
Embedding values are determined by the model you've chosen.
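Since the values come from the model, one client-side option is to keep empty strings out of the collection altogether. A minimal sketch, assuming empty or whitespace-only documents carry nothing worth indexing; `drop_empty_documents` is a hypothetical helper written for illustration, not part of qdrant_client or fastembed:

```python
from typing import Any, Dict, List, Tuple


def drop_empty_documents(
    documents: List[str],
    metadata: List[Dict[str, Any]],
    ids: List[int],
) -> Tuple[List[str], List[Dict[str, Any]], List[int]]:
    """Keep only the entries whose document contains non-whitespace text.

    The three lists are filtered together so they stay aligned.
    """
    if not (len(documents) == len(metadata) == len(ids)):
        raise ValueError("documents, metadata and ids must have equal length")
    kept = [
        (doc, meta, doc_id)
        for doc, meta, doc_id in zip(documents, metadata, ids)
        if doc.strip()
    ]
    if not kept:
        return [], [], []
    docs, metas, kept_ids = (list(column) for column in zip(*kept))
    return docs, metas, kept_ids


documents = ["", "email address", "   ", "total active lines"]
metadata = [{"source": "docs"} for _ in documents]
ids = list(range(len(documents)))

documents, metadata, ids = drop_empty_documents(documents, metadata, ids)
print(documents, ids)
```

The filtered lists can then be passed to `client.add(...)` in the same way as in the original post. The original ids are preserved, so the remaining documents keep their identifiers.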


jdongca2003 (Author) · 2024-03-15T19:18:58Z

Thanks, joein. Your clarification is very reasonable.
