high score with empty document string #543
Replies: 6 comments
-
Hi @jdongca2003, I don't think an empty string should necessarily map to a zero vector; it depends on the model you are using. Could you please check whether you get a similar result with the original BAAI/bge-small-en model? (What I mean is: take the model from Hugging Face, compute the embeddings for your documents manually, and check whether the situation is the same.)
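The manual check suggested above could look roughly like the sketch below. It assumes the `transformers` and `torch` packages are installed and uses CLS-token pooling followed by L2 normalization, as commonly shown for BGE models; whether the Qdrant client's embedding path pools in exactly the same way is an assumption, so small score differences are possible. The helper names `embed_cls`, `dot`, and `score_documents` are hypothetical. Note that a BERT-style tokenizer still emits its special tokens ([CLS], [SEP]) for an empty string, so the model has input to encode and the output is generally not a zero vector.

```python
from typing import List, Sequence


def embed_cls(texts: List[str], model_name: str = "BAAI/bge-small-en") -> List[List[float]]:
    """Embed texts with CLS pooling and L2 normalization (assumed pooling scheme)."""
    # Imported lazily so the pure-Python helpers below work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # An empty string is still tokenized to the special tokens only.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)

    cls = output.last_hidden_state[:, 0]                  # first token of each sequence
    cls = torch.nn.functional.normalize(cls, p=2, dim=1)  # unit-length vectors
    return cls.tolist()


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def score_documents(query: str, documents: List[str]) -> List[float]:
    """Score every document against the query, in input order."""
    vectors = embed_cls([query] + documents)
    query_vector = vectors[0]
    return [dot(query_vector, v) for v in vectors[1:]]
```

Calling `score_documents("Count the number of active residential customer", ["", "total active lines"])` downloads the model on first use and returns one score per document, which can then be compared with the scores reported by `client.query`.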
-
Thanks, Joein, for the quick response. I checked the embedding vector of the empty document. It is not a zero vector! But this is still not good behavior.
I calculated the cosine similarity score directly. Embedding vector for the empty document: dim = 384
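For reference, a direct cosine similarity calculation can be sketched as below. This is a minimal pure-Python illustration on toy vectors, not the actual 384-dimensional embeddings from this thread, and the function name `cosine_similarity` is hypothetical. It also makes one point explicit: for a true zero vector the cosine is undefined (a division by zero) rather than 0, so the sketch returns `None` in that case.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> Optional[float]:
    """Return cos(a, b), or None when either vector has zero norm (undefined)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # 0 / 0 is undefined, so a zero vector has no meaningful cosine score.
    if norm_a == 0.0 or norm_b == 0.0:
        return None
    return dot / (norm_a * norm_b)
```

Any non-zero vector, including the one a model produces for an empty string, yields an ordinary score from this function, and that score can be high if the vector happens to point in a direction close to the query.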
-
Could you elaborate on |
-
I mean that good scoring behavior would give an empty document a low score when the query is natural-language text.
-
Unfortunately, we can't do anything about this. Qdrant provides a way to operate on embeddings; it can't change the embedding values themselves.
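Since the scores follow from whatever vectors the model produces, one application-side option is to skip empty documents before they are indexed. The sketch below is only an illustration under that assumption, not a Qdrant feature; the helper name `drop_empty_documents` is hypothetical, and it treats whitespace-only strings as empty.

```python
from typing import Any, Dict, List, Tuple


def drop_empty_documents(
    documents: List[str],
    metadata: List[Dict[str, Any]],
    ids: List[int],
) -> Tuple[List[str], List[Dict[str, Any]], List[int]]:
    """Remove empty or whitespace-only documents, keeping metadata and ids aligned."""
    kept = [
        (doc, meta, doc_id)
        for doc, meta, doc_id in zip(documents, metadata, ids)
        if doc.strip()
    ]
    if not kept:
        return [], [], []
    docs, metas, kept_ids = (list(column) for column in zip(*kept))
    return docs, metas, kept_ids
```

The three returned lists can be passed to `client.add(...)` in place of the unfiltered `documents`, `metadata`, and `ids` from the original post; the surviving documents keep their original ids.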
-
Thanks, joein. Your clarification is very reasonable.
-
When Qdrant is used for vector embedding indexing, an empty document in the collection gets a strangely high similarity score.
Here we can observe that the empty document achieves a cosine similarity score of 0.797. If the document is empty, I assume that a zero vector is used, so the cosine similarity score should be 0. Can you help?
Example:
from typing import List

import numpy as np
from qdrant_client import QdrantClient

# Sample collection; note the empty strings at ids 0, 3, and 9.
documents: List[str] = [
    "",
    "email address",
    "placeholder",
    "",
    "wireless customer",
    "He died in 1597 at the age of 57",
    "Maharana Pratap is considered a symbol of Rajput resistance against foreign rule",
    "He had 11 wives and 17 sons, including Amar Singh I who succeeded him as ruler of Mewar",
    "total active lines",
    "",
]

# In-memory Qdrant instance with the embedding model set on the client.
client = QdrantClient(":memory:")
client.set_model("BAAI/bge-small-en")

metadata = [{"source": "docs"} for _ in documents]
ids = list(range(len(documents)))

client.add(
    collection_name="demo_collection",
    documents=documents,
    metadata=metadata,
    ids=ids,
)

query = "Count the number of active residential customer"
search_result = client.query(
    collection_name="demo_collection",
    query_text=query,
    limit=5,
)
print(search_result)
Results:
[
    QueryResponse(id=8, embedding=None, metadata={'document': 'total active lines', 'source': 'docs'}, document='total active lines', score=0.8843179222400359),
    QueryResponse(id=4, embedding=None, metadata={'document': 'wireless customer', 'source': 'docs'}, document='wireless customer', score=0.8295176016136243),
    QueryResponse(id=1, embedding=None, metadata={'document': 'email address', 'source': 'docs'}, document='email address', score=0.8228306079924803),
    QueryResponse(id=2, embedding=None, metadata={'document': 'placeholder', 'source': 'docs'}, document='placeholder', score=0.8144248465983718),
    QueryResponse(id=9, embedding=None, metadata={'document': '', 'source': 'docs'}, document='', score=0.7972171966992909),
]