
feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever #3356

Merged · 18 commits into main · Oct 14, 2022
Conversation

vblagoje (Member)

Related Issues

Proposed Changes:

Added OpenAIEmbeddingEncoder as a method to create document and query embeddings.

How did you test it?

Added a unit test; the OpenAI API key needs to be injected into unit tests (as a secret).

Notes for the reviewer

LMK if anything is unclear

Checklist

@vblagoje vblagoje requested a review from a team as a code owner October 10, 2022 13:25
@vblagoje vblagoje requested review from bogdankostic and removed request for a team October 10, 2022 13:25

@bogdankostic (Contributor) left a comment:

This already looks pretty good; I left a few comments with some possible improvements.

Comment on lines 230 to 233
if isinstance(document_store, WeaviateDocumentStore):
# Weaviate sets the embedding dimension to 768 as soon as it is initialized.
# We need 1024 here and therefore initialize a new WeaviateDocumentStore.
document_store = WeaviateDocumentStore(index="haystack_test", embedding_dim=1024, recreate_index=True)

bogdankostic (Contributor):

I think this is not needed as we specify to use only InMemoryDocumentStore in the test parameters.

Comment on lines 254 to 257
if isinstance(document_store, WeaviateDocumentStore):
# Weaviate sets the embedding dimension to 768 as soon as it is initialized.
# We need 1024 here and therefore initialize a new WeaviateDocumentStore.
document_store = WeaviateDocumentStore(index="haystack_test", embedding_dim=1024, recreate_index=True)

bogdankostic (Contributor):

Same here.

self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

def ensure_texts_limit(self, text: str):

bogdankostic (Contributor):

I know we are inside a private class here; still, I'd make this method private, as it's not supposed to be used outside of that class.
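
The truncation pattern under discussion can be sketched independently of the OpenAI encoder. Here `SimpleTokenizer` is a hypothetical whitespace stand-in for the GPT-2 tokenizer used in the PR, just to illustrate the tokenize-slice-decode shape of the method:

```python
# Sketch of the truncation idea. A whitespace tokenizer stands in for
# transformers' GPT-2 tokenizer (an assumption made here to keep the
# example self-contained); the real code uses AutoTokenizer.

class SimpleTokenizer:
    """Hypothetical stand-in: splits on whitespace instead of BPE."""

    def __call__(self, text: str) -> dict:
        return {"input_ids": text.split()}

    def decode(self, ids: list) -> str:
        return " ".join(ids)


class Encoder:
    def __init__(self, max_seq_len: int = 4):
        self.max_seq_len = max_seq_len
        self.tokenizer = SimpleTokenizer()

    def _ensure_text_limit(self, text: str) -> str:
        """Truncate `text` to at most `max_seq_len` tokens."""
        tokenized_payload = self.tokenizer(text)
        return self.tokenizer.decode(tokenized_payload["input_ids"][: self.max_seq_len])


enc = Encoder(max_seq_len=4)
print(enc._ensure_text_limit("one two three four five six"))  # -> one two three four
```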

tokenized_payload = self.tokenizer(text)
return self.tokenizer.decode(tokenized_payload["input_ids"][: self.max_seq_len])

def embed(self, model, text: str) -> np.ndarray:

bogdankostic (Contributor):

We should add a type for the model argument.

Comment on lines 422 to 424
for doc in docs:
embedding = self.embed(self.doc_model_encoder_engine, doc.content)
embeddings.append(embedding)

bogdankostic (Contributor):

According to the OpenAI documentation, we can get embeddings for multiple inputs in a single request. My guess is that this would be a bit more efficient than doing one request for each Document.

Also, we should probably take care of OpenAI's rate limit, given that we usually create embeddings for a large number of Documents. Timo worked on a solution for the OpenAIAnswerGenerator in #3078 (that PR unfortunately went stale). Other than that, you might also want to take a look at the notebook by OpenAI on best practices for rate limit handling.
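
The two suggestions above (batching inputs and handling rate limits) can be sketched together. This is a hedged illustration, not the PR's code: `embed_batch` and `with_exponential_backoff` are hypothetical names, and the HTTP request is stubbed out so the retry shape itself is what's shown:

```python
import random
import time
from typing import Callable, List


def with_exponential_backoff(fn: Callable, max_retries: int = 5,
                             base_delay: float = 0.5) -> Callable:
    """Retry `fn` on exception, sleeping base_delay * 2**attempt plus jitter."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return wrapper


def embed_batch(texts: List[str]) -> List[List[float]]:
    """Hypothetical: one POST to /v1/embeddings with `input=texts`,
    instead of one request per Document. Stubbed here with a
    placeholder response so the sketch stays self-contained."""
    return [[0.0] for _ in texts]


safe_embed_batch = with_exponential_backoff(embed_batch)
embeddings = safe_embed_batch(["doc one", "doc two"])
```

In a real encoder the decorated call would wrap the actual API request, so transient 429 responses raised as exceptions get retried with growing delays instead of failing the whole indexing run.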

@@ -1541,6 +1543,7 @@ def __init__(
This approach is also used in the TableTextRetriever paper and is likely to improve
performance if your titles contain meaningful information for retrieval
(topic, entities etc.).
:param api_key: The OpenAI API key

bogdankostic (Contributor):

The docstring should explain that the OpenAI API key is only needed when using a model by OpenAI, and maybe link to the OpenAI page where the user can sign up for an API key.

self.max_seq_len = retriever.max_seq_len
self.url = "https://api.openai.com/v1/embeddings"
self.api_key = retriever.api_key
model_class: str = next(

bogdankostic (Contributor):

Why not just model_class = retriever.embedding_model?

vblagoje (Member, author):

Yeah, good point, but I wanted to handle the case where users accidentally specify the full name of the model. Some might specify "ada", "babbage", etc. and some might specify the full name. This way we handle both use cases properly.
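
The intent described here can be sketched as follows. The helper name and the `ValueError` fallback are assumptions for illustration, not the PR's exact code:

```python
# Infer the short model class ("ada", "babbage", ...) whether the user
# passed the short name or a full engine name such as
# "text-search-babbage-doc-001". Helper name and error handling are
# hypothetical.
MODEL_CLASSES = ("ada", "babbage", "curie", "davinci")


def infer_model_class(embedding_model: str) -> str:
    try:
        return next(m for m in MODEL_CLASSES if m in embedding_model)
    except StopIteration:
        raise ValueError(f"Unsupported OpenAI model: {embedding_model}")


print(infer_model_class("ada"))                          # -> ada
print(infer_model_class("text-search-babbage-doc-001"))  # -> babbage
```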

bogdankostic (Contributor):

Makes sense. I'm just wondering: what if the user wants to use the text-similarity-ada-001 model, for example? In this case, we would silently use text-search-ada-doc-001 / text-search-ada-query-001 without the user knowing.
We should also probably adapt the docstring of the embedding_model param of the EmbeddingRetriever, what do you think?

vblagoje (Member, author):

@bogdankostic I thought about it too, but that should not happen, as the use case does not match. See https://beta.openai.com/docs/guides/embeddings/similarity-embeddings and https://beta.openai.com/docs/guides/embeddings/text-search-embeddings for recommended use cases.

vblagoje (Member, author):

Our use case is definitely Text search embeddings
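
The doc/query split the thread refers to pairs each model class with two text-search engines. A small sketch (the helper name is hypothetical; the engine-name pattern follows the f-string shown in the snippets above):

```python
def engines_for(model_class: str) -> tuple:
    """Return (doc_engine, query_engine) for a model class: documents
    are embedded with the -doc- engine, queries with -query-."""
    return (
        f"text-search-{model_class}-doc-001",
        f"text-search-{model_class}-query-001",
    )


print(engines_for("ada"))
# -> ('text-search-ada-doc-001', 'text-search-ada-query-001')
```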

@vblagoje vblagoje changed the title Add OpenAIEmbeddingEncoder to EmbeddingRetriever feat: Add OpenAIEmbeddingEncoder to EmbeddingRetriever Oct 11, 2022

@bogdankostic (Contributor) left a comment:

Almost good to go; I just proposed some minor improvements.


Comment on lines 423 to 426
batch_limited = []
batch = text[i : i + self.batch_size]
for content in batch:
batch_limited.append(self._ensure_text_limit(content))

bogdankostic (Contributor):

Let's make use of list comprehension here:

Suggested change

Before:
    batch_limited = []
    batch = text[i : i + self.batch_size]
    for content in batch:
        batch_limited.append(self._ensure_text_limit(content))

After:
    batch = text[i : i + self.batch_size]
    batch_limited = [self._ensure_text_limit(content) for content in batch]

self.doc_model_encoder_engine = f"text-search-{model_class}-doc-001"
self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

def _ensure_text_limit(self, text: str):

bogdankostic (Contributor):

Let's add the return type here.

Comment on lines 1546 to 1547
:param api_key: The OpenAI API key. Required if one wants to use OpenAI embeddings. For more
details see https://beta.openai.com/account/api-keys for more details

bogdankostic (Contributor):

"for more details" is doubled here.

@bogdankostic bogdankostic added type:feature New feature or request topic:retriever labels Oct 13, 2022

@bogdankostic (Contributor) left a comment:

LGTM

@vblagoje vblagoje merged commit 159cd5a into main Oct 14, 2022
@vblagoje vblagoje deleted the openai_encoder branch October 14, 2022 13:01
@masci masci mentioned this pull request Dec 19, 2022