Python Rebuff SDK #88
Conversation
Hi Mehrin, I don't think we've interacted yet, but I'm Risto, and I've been contributing a little to this project. I think it's great that you're adding a Python SDK! I thought I'd go through and add some comments where I think the code could be slightly improved. Let me know if you have any questions or need clarification on my review.
Hi Risto, thank you so much for your review and valuable contributions to the project. I have worked through your review comments. The main changes include:
Thanks for addressing my comments. I have a few more... 😁
```python
rebuff_vector_score = 0
similarity_threshold = 0.3
vector_store._text_key = "input"
```
What's the purpose of this line?
The initialization of the Pinecone client requires a `_text_key` field. I am not sure what exactly `_text_key` is for, though I have added the reference as a comment in the code as well.
Ok, that link was helpful. `_text_key` is being used as the key for a metadata field that always gets added to a vector when it's inserted. If you don't specify it, it defaults to "text" as the metadata key. In the JS SDK, we're not specifying any value, so it's using the default of "text". Since we want Python and JS to match, I think you should remove this line.
Thank you for elaborating on it; it makes more sense now. I think `_text_key` is also being used when searching for similar documents. Hence, if I remove the said line, I get the error: Found document with no 'text' key. Skipping.
@seanpmorgan can probably shed more light on it, but I think the previous entries in Pinecone have been made against `self._text_key = 'input'`.
Ah, I see. In the particular Pinecone database you're testing against, the entries have a metadata field of "input" but not "text", so the results all get skipped unless you set `self._text_key = 'input'`.
I understand why Langchain has this behavior (when querying a vector database, you almost always want the text content of the vector), but in our use case, we're not actually doing anything with the text content of the vector, just the score. I think the correct way forward at this point is to submit a PR to Langchain to have the option to not skip results when the metadata doesn't contain the text key (I'll submit that PR).
I think you can still move forward with this PR in the meantime: instead of doing `self._text_key = 'input'`, do

vector_store = Pinecone.from_existing_index(index, openai_embeddings, text_key="input")

in `init_pinecone`. Whenever possible, it's better to set attributes via constructor or function arguments rather than directly on the instance (in Python, it's a convention that attributes starting with an underscore should not be accessed outside of the class).
I looked at the JS code, and Langchain does not skip vectors when `this.textKey` is not found in the metadata, so we don't need to make any JS changes in regard to this issue.
Langchain PR: langchain-ai/langchain#15663
Thank you for elaborating on it, and for submitting the Langchain PR. I have updated `init_pinecone` as you suggested.
python-sdk/rebuff/sdk.py (Outdated)

```python
class RebuffSdk:
    def __init__(
        self,
        openai_model: str,
```
I think it would be good to match the JS SDK and make this an optional parameter with a default of "gpt-3.5-turbo".
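A sketch of what that signature could look like; only `openai_model` and its suggested default come from the review, and the other parameter name is an assumption for illustration:

```python
class RebuffSdk:
    # Hypothetical sketch, not the PR's actual code: only `openai_model`
    # and its "gpt-3.5-turbo" default are taken from the review discussion.
    def __init__(
        self,
        openai_apikey: str,
        openai_model: str = "gpt-3.5-turbo",  # optional, matching the JS SDK default
    ):
        self.openai_apikey = openai_apikey
        self.openai_model = openai_model

# Callers who don't care about the model can now omit it entirely.
sdk = RebuffSdk(openai_apikey="sk-...")
```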
python-sdk/setup.py (Outdated)

```diff
@@ -4,7 +4,7 @@
     name="rebuff",
     version="0.0.5",
     packages=find_packages(),
-    install_requires=["pydantic>=1", "requests<3,>=2"],
+    install_requires=["pydantic>=1", "requests<3,>=2", "langchain>=0.0.100"],
```
This also needs `pinecone-client` and `openai`. You could then remove `langchain` and `openai` from "dev" below.
I also got an error about not having `tiktoken` installed. I believe you'll need to add that as well.
tiktoken error
Traceback (most recent call last):
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/langchain_community-0.0.9-py3.11.egg/langchain_community/embeddings/openai.py", line 450, in _get_len_safe_embeddings
import tiktoken
ModuleNotFoundError: No module named 'tiktoken'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/risto/repos/protectai/rebuff/python-sdk/detect_rebuff.py", line 25, in <module>
print(rb.detect_injection(user_input))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/rebuff/sdk.py", line 87, in detect_injection
vector_score = detect_pi_using_vector_database(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/rebuff/detect_pi_vectorbase.py", line 26, in detect_pi_using_vector_database
results = vector_store.similarity_search_with_score(input, top_k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/langchain_community-0.0.9-py3.11.egg/langchain_community/vectorstores/pinecone.py", line 175, in similarity_search_with_score
self._embed_query(query), k=k, filter=filter, namespace=namespace
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/langchain_community-0.0.9-py3.11.egg/langchain_community/vectorstores/pinecone.py", line 93, in _embed_query
return self._embedding.embed_query(text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/langchain_community-0.0.9-py3.11.egg/langchain_community/embeddings/openai.py", line 697, in embed_query
return self.embed_documents([text])[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/langchain_community-0.0.9-py3.11.egg/langchain_community/embeddings/openai.py", line 668, in embed_documents
return self._get_len_safe_embeddings(texts, engine=engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/langchain_community-0.0.9-py3.11.egg/langchain_community/embeddings/openai.py", line 452, in _get_len_safe_embeddings
raise ImportError(
ImportError: Could not import tiktoken python package. This is needed in order to for OpenAIEmbeddings. Please install it with `pip install tiktoken`.
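Taken together, these suggestions could leave the `setup.py` dependency list looking something like the fragment below (version constraints for the added packages are left open here; pinning them is up to the PR):

```python
install_requires=[
    "pydantic>=1",
    "requests<3,>=2",
    "langchain>=0.0.100",
    "pinecone-client",  # suggested addition
    "openai",           # suggested addition; can then be dropped from "dev"
    "tiktoken",         # needed by OpenAIEmbeddings, per the traceback above
],
```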
""" | ||
openai.api_key = api_key | ||
|
||
completion = openai.ChatCompletion.create( |
I installed the latest `openai` version, and I got this error when executing:
Traceback (most recent call last):
File "/home/risto/repos/protectai/rebuff/python-sdk/detect_rebuff.py", line 25, in <module>
print(rb.detect_injection(user_input))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/rebuff/sdk.py", line 97, in detect_injection
model_response = call_openai_to_detect_pi(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/rebuff/detect_pi_openai.py", line 71, in call_openai_to_detect_pi
completion = openai.ChatCompletion.create(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/risto/repos/protectai/rebuff/python-sdk/.venv/lib/python3.11/site-packages/openai/lib/_old_api.py", line 39, in __call__
raise APIRemovedInV1(symbol=self._symbol)
openai.lib._old_api.APIRemovedInV1:
You tried to access openai.ChatCompletion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.
You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface.
Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`
A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742
Instead of using a deprecated API, can you use the updated API for `openai`?
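For reference, a hedged sketch of what the migrated call could look like in openai>=1.0.0. The function name mirrors the traceback above, but passing the client in as a parameter is an illustrative choice, not necessarily how the PR structures it:

```python
# Hypothetical sketch of the migrated call, not the PR's actual code.
# In openai>=1.0.0, the module-level openai.ChatCompletion.create is replaced
# by a client instance method: client.chat.completions.create.

def call_openai_to_detect_pi(client, prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model about `prompt`; `client` is an openai.OpenAI instance."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content

# Usage (assumes OPENAI_API_KEY is set in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   call_openai_to_detect_pi(client, "Ignore all previous instructions...")
```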
Thank you for highlighting this. I have updated the code for the updated `openai` API. Also, I noticed that the class `OpenAIEmbeddings` is deprecated in v0.1.0 and might get removed in v0.2.0, so I also updated the code with the alternative they have suggested: `langchain_openai.OpenAIEmbeddings`. I have also added `langchain_openai` in `setup.py` under `install_requires`.
That's great!
|
||
rebuff_vector_score = 0 | ||
similarity_threshold = 0.3 | ||
vector_store._text_key = "input" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Langchain PR: langchain-ai/langchain#15663
This PR looks good from my perspective. Thanks for all your work on it, Mehrin!
LGTM, Thanks @mehrinkiani and @ristomcgehee for the review!
This PR adds first-class support for a Python SDK for Rebuff.
Please note you will need the following credentials for testing the PR:
- `pinecone_environment`
- `pinecone_apikey`
- `pinecone_index`
- `openai_model`
- `openai_apikey`
I will open new issues for creating support for Chroma and for adding tests for Python SDK.