Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

Closed
mirodrr opened this issue Nov 16, 2023 · 13 comments · Fixed by #603
Closed

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

mirodrr opened this issue Nov 16, 2023 · 13 comments · Fixed by #603
Assignees
Labels
bug Something isn't working

Comments

@mirodrr
Copy link

mirodrr commented Nov 16, 2023

What is the bug?

A clear and concise description of the bug.
When using LangChain OpenSearchVectorSearch with OpenSearchServerless and doing:

docsearch = OpenSearchVectorSearch(index_name=ELASTICSEARCH_INDEX,
                                       embedding_function=embeddings,
                                       opensearch_url=f'https://{serverless_collection_host}:443',
                                       **opensearch_serverless_args)
docsearch.add_texts(all_text_split, metadatas=metadatas)

OR

docsearch = OpenSearchVectorSearch.from_texts(
                all_text_split,
                embeddings,
                metadatas=metadatas,
                opensearch_url=f'https://{serverless_collection_host}:443',
                bulk_size=12000,
                index_name=ELASTICSEARCH_INDEX,
                **opensearch_serverless_args
            )

I get the error

'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}

This same code works fine on 2.3.2.

How can one reproduce the bug?

Steps to reproduce the behavior.
Create a "OpenSearchVectorSearch" in LangChain when using opensearch-py 2.4.0, and try to upload documents to it

What is the expected behavior?

A clear and concise description of what you expected to happen.
Documents upload successfully

What is your host/environment?

Operating system, version.
Python Lambda image. Specifically public.ecr.aws/lambda/python:3.9

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.
No

Do you have any additional context?

Add any other context about the problem.
No

@mirodrr mirodrr added bug Something isn't working untriaged Need triage labels Nov 16, 2023
@dblock dblock self-assigned this Nov 17, 2023
@dblock dblock removed the untriaged Need triage label Nov 17, 2023
@dblock
Copy link
Member

dblock commented Nov 17, 2023

I can reproduce this against OpenSearch Serverless w/2.4.1. Code in https://github.com/dblock/opensearch-langchain.

Local OpenSearch

This works with all of OSS OpenSearch, and Amazon Managed OpenSearch and Serverless.

#!/usr/bin/env python3

from os import environ
from typing import List
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings

fake_texts = ["foo", "bar", "baz"]

class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]
    
docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts, 
    FakeEmbeddings(), 
    opensearch_url='https://localhost:9200', 
    verify_certs=False, 
    http_auth=("admin", "admin")
)

OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)

Amazon OpenSearch

This works with OSS OpenSearch, Amazon Managed OpenSearch, but not with Serverless on 2.4.1.

#!/usr/bin/env python3
import logging
from os import environ
from typing import List
from urllib.parse import urlparse
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, __versionstr__
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
from boto3 import Session

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

opensearch_url = environ['ENDPOINT']
url = urlparse(opensearch_url)
region = environ.get('AWS_REGION', 'us-east-1')
service = environ.get('SERVICE', 'es')
credentials = Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

print(f"Using opensearch-py {__versionstr__}")

fake_texts = ["foo", "bar", "baz"]

class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]
    
docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts, 
    FakeEmbeddings(), 
    opensearch_url=opensearch_url,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    http_auth=auth
)

OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)
$ poetry run python hello.py 
INFO:Found credentials in environment variables.
Using opensearch-py 2.3.2
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:404 request:0.306s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.346s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:11.820s]
INFO:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.330s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.458s]
$ poetry run python hello.py 
INFO:Found credentials in environment variables.
Using opensearch-py 2.4.1
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:404 request:0.386s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:200 request:0.356s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.262s]
Traceback (most recent call last):
  File "/Users/dblock/source/langchain/hello/hello.py", line 31, in <module>
    docsearch = OpenSearchVectorSearch.from_texts(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 777, in from_texts
    return cls.from_embeddings(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 901, in from_embeddings
    _bulk_ingest_embeddings(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 138, in _bulk_ingest_embeddings
    bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 425, in bulk
    for ok, item in streaming_bulk(client, actions, ignore_status=ignore_status, *args, **kwargs):  # type: ignore
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 338, in streaming_bulk
    for data, (ok, info) in zip(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 273, in _process_bulk_chunk
    for item in gen:
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 202, in _process_bulk_chunk_success
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
opensearchpy.helpers.errors.BulkIndexError: ('3 document(s) failed to index.', [{'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '3ac5cf95-7f07-4d95-a44f-0615bb76aad1', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '0d234ec6-f620-4013-870e-1a082b813c94', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '7c54ac77-3853-4c70-a01b-55dae4de1115', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}}}}])

@harshavamsi
Copy link
Collaborator

Uses bulk

@navneet1v
Copy link

AOSS doesn't support passing _id in bulk. @dblock did we change anything from opensearch-py client?

@dblock

@navneet1v
Copy link

navneet1v commented Nov 17, 2023

@dblock I remember we added a check in langchain to identity a diff between aoss and AOS Ref: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83-L91 . it was done using service attribute of AWSV4SignerAuth.

does that get changed?

@dblock
Copy link
Member

dblock commented Nov 17, 2023

I bisected this to #547.

We are sending different data!

method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
data: b'{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{},"id":"2a31333a-974f-4641-baf7-41072f221329"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{},"id":"c04199df-3304-416c-a74e-5f6150cbfe22"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{},"id":"da0b79e7-8e5c-4945-ab77-d8be9d5c197e"}\n
method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
body: b'{"index":{"_id":"db460c2e-8775-4e0d-b040-2acab808817c","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{}}\n{"index":{"_id":"e2977ca9-3394-4f59-9c8e-f70dd5685c91","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{}}\n{"index":{"_id":"3bf8f6ed-17ed-4f90-a047-5dc0284f1f08","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{}}\n'

@navneet1v
Copy link

@dblock this is interesting. Are we raising a PR to fix this or reverting the change?

@dblock
Copy link
Member

dblock commented Nov 17, 2023

The problem is here: https://github.com/langchain-ai/langchain/blob/b4312aac5c0567088353178fb70fdb356b372e12/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83

The code relies on signer#service, which is really an implementation detail, to figure out whether we're talking to AOS or AOSS. The data sent into bulk changes in LangChain depending on that.

[{'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, 'id': 'b46ca273-f691-4b09-9ef8-6e78bd9d0e30'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, 'id': 'a981f234-1fb2-4646-9e1c-bd36a3de21ab'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, 'id': '5a10a420-7a3a-4902-bcec-b7f18f8dde21'}]
[{'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, '_id': '7a0c551c-1764-4e93-9288-0185671a1cf1'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, '_id': '54f9d0d2-6ba9-4bc6-96ab-468400e35790'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, '_id': 'c1a0cc14-80ae-48f7-9b21-475a6fbf5022'}]

@dblock
Copy link
Member

dblock commented Nov 17, 2023

Fix in #603.

@navneet1v
Copy link

navneet1v commented Nov 17, 2023

@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it

@dblock dblock changed the title [BUG] LangChain OpenSearchVectorSearch broken with 2.4.0 and 2.4.1 [BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS Nov 17, 2023
@dblock
Copy link
Member

dblock commented Nov 17, 2023

@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it

I don't believe that's correct, see #603 (comment)

@dblock
Copy link
Member

dblock commented Nov 19, 2023

We've released 2.4.2 with a fix.

@JustinYuYou
Copy link

Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock

@dblock
Copy link
Member

dblock commented Feb 9, 2024

Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock

Can you please open an issue in opensearch-js?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants