[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

mirodrr · 2023-11-16T23:33:21Z

What is the bug?

A clear and concise description of the bug.
When using LangChain OpenSearchVectorSearch with OpenSearchServerless and doing:

docsearch = OpenSearchVectorSearch(index_name=ELASTICSEARCH_INDEX,
                                       embedding_function=embeddings,
                                       opensearch_url=f'https://{serverless_collection_host}:443',
                                       **opensearch_serverless_args)
docsearch.add_texts(all_text_split, metadatas=metadatas)

OR

docsearch = OpenSearchVectorSearch.from_texts(
                all_text_split,
                embeddings,
                metadatas=metadatas,
                opensearch_url=f'https://{serverless_collection_host}:443',
                bulk_size=12000,
                index_name=ELASTICSEARCH_INDEX,
                **opensearch_serverless_args
            )

I get the error

'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}

This same code works fine on 2.3.2.

How can one reproduce the bug?

Steps to reproduce the behavior.
Create a "OpenSearchVectorSearch" in LangChain when using opensearch-py 2.4.0, and try to upload documents to it

What is the expected behavior?

A clear and concise description of what you expected to happen.
Documents upload successfully

What is your host/environment?

Operating system, version.
Python Lambda image. Specifically public.ecr.aws/lambda/python:3.9

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.
No

Do you have any additional context?

Add any other context about the problem.
No

The text was updated successfully, but these errors were encountered:

dblock · 2023-11-17T17:30:53Z

I can reproduce this against OpenSearch Serverless w/2.4.1. Code in https://github.com/dblock/opensearch-langchain.

Local OpenSearch

This works with all of OSS OpenSearch, and Amazon Managed OpenSearch and Serverless.

#!/usr/bin/env python3

from os import environ
from typing import List
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings

fake_texts = ["foo", "bar", "baz"]

class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]
    
docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts, 
    FakeEmbeddings(), 
    opensearch_url='https://localhost:9200', 
    verify_certs=False, 
    http_auth=("admin", "admin")
)

OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)

Amazon OpenSearch

This works with OSS OpenSearch, Amazon Managed OpenSearch, but not with Serverless on 2.4.1.

#!/usr/bin/env python3
import logging
from os import environ
from typing import List
from urllib.parse import urlparse
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection, __versionstr__
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.schema.embeddings import Embeddings
from boto3 import Session

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO)

opensearch_url = environ['ENDPOINT']
url = urlparse(opensearch_url)
region = environ.get('AWS_REGION', 'us-east-1')
service = environ.get('SERVICE', 'es')
credentials = Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, service)

print(f"Using opensearch-py {__versionstr__}")

fake_texts = ["foo", "bar", "baz"]

class FakeEmbeddings(Embeddings):
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(1.0)] * 9 + [float(i)] for i in range(len(texts))]

    def embed_query(self, text: str) -> List[float]:
        return [float(1.0)] * 9 + [float(0.0)]
    
docsearch = OpenSearchVectorSearch.from_texts(
    fake_texts, 
    FakeEmbeddings(), 
    opensearch_url=opensearch_url,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    http_auth=auth
)

OpenSearchVectorSearch.add_texts(
    docsearch, fake_texts, vector_field="my_vector", text_field="custom_text"
)

$ poetry run python hello.py 
INFO:Found credentials in environment variables.
Using opensearch-py 2.3.2
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:404 request:0.306s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.346s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:11.820s]
INFO:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/c0220c8ee6e4474289cbc507183ec4ec [status:200 request:0.330s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.458s]

$ poetry run python hello.py 
INFO:Found credentials in environment variables.
Using opensearch-py 2.4.1
WARNING:GET https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:404 request:0.386s]
INFO:PUT https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/976d32bd8a26420c82de3908337e14ce [status:200 request:0.356s]
INFO:POST https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk [status:200 request:0.262s]
Traceback (most recent call last):
  File "/Users/dblock/source/langchain/hello/hello.py", line 31, in <module>
    docsearch = OpenSearchVectorSearch.from_texts(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 777, in from_texts
    return cls.from_embeddings(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 901, in from_embeddings
    _bulk_ingest_embeddings(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/langchain/vectorstores/opensearch_vector_search.py", line 138, in _bulk_ingest_embeddings
    bulk(client, requests, max_chunk_bytes=max_chunk_bytes)
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 425, in bulk
    for ok, item in streaming_bulk(client, actions, ignore_status=ignore_status, *args, **kwargs):  # type: ignore
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 338, in streaming_bulk
    for data, (ok, info) in zip(
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 273, in _process_bulk_chunk
    for item in gen:
  File "/Users/dblock/Library/Caches/pypoetry/virtualenvs/package-O7W1AkK5-py3.9/lib/python3.9/site-packages/opensearchpy/helpers/actions.py", line 202, in _process_bulk_chunk_success
    raise BulkIndexError("%i document(s) failed to index." % len(errors), errors)
opensearchpy.helpers.errors.BulkIndexError: ('3 document(s) failed to index.', [{'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '3ac5cf95-7f07-4d95-a44f-0615bb76aad1', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '0d234ec6-f620-4013-870e-1a082b813c94', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}}}}, {'index': {'_index': '976d32bd8a26420c82de3908337e14ce', '_id': '7c54ac77-3853-4c70-a01b-55dae4de1115', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Document ID is not supported in create/index operation request'}, 'data': {'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}}}}])

harshavamsi · 2023-11-17T17:36:53Z

Uses bulk

navneet1v · 2023-11-17T17:41:27Z

AOSS doesn't support passing _id in bulk. @dblock did we change anything from opensearch-py client?

@dblock

navneet1v · 2023-11-17T17:57:09Z

@dblock I remember we added a check in langchain to identity a diff between aoss and AOS Ref: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83-L91 . it was done using service attribute of AWSV4SignerAuth.

does that get changed?

dblock · 2023-11-17T18:04:13Z

I bisected this to #547.

We are sending different data!

method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
data: b'{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{},"id":"2a31333a-974f-4641-baf7-41072f221329"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{},"id":"c04199df-3304-416c-a74e-5f6150cbfe22"}\n{"index":{"_index":"7c0a54aacdac4969ab44c72902977267"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{},"id":"da0b79e7-8e5c-4945-ab77-d8be9d5c197e"}\n

method: POST
url: https://a8x6ix5hz8w4t08uepm3.us-west-2.aoss.amazonaws.com:443/_bulk
body: b'{"index":{"_id":"db460c2e-8775-4e0d-b040-2acab808817c","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0],"text":"foo","metadata":{}}\n{"index":{"_id":"e2977ca9-3394-4f59-9c8e-f70dd5685c91","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0],"text":"bar","metadata":{}}\n{"index":{"_id":"3bf8f6ed-17ed-4f90-a047-5dc0284f1f08","_index":"6c5fdfc0160748a2bf30f45763762c46"}}\n{"vector_field":[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0],"text":"baz","metadata":{}}\n'

navneet1v · 2023-11-17T19:04:07Z

@dblock this is interesting. Are we raising a PR to fix this or reverting the change?

dblock · 2023-11-17T19:25:55Z

The problem is here: https://github.com/langchain-ai/langchain/blob/b4312aac5c0567088353178fb70fdb356b372e12/libs/langchain/langchain/vectorstores/opensearch_vector_search.py#L83

The code relies on signer#service, which is really an implementation detail, to figure out whether we're talking to AOS or AOSS. The data sent into bulk changes in LangChain depending on that.

[{'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, 'id': 'b46ca273-f691-4b09-9ef8-6e78bd9d0e30'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, 'id': 'a981f234-1fb2-4646-9e1c-bd36a3de21ab'}, {'_op_type': 'index', '_index': '6affa81721694f1280bb351dc19ea6fc', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, 'id': '5a10a420-7a3a-4902-bcec-b7f18f8dde21'}]

[{'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0], 'text': 'foo', 'metadata': {}, '_id': '7a0c551c-1764-4e93-9288-0185671a1cf1'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'text': 'bar', 'metadata': {}, '_id': '54f9d0d2-6ba9-4bc6-96ab-468400e35790'}, {'_op_type': 'index', '_index': '0ad24e52981647afbea4784d3dfcd73b', 'vector_field': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0], 'text': 'baz', 'metadata': {}, '_id': 'c1a0cc14-80ae-48f7-9b21-475a6fbf5022'}]

dblock · 2023-11-17T19:32:49Z

Fix in #603.

navneet1v · 2023-11-17T19:36:31Z

@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it

dblock · 2023-11-17T19:53:45Z

@dblock its not necessary related to AOSS. Even in AOS/OSS, if user is asking for OpenSearch service to generate the ids for the data, os client should not send _id in the bulk request. We should respect the customer input, and seems like urlib3 is not respecting it

I don't believe that's correct, see #603 (comment)

dblock · 2023-11-19T23:19:29Z

We've released 2.4.2 with a fix.

JustinYuYou · 2024-02-08T23:15:09Z

Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock

dblock · 2024-02-09T13:46:13Z

Hi, I am running into the exact same issue in javascript sdk. Is there a way you guys fix it there as well? @dblock

Can you please open an issue in opensearch-js?

mirodrr added bug Something isn't working untriaged Need triage labels Nov 16, 2023

dblock self-assigned this Nov 17, 2023

dblock removed the untriaged Need triage label Nov 17, 2023

harshavamsi mentioned this issue Nov 17, 2023

Fix: TypeError on calling parallel_bulk. #601

Merged

rishabh6788 mentioned this issue Nov 17, 2023

Switching to the default connection class to address slower connection with RequestsHttpConnection opensearch-project/opensearch-benchmark#411

Merged

1 task

dblock mentioned this issue Nov 17, 2023

Fix: re-add service to fix integration with LangChain. #603

Merged

dblock changed the title ~~[BUG] LangChain OpenSearchVectorSearch broken with 2.4.0 and 2.4.1~~ [BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS Nov 17, 2023

dblock closed this as completed in #603 Nov 17, 2023

This was referenced Jan 14, 2024

[Bug]: OpenSearchVectorStore not compatible with OpenSearch Serverless on AWS run-llama/llama_index#10042

Closed

Changes mirroring langchain to allow for serverless open search run-llama/llama_index#10043

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

mirodrr commented Nov 16, 2023 •

edited

Loading

dblock commented Nov 17, 2023 •

edited

Loading

harshavamsi commented Nov 17, 2023

navneet1v commented Nov 17, 2023

navneet1v commented Nov 17, 2023 •

edited

Loading

dblock commented Nov 17, 2023

navneet1v commented Nov 17, 2023

dblock commented Nov 17, 2023

dblock commented Nov 17, 2023

navneet1v commented Nov 17, 2023 •

edited

Loading

dblock commented Nov 17, 2023

dblock commented Nov 19, 2023

JustinYuYou commented Feb 8, 2024

dblock commented Feb 9, 2024

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

[BUG] LangChain OpenSearchVectorSearch broken with 2.4.1 w/AOSS #600

Comments

mirodrr commented Nov 16, 2023 • edited Loading

What is the bug?

How can one reproduce the bug?

What is the expected behavior?

What is your host/environment?

Do you have any screenshots?

Do you have any additional context?

dblock commented Nov 17, 2023 • edited Loading

Local OpenSearch

Amazon OpenSearch

harshavamsi commented Nov 17, 2023

navneet1v commented Nov 17, 2023

navneet1v commented Nov 17, 2023 • edited Loading

dblock commented Nov 17, 2023

navneet1v commented Nov 17, 2023

dblock commented Nov 17, 2023

dblock commented Nov 17, 2023

navneet1v commented Nov 17, 2023 • edited Loading

dblock commented Nov 17, 2023

dblock commented Nov 19, 2023

JustinYuYou commented Feb 8, 2024

dblock commented Feb 9, 2024

mirodrr commented Nov 16, 2023 •

edited

Loading

dblock commented Nov 17, 2023 •

edited

Loading

navneet1v commented Nov 17, 2023 •

edited

Loading

navneet1v commented Nov 17, 2023 •

edited

Loading