OpenAIEmbeddings causes CUDA bug #27266

pengfeihe2024 · 2024-10-11T05:59:46Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Working code

from openai import OpenAI
from langchain_openai import AzureOpenAIEmbeddings, OpenAIEmbeddings
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

models = client.models.list()
model = 'BAAI/bge-en-icl'

responses = client.embeddings.create(
    input=[
        "Hello my name is",
        "The best thing about vLLM is that it supports many different models",
        "annual wellness",
        "What is an Annual Wellness Visit? An Annual Wellness Visit (ANNUAL WELLNESS VISIT) is a yearly appointment with your healthcare provider focused on preventive care."
    ],
    model=model,
)
for data in responses.data:
    # print(data.embedding)  # list of float of len 4096
    print(len(data.embedding))

Non-working code (triggers the vLLM index-select error on some token sequences)

from langchain_openai import OpenAIEmbeddings

# LangChain wrapper pointed at the same local vLLM server as the working code
embeddings = OpenAIEmbeddings(
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="token-abc123",
    model="BAAI/bge-en-icl",
    openai_api_type="openai",
    chunk_size=1,
)
# text = "what is an annual anual visit"
text = "annual wellness"  # this input triggers the error
query_result = embeddings.embed_query(text)
print(len(query_result))
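
One variation that may help narrow this down, offered as a diagnostic rather than a confirmed fix. As far as I know, OpenAIEmbeddings tokenizes the input on the client with tiktoken by default and sends token ids instead of raw text, and the check_embedding_ctx_length flag turns that step off. Whether the flag exists and behaves this way in langchain_openai 0.2.2 is an assumption to verify against the installed version. build_kwargs and make_embeddings are hypothetical helper names.

```python
def build_kwargs(base_url: str, api_key: str, model: str) -> dict:
    """Constructor arguments for the diagnostic run (same server, same model)."""
    return {
        "openai_api_base": base_url,
        "openai_api_key": api_key,
        "model": model,
        # Assumption: with this flag off, the wrapper skips client-side
        # tiktoken tokenization and sends the raw string to the server.
        "check_embedding_ctx_length": False,
    }


def make_embeddings(kwargs: dict):
    """Build the LangChain wrapper; imported lazily so build_kwargs has no dependencies."""
    from langchain_openai import OpenAIEmbeddings

    return OpenAIEmbeddings(**kwargs)


kwargs = build_kwargs("http://localhost:8000/v1", "token-abc123", "BAAI/bge-en-icl")

# With the vLLM server from the reproduction steps running:
#   embeddings = make_embeddings(kwargs)
#   print(len(embeddings.embed_query("annual wellness")))
```

If the error disappears with this change, that would point at the client-side tokenization step rather than at the HTTP call itself.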

Error Message and Stack Trace (if applicable)

../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR 10-10 22:51:09 engine.py:157] RuntimeError('CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`')
ERROR 10-10 22:51:09 engine.py:157] Traceback (most recent call last):
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 155, in start
ERROR 10-10 22:51:09 engine.py:157]     self.run_engine_loop()
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop
ERROR 10-10 22:51:09 engine.py:157]     request_outputs = self.engine_step()
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step
ERROR 10-10 22:51:09 engine.py:157]     raise e
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step
ERROR 10-10 22:51:09 engine.py:157]     return self.engine.step()
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1264, in step
ERROR 10-10 22:51:09 engine.py:157]     outputs = self.model_executor.execute_model(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 130, in execute_model
ERROR 10-10 22:51:09 engine.py:157]     output = self.driver_worker.execute_model(execute_model_req)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 10-10 22:51:09 engine.py:157]     output = self.model_runner.execute_model(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 10-10 22:51:09 engine.py:157]     return func(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/embedding_model_runner.py", line 115, in execute_model
ERROR 10-10 22:51:09 engine.py:157]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama_embedding.py", line 41, in forward
ERROR 10-10 22:51:09 engine.py:157]     return self.model.forward(input_ids, positions, kv_caches,
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 329, in forward
ERROR 10-10 22:51:09 engine.py:157]     hidden_states, residual = layer(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 251, in forward
ERROR 10-10 22:51:09 engine.py:157]     hidden_states = self.self_attn(
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 178, in forward
ERROR 10-10 22:51:09 engine.py:157]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 10-10 22:51:09 engine.py:157]     return self._call_impl(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 10-10 22:51:09 engine.py:157]     return forward_call(*args, **kwargs)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 367, in forward
ERROR 10-10 22:51:09 engine.py:157]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 10-10 22:51:09 engine.py:157]   File "/home/pii/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 135, in apply
ERROR 10-10 22:51:09 engine.py:157]     return F.linear(x, layer.weight, bias)
ERROR 10-10 22:51:09 engine.py:157] RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
CRITICAL 10-10 22:51:09 launcher.py:72] AsyncLLMEngine has failed, terminating server process
INFO:     127.0.0.1:33778 - "POST /v1/embeddings HTTP/1.1" 500 Internal Server Error
...
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1231: indexSelectSmallIndex: block: [27,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
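
The assertion `srcIndex < srcSelectDimSize` is a bounds check in an index-select kernel: some index is not smaller than the size of the dimension it selects from. An embedding lookup is one operation of that kind, so one possibility (an assumption, not something the trace above confirms) is that the request contained token ids outside the model's vocabulary. The CUBLAS_STATUS_NOT_INITIALIZED error is probably a follow-on failure after the device-side assert. A minimal sketch of the same check in plain Python, where the vocabulary size and token ids are made-up illustrative values and out_of_range_ids is a hypothetical helper:

```python
def out_of_range_ids(token_ids: list[int], vocab_size: int) -> list[int]:
    """Return the ids that would fail an `index < size` check on a table with vocab_size rows."""
    return [i for i in token_ids if not 0 <= i < vocab_size]


# Hypothetical values, for illustration only
vocab_size = 32_000
token_ids = [1, 15_043, 99_999]

bad = out_of_range_ids(token_ids, vocab_size)
print(bad)
```

Running a check like this on the ids actually sent to the server, against the real vocabulary size of the served model, would confirm or rule out this explanation.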

Description

I am testing embedding generation against a vLLM endpoint through the LangChain embedding wrapper. The non-working code, based on LangChain's OpenAIEmbeddings, triggers a CUDA error on the vLLM side.
I believe the bug is in LangChain's OpenAIEmbeddings because the equivalent code based on the OpenAI client works, while the LangChain version does not. In addition, neither quantization nor parallelization is enabled on the vLLM side.

To reproduce the error:

1. Install the packages vLLM requires and run vllm serve BAAI/bge-en-icl
2. Run the two scripts above

The working code runs fine on any text input. The non-working code fails on some token sequences; for example, it fails for the input text "annual wellness".
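
Since both scripts target the same server and model, comparing the request bodies they send to /v1/embeddings is a direct way to locate the difference. The OpenAI embeddings API accepts either text or arrays of token ids as `input`, so one thing to check is which form each script uses. Below is a small sketch for classifying a captured body. How the body is captured (server logs, a proxy) is left open, input_kind is a hypothetical helper, and the two sample bodies, including the token ids, are made up for illustration.

```python
import json


def input_kind(payload: dict) -> str:
    """Classify the `input` field of an embeddings request body."""
    value = payload["input"]
    if isinstance(value, str):
        return "text"
    if isinstance(value, list) and value:
        first = value[0]
        if isinstance(first, str):
            return "text"
        if isinstance(first, int):
            return "token_ids"
        if isinstance(first, list) and all(isinstance(t, int) for t in first):
            return "token_ids"
    return "unknown"


# Hypothetical captured bodies; the token ids are made-up numbers.
from_openai_client = json.loads('{"model": "BAAI/bge-en-icl", "input": ["annual wellness"]}')
from_wrapper = json.loads('{"model": "BAAI/bge-en-icl", "input": [[64709, 39890]]}')

print(input_kind(from_openai_client), input_kind(from_wrapper))
```

If the wrapper's body turns out to contain token ids, the tokenizer that produced them may not match the one used by the served model, which would be consistent with the failure depending on the input text.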

System Info

System Information

OS: Linux
OS Version: #129~20.04.1-Ubuntu SMP Wed Aug 7 13:07:13 UTC 2024
Python Version: 3.9.20 (main, Oct 3 2024, 07:27:41)
[GCC 11.2.0]

Package Information

langchain_core: 0.3.10
langchain: 0.3.3
langchain_community: 0.2.7
langsmith: 0.1.130
langchain_experimental: 0.0.62
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.8
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.0
orjson: 3.10.7
packaging: 24.1
pydantic: 2.7.4
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
tenacity: 8.5.0
tiktoken: 0.7.0
typing-extensions: 4.12.2


kodychik · 2024-10-13T22:58:28Z

Hi, can I work on this issue?
Thanks

dosubot · 2025-01-12T16:03:51Z

Hi, @pengfeihe2024. I'm Dosu, and I'm helping the LangChain team manage their backlog. I'm marking this issue as stale.

Issue Summary

You reported a CUDA error with the OpenAIEmbeddings class during the embed_query method.
The error persists despite updating to the latest version of LangChain.
Example code was provided to illustrate the problem.
@kodychik has shown interest in working on this issue.

Next Steps

Please confirm if this issue is still relevant with the latest version of the LangChain repository. If so, you can keep the discussion open by commenting here.
If there is no further activity, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot bot added the "🤖:bug" label (Related to a bug, vulnerability, unexpected error with an existing feature) on Oct 11, 2024

dosubot bot added the "stale" label (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) on Jan 12, 2025
