Add vllm support for embedding endpoint #3435

Closed
ephraimrothschild opened this issue Aug 30, 2024 · 3 comments · Fixed by #3440
Labels: enhancement (New feature or request), roadmap

Comments

@ephraimrothschild

Is your feature request related to a problem? Please describe.

vLLM has added support for running embedding models like intfloat/e5-mistral-7b-instruct, which works with its native OpenAI-compatible server. When I send a request to /v1/embeddings with LocalAI running, I get the following error:

rpc error: code = Unimplemented desc = Unexpected <class 'NotImplementedError'>: Method not implemented!

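For reference, a request of roughly this shape reproduces the error (sketch only; it assumes LocalAI listening on its default port 8080 and the model configuration shown further down, with the standard OpenAI embeddings payload):

import requests

# Minimal reproduction: POST to LocalAI's OpenAI-compatible embeddings endpoint.
# Port 8080 and the model name are assumptions (LocalAI's default port; the
# model from the config below).
resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": "The food was delicious and the waiter was friendly.",
    },
)
print(resp.status_code)
print(resp.json())  # comes back as the 500 "Method not implemented!" error above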
Describe the solution you'd like

I'd like to be able to run embedding models backed by vLLM through LocalAI as well. Sending the same request to the same endpoint against the standalone vLLM Docker container already works, but I would like to manage this through LocalAI.

Describe alternatives you've considered

While in theory I could run a vLLM instance with this model on a different port, the main value of LocalAI to me is being able to manage the different models and start and stop backend instances based on what is requested. Since vLLM already supports this, my hope is that it isn't too much of a lift to enable it via LocalAI as well.

ephraimrothschild added the enhancement label Aug 30, 2024
@Nyralei (Contributor) commented Aug 30, 2024

It's /embeddings, not /v1/embeddings
Try this
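For example, a quick check against that path might look like this (a sketch, assuming the default LocalAI port and the model name from this issue):

import requests

# Hypothetical check against the /embeddings path (no /v1 prefix); the port
# and model name are assumptions taken from the issue, not fixed values.
resp = requests.post(
    "http://localhost:8080/embeddings",
    json={"model": "intfloat/e5-mistral-7b-instruct", "input": "hello world"},
)
print(resp.status_code, resp.json())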

@ephraimrothschild (Author) commented Aug 30, 2024

@Nyralei - I've tried both, and both seem to have the same behavior. Sending requests to both /embeddings and /v1/embeddings, I get the following response:

{
    "error": {
        "code": 500,
        "message": "rpc error: code = Unimplemented desc = Unexpected <class 'NotImplementedError'>: Method not implemented!",
        "type": ""
    }
}

For reference, here is the model template:

name: intfloat/e5-mistral-7b-instruct
backend: vllm
parameters:
  model: "intfloat/e5-mistral-7b-instruct"
gpu_memory_utilization: 0.95
max_model_len: 32768
cuda: true

One thing to note - both /embeddings and /v1/embeddings work exactly as expected when I change only the backend parameter from vllm to transformers. LocalAI in its current state (i.e. with the vllm backend) also loads the model into memory, but then fails to return a response.
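For comparison, the configuration that works is identical apart from the backend field (reproduced here for reference; the vLLM-specific options are left untouched, exactly as in the failing config):

name: intfloat/e5-mistral-7b-instruct
backend: transformers
parameters:
  model: "intfloat/e5-mistral-7b-instruct"
gpu_memory_utilization: 0.95
max_model_len: 32768
cuda: true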

mudler added the roadmap label Aug 31, 2024
@mudler (Owner) commented Aug 31, 2024

That should be quite straightforward to add - I can confirm that this is currently not supported, as it is not implemented.
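Roughly, the missing piece is wiring the backend's embedding handler to vLLM's embedding API. For illustration only, a sketch using vLLM's offline LLM.encode API (the real LocalAI backend runs an async engine behind gRPC with its own protobuf types, so nothing here is the final implementation; the fix itself landed via #3440):

from vllm import LLM

# Illustrative only: load an embedding-capable model and compute an embedding
# with vLLM. Engine options mirror the config from this issue.
llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",
    gpu_memory_utilization=0.95,
    max_model_len=32768,
)

def embed(text: str) -> list[float]:
    # LLM.encode() runs the model in embedding mode and returns one
    # EmbeddingRequestOutput per prompt; .outputs.embedding is the vector.
    outputs = llm.encode([text])
    return outputs[0].outputs.embedding

vector = embed("The food was delicious and the waiter was friendly.")
print(len(vector))  # embedding dimension (4096 for e5-mistral-7b-instruct)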
