Add vllm support for embedding endpoint #3435

Closed
ephraimrothschild opened this issue Aug 30, 2024 · 3 comments · Fixed by #3440
Labels: enhancement (New feature or request), roadmap

Comments

@ephraimrothschild

Is your feature request related to a problem? Please describe.

vLLM has added support for running embedding models like intfloat/e5-mistral-7b-instruct, which works with its native OpenAI-compatible server. When I send a request to /v1/embeddings with LocalAI running, I get the following error:

rpc error: code = Unimplemented desc = Unexpected <class 'NotImplementedError'>: Method not implemented!

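For reference, a request of roughly this shape reproduces the error (sketch only; it assumes LocalAI listening on its default port 8080 and the model configuration shown further down, with the standard OpenAI embeddings payload):

import requests

# Minimal reproduction: POST to LocalAI's OpenAI-compatible embeddings endpoint.
# Port 8080 and the model name are assumptions (LocalAI's default port; the
# model from the config below).
resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "intfloat/e5-mistral-7b-instruct",
        "input": "The food was delicious and the waiter was friendly.",
    },
)
print(resp.status_code)
print(resp.json())  # comes back as the 500 "Method not implemented!" error above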
Describe the solution you'd like

I'd like to be able to run embedding models backed by vLLM through LocalAI as well. Sending the same request to the same endpoint against the standalone vLLM Docker container already works, but I would like to manage this through LocalAI.

Describe alternatives you've considered

While in theory I could run a vLLM instance with this model on a different port, the main value of LocalAI to me is being able to manage the different models and start and stop backend instances based on what is requested. Since vLLM already supports this, my hope is that it isn't too much of a lift to enable it via LocalAI as well.

ephraimrothschild added the enhancement label Aug 30, 2024
@Nyralei (Contributor) commented Aug 30, 2024

It's /embeddings, not /v1/embeddings
Try this
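For example, a quick check against that path might look like this (a sketch, assuming the default LocalAI port and the model name from this issue):

import requests

# Hypothetical check against the /embeddings path (no /v1 prefix); the port
# and model name are assumptions taken from the issue, not fixed values.
resp = requests.post(
    "http://localhost:8080/embeddings",
    json={"model": "intfloat/e5-mistral-7b-instruct", "input": "hello world"},
)
print(resp.status_code, resp.json())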

@ephraimrothschild (Author) commented Aug 30, 2024

@Nyralei - I've tried both, and both seem to have the same behavior. Sending requests to both /embeddings and /v1/embeddings, I get the following response:

{
    "error": {
        "code": 500,
        "message": "rpc error: code = Unimplemented desc = Unexpected <class 'NotImplementedError'>: Method not implemented!",
        "type": ""
    }
}

For reference, here is the model template:

name: intfloat/e5-mistral-7b-instruct
backend: vllm
parameters:
  model: "intfloat/e5-mistral-7b-instruct"
gpu_memory_utilization: 0.95
max_model_len: 32768
cuda: true

One thing to note - both /embeddings and /v1/embeddings work exactly as expected when I change only the backend parameter from vllm to transformers. LocalAI in its current state (i.e. with the vllm backend) also loads the model into memory, but then fails to return a response.
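For comparison, the configuration that works is identical apart from the backend field (reproduced here for reference; the vLLM-specific options are left untouched, exactly as in the failing config):

name: intfloat/e5-mistral-7b-instruct
backend: transformers
parameters:
  model: "intfloat/e5-mistral-7b-instruct"
gpu_memory_utilization: 0.95
max_model_len: 32768
cuda: true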

mudler added the roadmap label Aug 31, 2024
@mudler (Owner) commented Aug 31, 2024

That should be quite straightforward to add - I can confirm that this is currently not supported, as it is not implemented.
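Roughly, the missing piece is wiring the backend's embedding handler to vLLM's embedding API. For illustration only, a sketch using vLLM's offline LLM.encode API (the real LocalAI backend runs an async engine behind gRPC with its own protobuf types, so nothing here is the final implementation; the fix itself landed via #3440):

from vllm import LLM

# Illustrative only: load an embedding-capable model and compute an embedding
# with vLLM. Engine options mirror the config from this issue.
llm = LLM(
    model="intfloat/e5-mistral-7b-instruct",
    gpu_memory_utilization=0.95,
    max_model_len=32768,
)

def embed(text: str) -> list[float]:
    # LLM.encode() runs the model in embedding mode and returns one
    # EmbeddingRequestOutput per prompt; .outputs.embedding is the vector.
    outputs = llm.encode([text])
    return outputs[0].outputs.embedding

vector = embed("The food was delicious and the waiter was friendly.")
print(len(vector))  # embedding dimension (4096 for e5-mistral-7b-instruct)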
