Support guided decoding with vllm and remote::vllm #391
Comments
Hi @ashwinb, @yanxi0830 - I came across this issue and I'm interested in contributing. Do you mind if I take a stab? I've reviewed the contributing guidelines and the existing codebase, and I believe I have a good understanding of what needs to be done. However, if there are any specific details or guidelines you think I should be aware of, please lmk.
@aidando73 Please go ahead. I remember seeing a PR about this (but it was not quite on the correct track) so I think you are good to go. Re: guidelines, consider the […]. Finally, when everything is ready, you'd need to run the tests inside […]
Thanks @ashwinb 🙏, will get started
@aidando73 Thanks! Please feel free to tag me for review once it's ready.
Ok @ashwinb @terrytangyuan PR is ready for review: #528 👈 . Lmk if I've missed anything.
# What does this PR do?

Addresses issue #391:

- Adds JSON structured output for vLLM
- Enables structured output tests for vLLM

Example prompt:

> Give me a recipe for Spaghetti Bolognaise:

```json
{
  "recipe_name": "Spaghetti Bolognaise",
  "preamble": "Ah, spaghetti bolognaise - the quintessential Italian dish that fills my kitchen with the aromas of childhood nostalgia. As a child, I would watch my nonna cook up a big pot of spaghetti bolognaise every Sunday, filling our small Italian household with the savory scent of simmering meat and tomatoes. The way the sauce would thicken and the spaghetti would al dente - it was love at first bite. And now, as a chef, I want to share that same love with you, so you can recreate these warm, comforting memories at home.",
  "ingredients": [
    "500g minced beef",
    "1 medium onion, finely chopped",
    "2 cloves garlic, minced",
    "1 carrot, finely chopped",
    " celery, finely chopped",
    "1 (28 oz) can whole peeled tomatoes",
    "1 tbsp tomato paste",
    "1 tsp dried basil",
    "1 tsp dried oregano",
    "1 tsp salt",
    "1/2 tsp black pepper",
    "1/2 tsp sugar",
    "1 lb spaghetti",
    "Grated Parmesan cheese, for serving",
    "Extra virgin olive oil, for serving"
  ],
  "steps": [
    "Heat a large pot over medium heat and add a generous drizzle of extra virgin olive oil.",
    "Add the chopped onion, garlic, carrot, and celery and cook until the vegetables are soft and translucent, about 5-7 minutes.",
    "Add the minced beef and cook until browned, breaking it up with a spoon as it cooks.",
    "Add the tomato paste and cook for 1-2 minutes, stirring constantly.",
    "Add the canned tomatoes, dried basil, dried oregano, salt, black pepper, and sugar. Stir well to combine.",
    "Bring the sauce to a simmer and let it cook for 20-30 minutes, stirring occasionally, until the sauce has thickened and the flavors have melded together.",
    "While the sauce cooks, bring a large pot of salted water to a boil and cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.",
    "Add the reserved pasta water to the sauce and stir to combine.",
    "Combine the cooked spaghetti and sauce, tossing to coat the pasta evenly.",
    "Serve hot, topped with grated Parmesan cheese and a drizzle of extra virgin olive oil.",
    "Enjoy!"
  ]
}
```

Generated with the Llama-3.2-3B-Instruct model - pretty good for a 3B parameter model 👍

## Test Plan

`pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -k llama_3b-vllm_remote`

With the following setup:

```bash
# Environment
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export INFERENCE_PORT=8000
export VLLM_URL=http://localhost:8000/v1

# vLLM server
sudo docker run --gpus all \
  -v $STORAGE_DIR/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)" \
  -p 8000:$INFERENCE_PORT \
  --ipc=host \
  --net=host \
  vllm/vllm-openai:v0.6.3.post1 \
  --model $INFERENCE_MODEL

# llama-stack server
llama stack build --template remote-vllm --image-type conda && \
llama stack run distributions/remote-vllm/run.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```

Results:

```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-vllm_remote] PASSED

================================ 6 passed, 2 skipped, 120 deselected, 2 warnings in 13.26s ================================
```

## Sources

- vllm-project/vllm#8300
- By default, vLLM uses https://github.com/dottxt-ai/outlines for structured outputs [[1](https://github.com/vllm-project/vllm/blob/32e7db25365415841ebc7c4215851743fbb1bad1/vllm/engine/arg_utils.py#L279-L280)]

## Before submitting

- [N/A] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case)
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [N/A?] Updated relevant documentation. Couldn't find any relevant documentation; lmk if I've missed anything.
- [x] Wrote necessary unit or integration tests.
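For readers who want to see what the structured-output request looks like on the wire: the sketch below builds a chat-completion payload for vLLM's OpenAI-compatible server with a `guided_json` field carrying the schema. This is a hedged illustration, not code from the PR: the schema shape is a guess based on the recipe output above, and the exact extension parameter name (`guided_json` here) may differ between vLLM versions, so check your server's docs before relying on it.

```python
import json

# Hypothetical schema mirroring the recipe output shown above.
recipe_schema = {
    "type": "object",
    "properties": {
        "recipe_name": {"type": "string"},
        "preamble": {"type": "string"},
        "ingredients": {"type": "array", "items": {"type": "string"}},
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["recipe_name", "ingredients", "steps"],
}

# Request body for POST {VLLM_URL}/chat/completions.
# `guided_json` is a vLLM-specific extension to the OpenAI schema
# (assumed name; verify against your vLLM version).
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
        {"role": "user", "content": "Give me a recipe for Spaghetti Bolognaise:"}
    ],
    "guided_json": recipe_schema,
}

print(json.dumps(payload, indent=2))
```

With this constraint in place, the server's decoder only samples tokens that keep the completion a valid instance of `recipe_schema`, which is why the test plan above can assert on parsed fields rather than retrying malformed output.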
We can close this?
Yes I think so |
🚀 The feature, motivation and pitch
Several providers (fireworks, together, meta-reference) support guided decoding (e.g. specifying a JSON schema as a "grammar" for decoding) with inference. vLLM supports this functionality as well -- enable that in the API.
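To make the pitch concrete, here is a minimal stdlib-only sketch of what a schema-as-grammar guarantee buys the client: because decoding is constrained to the schema, the raw completion is guaranteed to parse as JSON and carry the required fields, so no repair/retry loop is needed. The completion string below is a hypothetical stand-in for guided model output, not real inference.

```python
import json

# Required top-level keys of the (hypothetical) schema used for guidance.
schema_required = ["recipe_name", "ingredients", "steps"]

# Stand-in for a completion produced under guided decoding.
completion = (
    '{"recipe_name": "Spaghetti Bolognaise",'
    ' "ingredients": ["500g minced beef"],'
    ' "steps": ["Brown the beef."]}'
)

# Guided decoding guarantees this parses and that required keys exist;
# with unconstrained sampling, either step could fail.
data = json.loads(completion)
missing = [k for k in schema_required if k not in data]
assert not missing, f"missing keys: {missing}"
print(data["recipe_name"])
```

Without guidance, client code typically has to catch `json.JSONDecodeError` and re-prompt; pushing the constraint into the decoder removes that failure mode entirely.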
Alternatives
No alternatives, this is a core feature that must be supported by all providers (as far as possible).
Additional context
No response