[Feature]: Support Guided Decoding in LLM entrypoint #3536
Comments
Will work on this. |
Any update on this? |
still working on this, sorry about the delay |
I'm trying to understand why this support is considered missing if you can already do this:

```python
from vllm import LLM, SamplingParams
from outlines.serve.vllm import JSONLogitsProcessor
from pydantic import BaseModel, conlist
import datetime as dt

class Output(BaseModel):
    names: conlist(str, max_length=5)
    organizations: conlist(str, max_length=5)
    locations: conlist(str, max_length=5)
    miscellanous: conlist(str, max_length=5)

llm = LLM('mistralai/Mistral-7B-v0.1', max_model_len=10_000, gpu_memory_utilization=0.9)
logits_processor = JSONLogitsProcessor(schema=Output, llm=llm.llm_engine)
logits_processor.fsm.vocabulary = list(logits_processor.fsm.vocabulary)

prompt = """
Locate all the names, organizations, locations and other miscellaneous entities in the following sentence:
"Charles went and saw Anna at the coffee shop Starbucks, which was based in a small town in Germany called Essen."
"""

sampling_params = SamplingParams(max_tokens=128, temperature=0, logits_processors=[logits_processor])

t0 = dt.datetime.now()
llm.generate([prompt] * 256, sampling_params=sampling_params)
time_elapsed = (dt.datetime.now() - t0).total_seconds()
print(f"Generation took {time_elapsed:,} seconds.")
```

(Example taken from #3087.) Is the point of this issue to make the use of guided decoding more intuitive for the user? |
Yes, just for ease of use, and to provide a better way to reset the processor. |
How do I use this in the OpenAI API ([OpenAI Compatible Server])? |
Please use the extra parameters listed here: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters |
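For anyone landing here later, a minimal sketch of what passing those extra parameters through the OpenAI Python client could look like; the base URL, model name, and schema below are placeholders assumed for illustration, not values from this thread:

```python
# Sketch only: server URL, model name, and schema are assumed placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Describe a fictional person as JSON."}],
    # vLLM-specific extra parameters go in extra_body; guided_regex,
    # guided_choice, and guided_grammar are passed the same way.
    extra_body={"guided_json": schema},
)
print(completion.choices[0].message.content)
```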
Is there a version requirement? Also, is there a complete call example? Thank you. |
Sorry, I don't have time to come up with a full example. You can take a look at the tests (e.g. …). As for the version limit, you can set the version of the docs via the bottom-right menu and find the earliest one that documents these parameters. |
OK, thanks. Input: |
How are you accessing the endpoint? |
Yes, my endpoint is the vLLM OpenAI server, like: xxxx/v1/chat/completions |
Can you show your code? |
I'm having the same issue with it. I'm using 0.6.4post1 |
You made a typo. The field is called |
🚀 The feature, motivation and pitch
Currently we support guided decoding (JSON, regex, choice, grammar, and arbitrary JSON) in the OpenAI-compatible inference server. It would be great to expose the same functionality in the offline interface as well.
https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters
Concretely, this would mean adding the support here as a new parameter to the generate call, using the methods introduced in #2819.
Do make sure to add tests and examples.
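For illustration, a rough sketch of what the requested offline interface might eventually look like; the `GuidedDecodingParams` class and the `guided_decoding` field on `SamplingParams` are assumptions made for this example, not a design decided in this issue:

```python
# Hypothetical sketch of the requested offline API; the names below are
# assumptions, not a confirmed vLLM interface.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams  # assumed module/name

llm = LLM("mistralai/Mistral-7B-v0.1")

# Constrain output to one of a fixed set of choices, mirroring the
# guided_choice extra parameter of the OpenAI-compatible server.
params = SamplingParams(
    temperature=0,
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)

outputs = llm.generate(["The sentiment of 'I love this movie' is"], params)
print(outputs[0].outputs[0].text)
```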
Alternatives
No response
Additional context
No response