Context Free Grammar Constrained Decoding (ebnf interface, compatible with llama-cpp) #27557
base: main
Conversation
I think it is a great idea to be compatible with llama.cpp! |
Hey @Saibo-creator, I tried running it and got:

```
Traceback (most recent call last):
  File "/home/user/grammar.py", line 657, in <module>
    state = parse_ebnf(input_text)
  File "/home/user/grammar.py", line 249, in parse_ebnf
    grammar_repr = parse_rule(state, grammar_repr)
  File "/home/user/grammar.py", line 231, in parse_rule
    pos = parse_alternates(state, pos, name, rule_id, False)
  File "/home/user/grammar.py", line 212, in parse_alternates
    while pos[0] == "|":
IndexError: string index out of range
```
|
Hello @abhinavkulkarni, try

instead of

Let me know if this doesn't work. |
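For reference, here is a minimal sketch of that kind of fix, assuming (as the TODO list at the bottom of this PR notes) that the parser requires every rule, including the last one, to be terminated by a newline; the class and module path follow the later examples in this thread:

```python
from transformers import AutoTokenizer
from transformers.generation.grammar_utils import IncrementalGrammarConstraint

tokenizer = AutoTokenizer.from_pretrained("gpt2")

with open("./json.gbnf", "r") as f:
    grammar_str = f.read()

# The EBNF parser reads each rule up to a terminating newline; a grammar file
# whose last rule has no trailing newline triggers the IndexError shown above.
if not grammar_str.endswith("\n"):
    grammar_str += "\n"

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
```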
Thanks @Saibo-creator, that works. I have the following piece of code:

```python
from transformers import AutoModelForCausalLM, LlamaTokenizerFast, TextStreamer
# IncrementalGrammarAcceptor and GrammarConstrainedLogitsProcessor come from this PR's branch

model_id = "TheBloke/zephyr-7B-alpha-AWQ"
tokenizer = LlamaTokenizerFast.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

with open("./json.gbnf", "r") as file:
    grammar_str = file.read()

grammar = IncrementalGrammarAcceptor(grammar_str, "root", tokenizer)
logits_processor = GrammarConstrainedLogitsProcessor(grammar, batch_size=2, num_beams=1)

prompt = f'''What is the difference between nuclear fusion and fission?
###Response:'''
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()

output = model.generate(
    inputs=input_ids,
    # do_sample=True,
    # temperature=0.7,
    # top_p=0.15,
    # top_k=0,
    max_new_tokens=512,
    repetition_penalty=1.1,
    eos_token_id=tokenizer.eos_token_id,
    logits_processor=[logits_processor],
    streamer=streamer)
```

I get a response that starts with:

but then continues to output

Please note, if I don't specify custom
|
Thank you for testing! I will look into this issue! By the way, I just integrated the grammar directly into the `generate()` API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.grammar_utils import IncrementalGrammarConstraint

if __name__ == '__main__':
    torch.manual_seed(2)

    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_id)

    with open("examples/grammars/json.gbnf", "r") as file:
        grammar_str = file.read()
    grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)

    prefix1 = "This is a valid json string for email:"
    prefix2 = "This is a valid json string for shopping cart:"
    input_ids = tokenizer([prefix1, prefix2], add_special_tokens=False, return_tensors="pt", padding=True)["input_ids"]

    output = model.generate(input_ids, do_sample=False, max_length=30, num_beams=2, grammar=grammar,
                            num_return_sequences=2)

    # decode output
    generations = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(generations)
    """
    'This is a valid json string for email:{ "title": "Theory", "text": "Theory", "type": "text", "text": "Theory", "type',
    'This is a valid json string for shopping cart:{ "name": "MyCart", "price": "10", "price": "10", "price": "10", "price": "'
    """
```

If you have time, could you try to call via the above API and confirm whether the problem remains? For GPT2 it works as expected, so this may be related to the specific implementation of the llama tokenizer. I will try to fix it asap.
|
For the prompt:

I do get a JSON-looking response, but then the model continues to output newlines until it hits the token limit:

|
@abhinavkulkarni In the json grammar, we have

and the last

This may not be a desired behavior, so I removed that

But it does surprise me that the model didn't pick
|
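To make the point concrete, here is a hypothetical pair of root rules (not the exact json.gbnf from this PR) showing why a trailing whitespace rule matters:

```python
# Hypothetical GBNF fragments for illustration only; the real json.gbnf differs.

# With a trailing `ws`, the grammar keeps accepting whitespace/newlines after the
# closing brace, so constrained generation can run on until the token limit:
root_with_trailing_ws = 'root ::= object ws'

# Without it, nothing may legally follow the closing brace, so once the object is
# complete the only grammar-consistent move is to stop (emit EOS):
root_without_trailing_ws = 'root ::= object'
```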
@Saibo-creator: Thanks for the changes. Removing the whitespace

For the simple prompt,
A few points:
|
Regarding resetting the state of the grammar processor, here is my consideration: currently the

And if the user wants to start a new generation, a new instance of
I don't get this point though.
Does this sound reasonable to you? Regarding the design choice to put the parsing state inside the logits processor, see the example below:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, set_seed
from transformers.generation.grammar_utils import IncrementalGrammarConstraint
from transformers.generation.logits_process import GrammarConstrainedLogitsProcessor

if __name__ == '__main__':
    import logging
    logging.getLogger("transformers.generation").setLevel(logging.INFO)

    # model_id = "saibo/llama-1B"
    model_id = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_id)

    with open("examples/grammars/json.gbnf", "r") as file:
        grammar_str = file.read()
    grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)

    prefix1 = "This is a valid json string for email:"
    prefix2 = "This is a valid json string for shopping cart:"
    input_ids = tokenizer([prefix2], add_special_tokens=False, return_tensors="pt", padding=True)["input_ids"]

    logits_processor = GrammarConstrainedLogitsProcessor(grammar)

    ###################################################
    # Generation under the grammar constraint for 10 tokens
    ###################################################
    output = model.generate(input_ids, do_sample=False, max_new_tokens=10, num_beams=2,
                            logits_processor=[logits_processor],
                            num_return_sequences=1, repetition_penalty=1.5)
    generations = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(generations)
    # 'This is a valid json string for shopping cart:{ "name": "MyCart", "price'

    ###################################################
    # Continue the generation under the same constraint for 10 more tokens
    #
    # 1. Use the output of the previous generation as the input for the next generation
    # 2. Reuse the same logits_processor because the parser state is stored in the logits_processor
    ###################################################
    output = model.generate(output[0].unsqueeze(0), do_sample=False, max_new_tokens=10, num_beams=2,
                            logits_processor=[logits_processor],
                            num_return_sequences=1, repetition_penalty=1.5)
    generations = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(generations)
    # 'This is a valid json string for shopping cart:{ "name": "MyCart", "price": "10", "description": "MyCart'

    ###################################################
    # We want to generate another valid json string
    #
    # 1. Create a new logits_processor with an empty parser state
    # 2. Use the same prompt as the input
    ###################################################
    logits_processor = GrammarConstrainedLogitsProcessor(grammar)
    output = model.generate(input_ids, do_sample=True, max_new_tokens=20, num_beams=2,
                            logits_processor=[logits_processor],
                            num_return_sequences=1, repetition_penalty=1.5)
    generations = tokenizer.batch_decode(output, skip_special_tokens=True)
    print(generations)
    # 'This is a valid json string for shopping cart:{ "name": "MyCart", "price": "10", "description": "MyCart'
```
|
Thanks @Saibo-creator, it makes sense not to reset the grammar state so that the user can continue the generation. One more minor correction: the rule for generating

instead of

|
Looks good!
Can we add a python gbnf file too? We can take inspiration from: https://github.com/ggerganov/llama.cpp/blob/master/grammars/c.gbnf |
This should be related to the constrained decoding in Picard and Synchromesh. |
Hello @gante @ArthurZucker I'm excited to share that the feature is now in great shape, and I'm eager to hear your thoughts on it. The implementation of the grammar-constrained decoding feature is quite complex, as we aim to make it compatible with
It's relatively straightforward to integrate it with greedy search or greedy sampling. This leads me to my first question: Should we break down this feature into multiple versions, starting with a basic one, or would it be better to aim for a comprehensive solution in a single merge? From my perspective, once we have thoroughly tested greedy decoding and greedy sampling, it might be beneficial to merge them first, as they already cater to a wide range of use cases. Additionally, I'm facing some challenges in devising tests for this feature. Currently, I have a setup similar to what's outlined here, where I create simple grammars and verify the accuracy of the generation. However, establishing a systematic testing approach is tricky. For example, if we want to test the json grammar compatibility with all models, running the model with actual weights becomes necessary. Without the weights, the model might generate nonsensical but syntactically correct json outputs, which doesn't help in effective testing. While using actual weights does lead to valid json generation, it significantly slows down the process. I'd appreciate your insights on how to navigate these testing challenges. In the meantime, I'll continue refining the feature. : ) |
2. Remove `grammar` from the generation argument list, use `GrammarLogitsProcessor` instead
@Saibo-creator I suggest we break down this feature into multiple versions, starting with a basic one. This creates motivation and encourages more people to collaborate; a greedy search for JSON sounds good for a start. |
Thank you @arshadshk for the feedback. I agree with you! In terms of greedy search and random sampling-based decoding, this feature should already be solid enough. And indeed json is the most popular use case for this feature, so we can add Unicode support a bit later. Now I'm working on crafting tests. It's a bit challenging to write tests for this feature. For example, I really want to have a TestMixin that tries to test every model to generate json objects. But as I explained above, this seems non-trivial. I will start with more atomic tests like this |
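As an illustration of what such an atomic test might look like, here is a hypothetical sketch using a tiny hand-written grammar (the grammar, test name, and assertion are assumptions, not this PR's actual test suite):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.grammar_utils import IncrementalGrammarConstraint
from transformers.generation.logits_process import GrammarConstrainedLogitsProcessor


def test_generation_stays_inside_tiny_grammar():
    # Tiny grammar accepting only nested pairs of parentheses, e.g. "()", "(())", ...
    grammar_str = 'root ::= "(" root ")" | "()"\n'

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
    processor = GrammarConstrainedLogitsProcessor(grammar)

    input_ids = tokenizer("Some parentheses:", add_special_tokens=False,
                          return_tensors="pt").input_ids
    output = model.generate(input_ids, do_sample=False, max_new_tokens=8,
                            logits_processor=[processor])

    continuation = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
    # Whatever the model prefers, the constrained continuation must only contain
    # characters allowed by the grammar, regardless of the model's weights.
    assert set(continuation) <= {"(", ")"}
```

A test of this shape checks the constraint mechanism itself rather than output quality, so it stays fast and does not depend on the model having learned the target format.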
btw, @arshadshk, if you have time, could you also have a look at #27676? That PR tries to fix a bug which is important for this CFG feature to work properly. Thanks! |
@Saibo-creator the #27676 fix makes sense. I wonder if we open up probs for |
Hi @Saibo-creator 👋 It's great to see a project with a working example! I'd love to add it to

My suggestion: let's add the code as is under

P.S.: as a research project, you'd be able to make any changes you want with pretty much no barriers on our side ;) |
Sounds great! Thank you @gante ! |
@Saibo-creator sounds great! And don't let my conservative approach to your suggestions lower your enthusiasm, I'm enjoying your contributions :D |
I have tested the code in this PR and found that it works very nicely, so I borrowed it for my repository (with due credits): oobabooga/text-generation-webui#4953 It is more robust than the torch-grammar EBNF implementation that I was previously using, which would throw an error half the time while importing a seemingly valid grammar. Being able to generate structured output like json and lists for a given prompt has many practical applications, and this logits processor makes that easy to set up, so I find it extremely valuable. |
Thanks @oobabooga, I was also able to test it successfully in HuggingFace TGI. It does work very well. |
I set this up on FastAPI and it only returns a result once. |
Could you give a working example to show the problem? I would be happy to investigate it. |
Since Transformers will not merge this in the near future, I have written a small extension library. Its use is very straightforward: https://github.com/Saibo-creator/transformers-CFG/tree/main |
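A rough usage sketch of that extension library, assuming its API mirrors the one in this PR (the `transformers_cfg` import paths below are assumptions; check the repository's README for the exact names):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# Assumed module layout, mirroring this PR's transformers.generation.* paths:
from transformers_cfg.grammar_utils import IncrementalGrammarConstraint
from transformers_cfg.generation.logits_process import GrammarConstrainedLogitsProcessor

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

with open("examples/grammars/json.gbnf", "r") as f:
    grammar_str = f.read()

grammar = IncrementalGrammarConstraint(grammar_str, "root", tokenizer)
processor = GrammarConstrainedLogitsProcessor(grammar)

prompt_ids = tokenizer("This is a valid json string for a shopping cart:",
                       return_tensors="pt").input_ids
output = model.generate(prompt_ids, max_new_tokens=30, logits_processor=[processor])
print(tokenizer.decode(output[0], skip_special_tokens=True))
```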
Hey! This is an interesting use case and I'm working on it. Will keep you updated. |
What does this PR do?
This PR adds a new feature (Context Free Grammar Constrained Decoding) to the library.
There is already a WIP PR for this feature (#26520), but this one has a different motivation and implementation.
This implementation is inspired by and adapted from https://github.com/Shopify/torch-grammar and https://github.com/ggerganov/llama.cpp/pull/1773/files
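At a high level, the approach plugs into `generate()` as a logits processor that, at every decoding step, masks out the tokens the grammar's incremental parser cannot accept next. The sketch below only illustrates that idea and is not this PR's implementation; `accepted_token_ids` is a hypothetical helper standing in for the incremental parser.

```python
import torch
from transformers import LogitsProcessor


class GrammarMaskSketch(LogitsProcessor):
    """Illustrative only: keep logits of grammar-valid next tokens, mask the rest."""

    def __init__(self, grammar_constraint):
        # The constraint object is assumed to hold the incremental parsing state.
        self.grammar_constraint = grammar_constraint

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        mask = torch.full_like(scores, float("-inf"))
        for batch_idx in range(scores.shape[0]):
            # Hypothetical helper: given the tokens generated so far, return the
            # token ids that keep the sequence inside the grammar's language.
            allowed = self.grammar_constraint.accepted_token_ids(input_ids[batch_idx])
            mask[batch_idx, list(allowed)] = 0.0
        return scores + mask
```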
This implementation aims to achieve the following goals:
The two main differences from PR #26520 :
Challenges for this PR:
Current status:
TODO:
- `beam=1`
- grammar parser fails to parse llama-cpp's json grammar (more precisely the string line). Currently, a slightly simplified version of the json grammar is used (now fixed)
- grammar parser requires the last rule to end with a newline, otherwise a parsing error will be raised. This is not user-friendly and should be fixed
- `beam_search` and `beam_sample` (now throws `RuntimeError: probability tensor contains either inf, nan or element < 0`). A good reference is the `ConstrainedBeamSearchScorer`
Fixes #25778
Related to PR #26520
Who can review?
@gante
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.