[Bug]: MistralTokenizer Detokenization Issue #8627
Comments
cc @patrickvonplaten - I haven't spent much time debugging why there's such an inconsistency; we only found out it's an issue on vLLM because Chatbot Arena informed us about it very recently. It would be great if you could take a look, or share any idea why this is happening, so we can fix it ASAP. Thanks!
Hey @ywang96, Thanks for the ping - checking!
Just confirmed this is happening on text-only models, so there's indeed something wrong with the detokenization on vLLM now...

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from vllm import LLM, SamplingParams

model_name = "mistralai/Mistral-Nemo-Instruct-2407"

# Download only the Tekken tokenizer file so we can decode independently of vLLM
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")  # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)
llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True, tensor_parallel_size=8)

prompt = "今天天气如何?"  # "How's the weather today?"
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)       # vLLM text output (garbled for some languages)
print(outputs[0].outputs[0].token_ids)
print(tokenizer.decode(outputs[0].outputs[0].token_ids[:-1]))  # decoding directly is fine
```

Output:
As far as I can tell, this is happening with Korean/Hangul too. I will take a look at it as well if I have some bandwidth today!
Hey @ywang96, yes, here is a fix: #8640. Essentially, the problem comes from the following:
The PR linked above should fix it.
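For context, here is a minimal sketch of this general class of bug, under my assumption about the mechanism (not necessarily the exact fix in #8640): with a byte-level tokenizer like Tekken, a single multi-byte UTF-8 character can be split across several tokens, so decoding each token to a string on its own and concatenating produces U+FFFD replacement characters, while decoding the accumulated sequence once is correct.

```python
# Hypothetical illustration: one CJK character split across byte-level "tokens".
text = "今"                      # one character, three UTF-8 bytes
data = text.encode("utf-8")      # b'\xe4\xbb\x8a'
chunks = [data[:1], data[1:2], data[2:]]  # pretend each byte is its own token

# Buggy incremental detokenization: decode each chunk independently.
piecewise = "".join(c.decode("utf-8", errors="replace") for c in chunks)
print(piecewise)                 # '���' - replacement characters

# Correct: accumulate the bytes and decode the whole sequence at once.
print(b"".join(chunks).decode("utf-8"))  # '今'
```

This also explains why only certain languages are affected: ASCII characters fit in a single byte, whereas Chinese and Korean characters span multiple bytes and are therefore more likely to straddle token boundaries.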
@patrickvonplaten Thank you for your great work! I was using your branch but I hit a weird issue. There seems to be a …
Hey @BabyChouSr, can you try again with the current "main", and if it still fails, can you post a reproducible code snippet here? :-)
@patrickvonplaten I have the error reproduced here - #9557 |
Your current environment
The output of `python collect_env.py`
Model Input Dumps
Code to repro
🐛 Describe the bug
When the engine is initialized with `tokenizer_mode="mistral"`, there's an encoding error for certain languages. However, when using the initialized `MistralTokenizer` to decode the token IDs directly, there's no such issue.

Output from the above code
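One way to pin the discrepancy down is a direct comparison of the two decoding paths; a minimal sketch, assuming the `llm`, `tokenizer`, `messages`, and `sampling_params` objects from the repro snippet above:

```python
# Compare vLLM's detokenized text against a direct decode of the same token IDs.
outputs = llm.chat(messages, sampling_params=sampling_params)
vllm_text = outputs[0].outputs[0].text
token_ids = list(outputs[0].outputs[0].token_ids)

# Drop the trailing EOS token before decoding, as in the repro snippet.
direct_text = tokenizer.decode(token_ids[:-1])

assert vllm_text == direct_text, (
    f"detokenization mismatch:\n  vLLM:   {vllm_text!r}\n  direct: {direct_text!r}"
)
```

With an affected language such as Chinese or Korean, the assertion fails: the vLLM text contains replacement characters while the direct decode is clean.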