[Bug]: MistralTokenizer Detokenization Issue #8627

ywang96 · 2024-09-19T09:52:33Z

Your current environment

Model Input Dumps

Code to repro

from pathlib import Path

from huggingface_hub import snapshot_download
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from vllm import LLM
from vllm.sampling_params import SamplingParams


model_name = "mistralai/Pixtral-12B-2409"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True)

prompt = "這個圖片是什麼"
image_url = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print("vllm: " + outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print("detok: " + tokenizer.decode(outputs[0].outputs[0].token_ids[:-1])) # skip the last token_id = 2

🐛 Describe the bug

When the engine is initialized with tokenizer_mode="mistral", the generated text contains encoding errors (� replacement characters) for certain languages. However, when a separately initialized MistralTokenizer is used to decode the same token ids directly, there is no such issue.

Output from the above code

Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.11s/it, est. speed input: 346.06 toks/s, output: 28.72 toks/s]
vllm: 图片展示了一幅��丽的自然景观，主要是一条������的河流��过一片宁静的草地，周��环��着高耸的岩石����和��木。河流清��见底，水面平静，周��散布着岩石和��色��被。河流两岸的草地上点��着各种��物和��木，营造出宁静的����。背景中的岩石����高大险��，直��云��，增��了场景的宏��感。天空��朗，点��着几��云彩，暗示着一个明亮、��朗的日子。图片中没有明显的文字或人造物品，突出了自然的美丽。整体����宁静而��丽，突显了大自然的宏��和宁静。
(16442, 49395, 60288, 21552, 30841, 117293, 6693, 1174, 62326, 2713, 43090, 79088, 44885, 1625, 125192, 2499, 3087, 17624, 1232, 1156, 1191, 1232, 1156, 1146, 2713, 49563, 45605, 16842, 1191, 5984, 3087, 49395, 109042, 49554, 2713, 87781, 8736, 1625, 22675, 2854, 1180, 105080, 6046, 1149, 9883, 14370, 129695, 2713, 125632, 40801, 24934, 1173, 6693, 1129, 4300, 4901, 1145, 23942, 1320, 49563, 45605, 37202, 53760, 1136, 13594, 26800, 1625, 24777, 8682, 7210, 49554, 1625, 22675, 2854, 1180, 83632, 25120, 9883, 125632, 40801, 4300, 6046, 1191, 26416, 83777, 1141, 24443, 1320, 49563, 45605, 36987, 122890, 2713, 87781, 8736, 4445, 9079, 29532, 1128, 9883, 36283, 14164, 83777, 1141, 16307, 4300, 4901, 1145, 23942, 1625, 121634, 35747, 7059, 109042, 49554, 2713, 7020, 1155, 2854, 1180, 1320, 55022, 79088, 56245, 125632, 40801, 24934, 1173, 6693, 1129, 14370, 5368, 124592, 24934, 1187, 1625, 13334, 19528, 1146, 56212, 26985, 1132, 1625, 44290, 23295, 1187, 4836, 50381, 79088, 2713, 126928, 5596, 1159, 27934, 1320, 6434, 26095, 4343, 1180, 52678, 1625, 9079, 29532, 1128, 9883, 29538, 1632, 1181, 56212, 96037, 1625, 121028, 21552, 9883, 26535, 8560, 88518, 1749, 4343, 1180, 52678, 2713, 1866, 8390, 1320, 16442, 49395, 4392, 16685, 66876, 2713, 121873, 10443, 3405, 35747, 16307, 20353, 1625, 21949, 7059, 4836, 43090, 2713, 8350, 62326, 1320, 60896, 18807, 7020, 1155, 2854, 1180, 109042, 49554, 4262, 6693, 1174, 62326, 1625, 21949, 21802, 4836, 5368, 43090, 2713, 126928, 5596, 1159, 4300, 109042, 49554, 1320, 2)
detok: 图片展示了一幅壮丽的自然景观，主要是一条蜿蜒的河流穿过一片宁静的草地，周围环绕着高耸的岩石峭壁和树木。河流清澈见底，水面平静，周围散布着岩石和绿色植被。河流两岸的草地上点缀着各种植物和树木，营造出宁静的氛围。背景中的岩石峭壁高大险峻，直插云霄，增添了场景的宏伟感。天空晴朗，点缀着几朵云彩，暗示着一个明亮、晴朗的日子。图片中没有明显的文字或人造物品，突出了自然的美丽。整体氛围宁静而壮丽，突显了大自然的宏伟和宁静。

ywang96 · 2024-09-19T09:55:52Z

cc @patrickvonplaten - I haven't spent much time debugging why this inconsistency occurs; I only found out it's an issue in vLLM because Chatbot Arena informed us about it very recently. It would be great if you could take a look, or share any idea of why this is happening, so we can fix it asap. Thanks!

patrickvonplaten · 2024-09-19T11:27:03Z

Hey @ywang96,

Thanks for the ping - checking!

ywang96 · 2024-09-19T16:32:45Z

Just confirmed this is also happening on text-only models, so there's indeed something wrong with detokenization in vLLM right now...

model_name = "mistralai/Mistral-Nemo-Instruct-2407"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True, tensor_parallel_size=8)

prompt = "今天天气如何？"
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print(tokenizer.decode(outputs[0].outputs[0].token_ids[:-1]))

Output:

很抱歉，我无法提供实时天气信息，因为我是一个文本生成模型，我无法��问实时数据。但是，您可以���索您所在地区的天气��报，或者查看当地的天气应用程序来获取最新的天气信息。
(13440, 81040, 1625, 3621, 13244, 10628, 113521, 6892, 4022, 6434, 35459, 15690, 47424, 1625, 14966, 3621, 2499, 26535, 11449, 5296, 7360, 5862, 86061, 24308, 1625, 3621, 13244, 10628, 5538, 1191, 9915, 6892, 4022, 128593, 1320, 5859, 1625, 48423, 18921, 1230, 6423, 73291, 48423, 5536, 2998, 71867, 2713, 6434, 35459, 12684, 1132, 24549, 1625, 22516, 37706, 9764, 5342, 8736, 2713, 6434, 35459, 34590, 12600, 31479, 55550, 4976, 68826, 32128, 7695, 11795, 2713, 6434, 35459, 15690, 47424, 1320, 2)
很抱歉，我无法提供实时天气信息，因为我是一个文本生成模型，我无法访问实时数据。但是，您可以搜索您所在地区的天气预报，或者查看当地的天气应用程序来获取最新的天气信息。

As far as I can tell, this is happening to Korean/Hangul too. I will take a look at it too if I have some bandwidth today!

patrickvonplaten · 2024-09-19T17:44:39Z

Hey @ywang96,

Yes here is a fix: #8640

Essentially, the problem comes from the following:

The tokenizer works on Unicode bytes.
When you decode token-by-token on the fly (which is done here), it can happen that you're decoding an invalid Unicode sequence. This is then converted into the � symbol, and at that point the id is lost. This is actually very much expected - what we need to do in this case is wait for the next token, because we need to know the next token before we can decode correctly.

The PR linked above should fix it.
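
To illustrate the explanation above, here is a minimal sketch using only the Python standard library. It is not the code from #8640 and does not use vLLM or mistral_common; the byte `pieces` are hypothetical stand-ins for byte-level tokens whose boundaries fall in the middle of a multi-byte character. Decoding each piece on its own yields �, while holding incomplete bytes back until the next piece arrives recovers the text.

```python
import codecs

# Hypothetical token pieces: the UTF-8 bytes of two Chinese characters,
# split mid-character. These are NOT real tekken vocabulary entries.
text = "壮丽"
raw = text.encode("utf-8")            # 6 bytes, 3 per character
pieces = [raw[:2], raw[2:4], raw[4:]]


def decode_each_piece(pieces):
    """Decode every piece independently; partial characters become U+FFFD."""
    return "".join(p.decode("utf-8", errors="replace") for p in pieces)


def decode_with_buffer(pieces):
    """Hold back incomplete byte sequences until the next piece arrives."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(p) for p in pieces]
    # Flush whatever is left once the stream ends.
    out.append(decoder.decode(b"", final=True))
    return "".join(out)


naive = decode_each_piece(pieces)
buffered = decode_with_buffer(pieces)
print(repr(naive))     # contains replacement characters
print(repr(buffered))  # matches the original text
```

The same idea applies when streaming model output: a piece that ends in an incomplete character should not be emitted until enough following bytes are available to complete it.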

BabyChouSr · 2024-09-20T17:31:50Z

@patrickvonplaten Thank you for your great work! I was using your branch but I hit a weird issue. There seems to be a KeyError when decoding some Chinese characters.

Prompt:

Error:

patrickvonplaten · 2024-09-21T00:06:59Z

Hey @BabyChouSr,

Can you try again with current "main" and if it still fails can you post a reproducible code snippet here? :-)

prashantgupta24 · 2024-10-21T17:37:20Z

@patrickvonplaten I have the error reproduced here - #9557

ywang96 added the "bug" (Something isn't working) label Sep 19, 2024

vllm-project deleted a comment Sep 19, 2024

patrickvonplaten mentioned this issue Sep 19, 2024

[Bugfix][Core] Fix tekken edge case for mistral tokenizer #8640

Merged

simon-mo closed this as completed in #8640 Sep 20, 2024

ywang96 reopened this Sep 20, 2024

prashantgupta24 mentioned this issue Oct 21, 2024

[Bug]: MistralTokenizer Detokenization Issue #9557

Closed

1 task
