[Bug]: MistralTokenizer Detokenization Issue #8627

Open
1 task done
ywang96 opened this issue Sep 19, 2024 · 7 comments · Fixed by #8640
Labels
bug Something isn't working

Comments

@ywang96
Member

ywang96 commented Sep 19, 2024

Code to repro

from pathlib import Path

from huggingface_hub import snapshot_download
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from vllm import LLM
from vllm.sampling_params import SamplingParams


model_name = "mistralai/Pixtral-12B-2409"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True)

prompt = "這個圖片是什麼"
image_url = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print("vllm: " + outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print("detok: " + tokenizer.decode(outputs[0].outputs[0].token_ids[:-1])) # skip the last token_id = 2

🐛 Describe the bug

When the engine is initialized with tokenizer_mode="mistral", the text output contains decoding errors (� replacement characters) for certain languages. However, decoding the same token ids directly with an initialized MistralTokenizer produces no such errors.

Output from the above code

Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.11s/it, est. speed input: 346.06 toks/s, output: 28.72 toks/s]
vllm: 图片展示了一幅��丽的自然景观,主要是一条������的河流��过一片宁静的草地,周��环��着高耸的岩石����和��木。河流清��见底,水面平静,周��散布着岩石和��色��被。河流两岸的草地上点��着各种��物和��木,营造出宁静的����。背景中的岩石����高大险��,直��云��,增��了场景的宏��感。天空��朗,点��着几��云彩,暗示着一个明亮、��朗的日子。图片中没有明显的文字或人造物品,突出了自然的美丽。整体����宁静而��丽,突显了大自然的宏��和宁静。
(16442, 49395, 60288, 21552, 30841, 117293, 6693, 1174, 62326, 2713, 43090, 79088, 44885, 1625, 125192, 2499, 3087, 17624, 1232, 1156, 1191, 1232, 1156, 1146, 2713, 49563, 45605, 16842, 1191, 5984, 3087, 49395, 109042, 49554, 2713, 87781, 8736, 1625, 22675, 2854, 1180, 105080, 6046, 1149, 9883, 14370, 129695, 2713, 125632, 40801, 24934, 1173, 6693, 1129, 4300, 4901, 1145, 23942, 1320, 49563, 45605, 37202, 53760, 1136, 13594, 26800, 1625, 24777, 8682, 7210, 49554, 1625, 22675, 2854, 1180, 83632, 25120, 9883, 125632, 40801, 4300, 6046, 1191, 26416, 83777, 1141, 24443, 1320, 49563, 45605, 36987, 122890, 2713, 87781, 8736, 4445, 9079, 29532, 1128, 9883, 36283, 14164, 83777, 1141, 16307, 4300, 4901, 1145, 23942, 1625, 121634, 35747, 7059, 109042, 49554, 2713, 7020, 1155, 2854, 1180, 1320, 55022, 79088, 56245, 125632, 40801, 24934, 1173, 6693, 1129, 14370, 5368, 124592, 24934, 1187, 1625, 13334, 19528, 1146, 56212, 26985, 1132, 1625, 44290, 23295, 1187, 4836, 50381, 79088, 2713, 126928, 5596, 1159, 27934, 1320, 6434, 26095, 4343, 1180, 52678, 1625, 9079, 29532, 1128, 9883, 29538, 1632, 1181, 56212, 96037, 1625, 121028, 21552, 9883, 26535, 8560, 88518, 1749, 4343, 1180, 52678, 2713, 1866, 8390, 1320, 16442, 49395, 4392, 16685, 66876, 2713, 121873, 10443, 3405, 35747, 16307, 20353, 1625, 21949, 7059, 4836, 43090, 2713, 8350, 62326, 1320, 60896, 18807, 7020, 1155, 2854, 1180, 109042, 49554, 4262, 6693, 1174, 62326, 1625, 21949, 21802, 4836, 5368, 43090, 2713, 126928, 5596, 1159, 4300, 109042, 49554, 1320, 2)
detok: 图片展示了一幅壮丽的自然景观,主要是一条蜿蜒的河流穿过一片宁静的草地,周围环绕着高耸的岩石峭壁和树木。河流清澈见底,水面平静,周围散布着岩石和绿色植被。河流两岸的草地上点缀着各种植物和树木,营造出宁静的氛围。背景中的岩石峭壁高大险峻,直插云霄,增添了场景的宏伟感。天空晴朗,点缀着几朵云彩,暗示着一个明亮、晴朗的日子。图片中没有明显的文字或人造物品,突出了自然的美丽。整体氛围宁静而壮丽,突显了大自然的宏伟和宁静。
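To make the difference between the two outputs above easier to quantify, here is a minimal sketch that locates runs of the Unicode replacement character (U+FFFD) in a string. The helper name `find_replacement_runs` is hypothetical and not part of vLLM or mistral_common; the sample string is a shortened piece of the garbled output.

```python
REPLACEMENT_CHAR = "\ufffd"  # what a decoder emits for bytes it cannot decode


def find_replacement_runs(text):
    """Return (start_index, length) for each run of consecutive U+FFFD characters."""
    runs = []
    i = 0
    while i < len(text):
        if text[i] == REPLACEMENT_CHAR:
            start = i
            # Advance to the end of this run of replacement characters.
            while i < len(text) and text[i] == REPLACEMENT_CHAR:
                i += 1
            runs.append((start, i - start))
        else:
            i += 1
    return runs


# Shortened sample of the garbled text; the reference text has no U+FFFD.
garbled = "一幅\ufffd\ufffd丽的自然景观"
reference = "一幅壮丽的自然景观"
print(find_replacement_runs(garbled))
print(find_replacement_runs(reference))
```

Running it on `outputs[0].outputs[0].text` and on the `tokenizer.decode(...)` result would show whether only the streamed text is affected.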

@ywang96 ywang96 added the bug Something isn't working label Sep 19, 2024
@ywang96
Member Author

ywang96 commented Sep 19, 2024

cc @patrickvonplaten - I haven't spent much time debugging the cause of this inconsistency. We only found out it's an issue in vLLM because Chatbot Arena informed us about it very recently. It would be great if you could take a look, or share any idea of why this is happening, so we can fix it asap. Thanks!

@vllm-project vllm-project deleted a comment Sep 19, 2024
@patrickvonplaten
Contributor

Hey @ywang96,

Thanks for the ping - checking!

@ywang96
Member Author

ywang96 commented Sep 19, 2024

Just confirmed this also happens with text-only models, so there is indeed something wrong with detokenization in vLLM right now...

model_name = "mistralai/Mistral-Nemo-Instruct-2407"
mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')
mistral_models_path.mkdir(parents=True, exist_ok=True)
snapshot_download(repo_id=model_name, allow_patterns=["tekken.json"], local_dir=mistral_models_path)
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json") # MistralTokenizer

sampling_params = SamplingParams(temperature=0.0, max_tokens=8192)

llm = LLM(model=model_name, tokenizer_mode="mistral", enforce_eager=True, tensor_parallel_size=8)

prompt = "今天天气如何?"
messages = [
    {
        "role": "user",
        "content": prompt,
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text) # vLLM text output
print(outputs[0].outputs[0].token_ids)
print(tokenizer.decode(outputs[0].outputs[0].token_ids[:-1])) 

Output:

很抱歉,我无法提供实时天气信息,因为我是一个文本生成模型,我无法��问实时数据。但是,您可以���索您所在地区的天气��报,或者查看当地的天气应用程序来获取最新的天气信息。
(13440, 81040, 1625, 3621, 13244, 10628, 113521, 6892, 4022, 6434, 35459, 15690, 47424, 1625, 14966, 3621, 2499, 26535, 11449, 5296, 7360, 5862, 86061, 24308, 1625, 3621, 13244, 10628, 5538, 1191, 9915, 6892, 4022, 128593, 1320, 5859, 1625, 48423, 18921, 1230, 6423, 73291, 48423, 5536, 2998, 71867, 2713, 6434, 35459, 12684, 1132, 24549, 1625, 22516, 37706, 9764, 5342, 8736, 2713, 6434, 35459, 34590, 12600, 31479, 55550, 4976, 68826, 32128, 7695, 11795, 2713, 6434, 35459, 15690, 47424, 1320, 2)
很抱歉,我无法提供实时天气信息,因为我是一个文本生成模型,我无法访问实时数据。但是,您可以搜索您所在地区的天气预报,或者查看当地的天气应用程序来获取最新的天气信息。

As far as I can tell, this also happens with Korean (Hangul). I will take a look as well if I have some bandwidth today!
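One likely reason both Chinese and Hangul are affected is that each of these characters takes several bytes in UTF-8, so a byte-level token boundary can fall inside a character. The sketch below uses only the Python standard library; the helper name `truncated_decode` is hypothetical, and cutting off the last byte merely stands in for a token that ends mid-character.

```python
def truncated_decode(ch):
    """Return the UTF-8 byte length of ch and the result of decoding all but its last byte."""
    raw = ch.encode("utf-8")
    # Dropping the last byte imitates a token that ends in the middle of the character.
    partial = raw[:-1].decode("utf-8", errors="replace")
    return len(raw), partial


for ch in ["访", "搜", "한"]:
    n_bytes, partial = truncated_decode(ch)
    print(ch, n_bytes, repr(partial))
```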

@patrickvonplaten
Contributor

Hey @ywang96,

Yes here is a fix: #8640

Essentially, the problem comes from the following:

  • The tokenizer works on UTF-8 bytes, not on whole characters.
  • When you decode token by token on the fly (which is what is done here), a token can end in the middle of a multi-byte character, so you end up decoding an invalid byte sequence. That sequence is converted into the � symbol, and at that point the id is lost. This is very much expected. What we need to do in this case is wait for the next token, because we cannot decode correctly until we know it.

The PR linked above should fix it.
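The "wait for the next token" idea can be illustrated with the Python standard library alone. This is a minimal sketch, not the code from #8640: the chunk boundaries are made-up stand-ins for token boundaries, and `codecs.getincrementaldecoder` is used here only to show buffering of incomplete bytes.

```python
import codecs

text = "访问"
data = text.encode("utf-8")  # 6 bytes, 3 per character

# Hypothetical token boundaries that cut through both characters.
chunks = [data[:2], data[2:4], data[4:]]

# Naive streaming: decode every chunk on its own.
# Incomplete or orphaned bytes are turned into U+FFFD and the original bytes are lost.
naive = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# Buffered streaming: the incremental decoder keeps incomplete bytes
# and only emits a character once the following chunk completes it.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
buffered = "".join(decoder.decode(chunk) for chunk in chunks)
buffered += decoder.decode(b"", final=True)  # flush whatever is still pending

print(repr(naive))
print(buffered)
```

The same principle applies at the token level: hold back text for a token whose bytes do not yet form a complete character, and emit it together with the next token.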

@BabyChouSr

@patrickvonplaten Thank you for your great work! I was using your branch but I hit a weird issue. There seems to be a KeyError when decoding some Chinese characters.

Prompt:
image

Error:
image

@ywang96 ywang96 reopened this Sep 20, 2024
@patrickvonplaten
Contributor

Hey @BabyChouSr,

Can you try again with current "main" and if it still fails can you post a reproducible code snippet here? :-)

@prashantgupta24
Contributor

@patrickvonplaten I have the error reproduced here - #9557
