
AutoTokenizer Encode Error #23671

Closed
2 of 4 tasks
congyingxia opened this issue May 22, 2023 · 2 comments

Comments

congyingxia commented May 22, 2023

System Info

For the LlamaTokenizer, I get the correct encoding result when loading directly through LlamaTokenizer, but the results are incorrect when using AutoTokenizer. Another issue is that loading through AutoTokenizer is much slower than loading the LlamaTokenizer directly: it takes around 4 minutes to load the tokenizer from the path with AutoTokenizer, while it only takes about one second with LlamaTokenizer.

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Python version: 3.8.16
transformers version: 4.28.1

Follow the given example:

from transformers import LlamaTokenizer, AutoTokenizer

model_path = 'openlm-research/open_llama_7b_700bt_preview'
# renamed from `str` to avoid shadowing the built-in
text = ' is embarassed, because Samantha made snide comments about the shirt Rebecca was wearing.'

# load the same checkpoint two ways
tokenizer1 = LlamaTokenizer.from_pretrained(model_path)
tokenizer2 = AutoTokenizer.from_pretrained(model_path)

ret1 = tokenizer1.encode(text, add_special_tokens=False)
ret2 = tokenizer2.encode(text, add_special_tokens=False)

print(ret1)
print(ret2)

Expected behavior

ret1: [322, 2661, 285, 14363, 31844, 906, 23982, 985, 3668, 483, 4309, 562, 266, 13803, 15136, 393, 7732, 31843]
ret2: [31822, 322, 2661, 285, 14363, 31844, 906, 23982, 985, 3668, 483, 4309, 562, 266, 13803, 15136, 393, 7732, 31843]

ret1 is the expected output and ret2 is the incorrect result from AutoTokenizer: AutoTokenizer adds an extra token, 31822 (a space token), to the front of the encoding.
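The two results differ only in that extra leading token; a quick sanity check on the token ids printed above (copied verbatim from the outputs) confirms this:

```python
# Token ids copied from the outputs above.
ret1 = [322, 2661, 285, 14363, 31844, 906, 23982, 985, 3668, 483,
        4309, 562, 266, 13803, 15136, 393, 7732, 31843]
ret2 = [31822, 322, 2661, 285, 14363, 31844, 906, 23982, 985, 3668,
        483, 4309, 562, 266, 13803, 15136, 393, 7732, 31843]

# ret2 is exactly ret1 with the space token 31822 prepended.
assert ret2 == [31822] + ret1
print("AutoTokenizer output = [31822] + LlamaTokenizer output")
```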

@ArthurZucker
Collaborator

Hey! You are using an old version of the tokenizer; you should be using the one available here. This issue has already been fixed.

As for the loading time: AutoTokenizer has to convert the slow tokenizer to a fast one, which naturally takes a long time since a fast tokenizer was not saved in the shared repo.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
