Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The test LlamaIntegrationTest::test_conversion test is failing #23400

Closed
ydshieh opened this issue May 16, 2023 · 4 comments · Fixed by #23408
Closed

The test LlamaIntegrationTest::test_conversion test is failing #23400

ydshieh opened this issue May 16, 2023 · 4 comments · Fixed by #23408

Comments

@ydshieh
Copy link
Collaborator

ydshieh commented May 16, 2023

The following command

RUN_SLOW=1 python3 -m pytest -v tests/models/llama/test_tokenization_llama.py::LlamaIntegrationTest::test_conversion

gives

>           self.assertEqual(old_serialized, new_serialized)
E           AssertionError: '{\n [1465 chars]   "Sequence": {\n          "id": "B",\n      [1794589 chars]}\n}' != '{\n [1465 chars]   "SpecialToken": {\n          "id": "<s>",\n[1794837 chars]}\n}'

tests/models/llama/test_tokenization_llama.py:337: AssertionError

Who can help?

@ArthurZucker

@Narsil
Copy link
Contributor

Narsil commented May 16, 2023

I looked into it.

The difference is that the newly converted tokenizer has ids 32000-32004 as special ids which correspond if I'm not mistaken to OpenAssistant llama fork.

Those do not seem to be declared here: https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main

I'm not sure which part of the code adds them to the slow tokenizer, but this seems indeed like a bug.

Looked at the wrong file. Everything works it's only a different type_id in the post processor.

We simply need to update the tokenizer.json on the hub with the correct value (1)

@Narsil
Copy link
Contributor

Narsil commented May 16, 2023

(There's also a slight issue with the EOS token being added into the processor for no reason.

@Narsil
Copy link
Contributor

Narsil commented May 16, 2023

https://huggingface.co/hf-internal-testing/llama-tokenizer/discussions/3

Goes along with

#23400

@ydshieh
Copy link
Collaborator Author

ydshieh commented May 16, 2023

Confirmed it works!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants