-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix extra start token in phi3 tokenizer. Address #1063 #1065
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1065
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New FailureAs of commit a6fd548 with merge base 161ac3d (): NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Can you include a gist with your test confirming this works? Thanks! |
Sure. Gist follows: from torchtune.data import Message
from torchtune.models.phi3 import phi3_mini_tokenizer
messages = [
Message(role="system", content="system content"),
Message(role="user", content="users content")
]
tokenizer = phi3_mini_tokenizer("tmp/tokenizer.model")
input_ids, _ = tokenizer.tokenize_messages(messages, ignore_system_prompts=False)
print(input_ids) # [1, 32010, 29871, 13, 7193, 2793, 32007, 29871, 13] |
why not just add this as a quick test :) you can use the sentencepiece test file we use here: https://github.com/pytorch/torchtune/blob/main/tests/torchtune/modules/tokenizers/test_sentencepiece.py |
@hmosousa thanks for fixing this bug! I am actually gonna veto @RdoubleA's suggestion.. in this case you already did most of the important work by finding and fixing the bug with a reasonable repro. We should definitely have a unit test for the Phi3 tokenizer, and it should ideally test this behavior. But given no such unit test currently exists, I don't think it should be on you to add it in this PR. (If you want to add one separately we would of course love to have it, but we don't need to block this fix on it, and otherwise we can and should add it ourselves.) |
Thanks @RdoubleA and @ebsmothers. I might be able to add the tests. Will create a new issue for that. |
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses.
Issue #1063
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)
pre-commit install
)pytest tests
pytest tests -m integration_test