Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix extra start token in phi3 tokenizer. Address #1063 #1065

Merged
merged 1 commit into from
Jun 11, 2024

Conversation

hmosousa
Copy link
Contributor

@hmosousa hmosousa commented Jun 6, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.
Issue #1063

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Copy link

pytorch-bot bot commented Jun 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1065

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit a6fd548 with merge base 161ac3d (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 6, 2024
@joecummings
Copy link
Contributor

Can you include a gist with your test confirming this works?

Thanks!

@hmosousa
Copy link
Contributor Author

Sure. Gist follows:

from torchtune.data import Message
from torchtune.models.phi3 import phi3_mini_tokenizer

messages = [
    Message(role="system", content="system content"),
    Message(role="user", content="users content")
]
tokenizer = phi3_mini_tokenizer("tmp/tokenizer.model")
input_ids, _ = tokenizer.tokenize_messages(messages, ignore_system_prompts=False)
print(input_ids)  # [1, 32010, 29871, 13, 7193, 2793, 32007, 29871, 13]

@RdoubleA
Copy link
Contributor

RdoubleA commented Jun 11, 2024

Sure. Gist follows:

from torchtune.data import Message
from torchtune.models.phi3 import phi3_mini_tokenizer

messages = [
    Message(role="system", content="system content"),
    Message(role="user", content="users content")
]
tokenizer = phi3_mini_tokenizer("tmp/tokenizer.model")
input_ids, _ = tokenizer.tokenize_messages(messages, ignore_system_prompts=False)
print(input_ids)  # [1, 32010, 29871, 13, 7193, 2793, 32007, 29871, 13]

why not just add this as a quick test :) you can use the sentencepiece test file we use here: https://github.com/pytorch/torchtune/blob/main/tests/torchtune/modules/tokenizers/test_sentencepiece.py

@ebsmothers
Copy link
Contributor

@hmosousa thanks for fixing this bug! I am actually gonna veto @RdoubleA's suggestion.. in this case you already did most of the important work by finding and fixing the bug with a reasonable repro. We should definitely have a unit test for the Phi3 tokenizer, and it should ideally test this behavior. But given no such unit test currently exists, I don't think it should be on you to add it in this PR. (If you want to add one separately we would of course love to have it, but we don't need to block this fix on it, and otherwise we can and should add it ourselves.)

@joecummings joecummings self-requested a review June 11, 2024 14:08
@joecummings joecummings merged commit 37d0be0 into pytorch:main Jun 11, 2024
24 of 25 checks passed
@hmosousa
Copy link
Contributor Author

Thanks @RdoubleA and @ebsmothers. I might be able to add the tests. Will create a new issue for that.

@hmosousa hmosousa deleted the fix_phi_tkn branch June 11, 2024 15:54
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Phi3 tokenizer adds a extra start token when ignore_system_prompts=True
5 participants