Fix the bug related to word splitting errors in the "tokenize" function. #760

Merged (1 commit) on Apr 14, 2023

Conversation

AfryMask
Contributor

For example, given the prompt "平凡":

平 (0xE5 0xB9 0xB3)
凡 (0xE5 0x87 0xA1)

The tokenizer currently produces these tokens:

16716 (0xE5 0xB9 0xB3)
161 (0xE5)
229 (0x87)
94 (0xA1)

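But the correct tokens are as follows, with 0xE5 0x87 matched as the single token 6336 rather than split into two single-byte tokens: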

16716 (0xE5 0xB9 0xB3)
6336 (0xE5 0x87)
94 (0xA1)
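
To illustrate the expected behaviour, here is a minimal Python sketch of greedy longest-match tokenization over raw bytes. It is not the whisper.cpp code and does not reproduce the actual patch; `TOY_VOCAB` and `greedy_tokenize` are hypothetical names, and the vocabulary contains only the byte sequences and token IDs quoted above.

```python
# Toy vocabulary (assumption): only the byte sequences and IDs quoted in this
# description, not the real model vocabulary.
TOY_VOCAB = {
    bytes([0xE5, 0xB9, 0xB3]): 16716,
    bytes([0xE5, 0x87]): 6336,
    bytes([0xE5]): 161,
    bytes([0x87]): 229,
    bytes([0xA1]): 94,
}


def greedy_tokenize(data: bytes, vocab: dict[bytes, int]) -> list[int]:
    """Split `data` into token IDs, always taking the longest match at each position."""
    tokens = []
    i = 0
    n = len(data)
    while i < n:
        # Try the longest candidate first and shrink it one byte at a time.
        for j in range(n, i, -1):
            token_id = vocab.get(data[i:j])
            if token_id is not None:
                tokens.append(token_id)
                i = j  # continue right after the matched bytes
                break
        else:
            # No prefix starting at position i is in the vocabulary.
            raise ValueError(f"no token for byte 0x{data[i]:02X} at offset {i}")
    return tokens


tokens = greedy_tokenize("平凡".encode("utf-8"), TOY_VOCAB)
print(tokens)  # [16716, 6336, 94]
```

With this toy vocabulary, the second character is matched as the two-byte token 6336 followed by 94, which is the expected output listed above, instead of three single-byte tokens.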

@ggerganov ggerganov merged commit 7e2afa4 into ggerganov:master Apr 14, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
…ze" function. (ggerganov#760)

Co-authored-by: AfryMask <afrymask@gmail.com>
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
…ze" function. (ggerganov#760)

Co-authored-by: AfryMask <afrymask@gmail.com>
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
…ze" function. (ggerganov#760)

Co-authored-by: AfryMask <afrymask@gmail.com>