Arabic Transcription #24

Open
doit-ceo opened this issue Dec 14, 2024 · 1 comment
Comments

@doit-ceo

I followed all the steps to generate the tflite and bin files, and included the forced decoder IDs:

forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")

Arabic text starts to show up, but about 50% of the letters are missing:

Mel spectrogram is calculated...!
2024-12-13 13:00:37.722 17057-17091 WhisperEngineJava       com.whispertflite                    D  output_len: 451
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50258, word: <|startoftranscript|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50272, word: <|ar|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  It is Transcription...
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50359, word: <|transcribe|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50363, word: <|notimestamps|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 21136, word: ĠاÙĦس
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 37440, word: ÙĦاÙħ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 25894, word: ĠعÙĦÙĬ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 24793, word: ÙĥÙħ
2024-12-13 13:00:37.725 17057-17091 WhisperEngineJava       com.whispertflite                    D  Inference is executed...!
2024-12-13 13:00:37.726 17057-17091 MainActivity            com.whispertflite                    D  Result: ?ا�?س�?ا�??ع�?�?�?�?

I asked ChatGPT about the problem and got to this point, but I can't make any more progress. I don't think it's a Unicode issue; more likely the vocabulary file is ignoring about 50% of the Arabic characters. I also tried using the files in Python, but I didn't manage to see any Arabic text at all.
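
One thing that may be worth ruling out before blaming the vocabulary file: Whisper's multilingual tokenizer is a GPT-2 style byte-level BPE, so vocabulary entries store every UTF-8 byte as a stand-in character (a space byte becomes `Ġ`, for instance). The Python sketch below rebuilds that byte table, decodes four token pieces back to text, and also shows a hypothetical failure mode in which stand-in characters above U+00FF are lost. The helper names are made up for illustration, and `naive_latin1_decode` is only an assumption about what could go wrong, not code taken from the app.

```python
def bytes_to_unicode():
    """Byte -> printable stand-in character table used by GPT-2 style byte-level BPE."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes get stand-ins starting at U+0100.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Render text the way a byte-level vocabulary entry stores it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(pieces):
    """Map every stand-in character back to its byte, then decode once as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for piece in pieces for ch in piece)
    return raw.decode("utf-8", errors="replace")


def naive_latin1_decode(pieces):
    """Hypothetical bug: treat each character as a Latin-1 byte.

    Stand-ins above U+00FF cannot be encoded and turn into '?',
    which breaks every UTF-8 sequence that contained one.
    """
    raw = "".join(pieces).encode("latin-1", errors="replace")
    return raw.decode("utf-8", errors="replace")


# Assumed text for token ids 21136, 37440, 25894, 24793 from the log.
pieces = [to_byte_level(t) for t in [" الس", "لام", " علي", "كم"]]

fixed = from_byte_level(pieces)
broken = naive_latin1_decode(pieces)

print(pieces)
print(repr(fixed))
print(repr(broken))
```

Under these assumptions `fixed` is the full phrase, while `broken` keeps only the letters whose two UTF-8 bytes both map to stand-ins at or below U+00FF, which is the same pattern as the logged Result. If the Java side does something similar, mapping each vocabulary character back to its byte before decoding as UTF-8 may recover the missing letters.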

@vilassn
Owner

vilassn commented Dec 18, 2024

Can you try with the base or small model?
