New Tokenizer Loading #276

Open · IIEleven11 opened this issue Jul 22, 2024 · 28 comments

@IIEleven11

IIEleven11 commented Jul 22, 2024

🔴 If you have installed AllTalk in a custom Python environment, I will only be able to provide limited assistance/support. AllTalk draws on a variety of scripts and libraries that are not written or managed by myself, and they may fail, error or give strange results in custom built python environments.

🔴 Please generate a diagnostics report and upload the "diagnostics.log" as this helps me understand your configuration.

https://github.com/erew123/alltalk_tts/tree/main?#-how-to-make-a-diagnostics-report-file

diagnostics.log

Describe the bug
The script doesn't load the new custom tokenizer
To Reproduce
Steps to reproduce the behaviour:
Set up the dataset, check "create new tokenizer", proceed through the process, and begin training.

  1. Checking config.json shows the loaded vocab.json is that of the chosen xttsv2 base model, not the custom tokenizer we just made.
  2. To confirm: during inference, the model requires the base model's vocab.json. When the custom vocab.json is swapped in, it throws an embedding size error.

Screenshots
(screenshots attached)

Text/logs
config.json from the last run. You can see the path/tokenizer that was loaded:
config.json

I can share the terminal output from trying to load the incorrect tokenizer if you want, but it just prints the model keys (a lot of text) and then throws the embedding mismatch error.

Desktop (please complete the following information):
AllTalk was updated: I installed it yesterday
Custom Python environment: no
Text-generation-webUI was updated: no

Additional context
The solution is to just load the custom tokenizer instead of the base model tokenizer. You'll probably have to rename it to vocab.json.
We might have to resize the embeddings layer of the base model to accommodate the new embeddings.

@erew123
Owner

erew123 commented Jul 22, 2024

Hi @IIEleven11

I'm just going to document how it's used in the code / my rough understanding of it.

When the "BPE Tokenizer" is selected, it creates a bpe_tokenizer-vocab.json in the root of the tmp-trn folder:

(screenshot)

In the Stage 2 training code, when starting up training, we look for the existence of bpe_tokenizer-vocab.json and add it to training_assets

(screenshot)

The primary/standard vocab.json file is still set against the model's initialisation setup/config as model_args:

(screenshot)

model_args is then loaded as part of the config:

(screenshot)

Then when we get as far as initialising the actual trainer, we load the config which includes the vocab.json AND the additional bpe_tokenizer-vocab.json as an asset to merge into the setup for this training session.

(screenshot)

The original vocab.json file remains untouched and the bpe_tokenizer-vocab.json is merged/used for the duration of training.
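For reference, a minimal sketch of that flow (the paths and dictionary keys here are illustrative, not the exact finetune code):

from pathlib import Path

out_path = Path("tmp-trn")  # illustrative training output folder
bpe_tokenizer_file = out_path / "bpe_tokenizer-vocab.json"

# The standard vocab.json stays wired into model_args/config as normal.
# The custom tokenizer only rides along as an extra training asset when it exists.
training_assets = {}
if bpe_tokenizer_file.exists():
    training_assets["Tokenizer"] = str(bpe_tokenizer_file)

# The trainer is then initialised with the config (which references vocab.json)
# plus training_assets, so bpe_tokenizer-vocab.json is only used for this run.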

So 2 things at this point:

1) I am unable to re-create the crash/issue you had (as follows):

"Error(s) in loading state_dict for Xtts: size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([3491, 1024]).
size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([3491, 1024]).
size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6681]) from checkpoint, the shape in current model is torch.Size([3491])."

Or

   size mismatch for gpt.text_embedding.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([6153, 1024]).
    size mismatch for gpt.text_head.weight: copying a param with shape torch.Size([6681, 1024]) from checkpoint, the shape in current model is torch.Size([6153, 1024]).
    size mismatch for gpt.text_head.bias: copying a param with shape torch.Size([6681]) from checkpoint, the shape in current model is torch.Size([6153]).

So I'm not sure that those errors and the BPE tokenizer are related issues. You can delete the bpe_tokenizer-vocab.json file and re-load finetuning, click Step 1 (which will detect the existing CSV dataset files and populate Step 2), then run Step 2, which won't find the bpe_tokenizer-vocab.json and so won't add it to the training_assets, and we can see if you get the same error/issue. At the very least that rules out whether it is related to the BPE tokenizer. If the issue doesn't occur, please can you provide other details: which language you are training, whether the settings are otherwise default or you have changed other things, whether your dataset has multiple audio files, and what shows up if you run dataset validation (just curious how many files it thinks there are, etc.).

2) This specific bit of code (the addition of the BPE tokenizer) was not code I added/wrote.

This code was added over two PRs, the most recent being Alltalkbeta enhancements #255

I'm not claiming to be an expert in the tokenizer setup, so my thinking is to see if the original submitter of the PR is free/able to get involved in this conversation/issue, and between us we can discuss whether there is an issue with the way it's operating and bounce some ideas around. Does that sound reasonable?

Thanks

@IIEleven11
Author

(quoting @erew123's walkthrough above)

Hmmm. I see... I didn't notice the merge with the original vocab. I'm going to run 4 quick epochs and check again. I may have grabbed the wrong vocab.json.

I've been looking into solving this, and the size of the embedding layer is just the number of key:value pairs in the vocab.json. So an easy way to confirm would be to count the key:value pairs in the created vocab.json and compare that with the base model's count. They should be different because we're adding vocabulary. I'll check here shortly.
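If anyone wants to run the same check, counting the entries is quick (this assumes the XTTS vocab.json layout, with the token-to-id map under model.vocab):

import json

def vocab_size(path):
    # The XTTS-style vocab.json keeps the token -> id mapping under model.vocab
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return len(data.get("model", {}).get("vocab", {}))

print(vocab_size("vocab.json"))                # base model
print(vocab_size("bpe_tokenizer-vocab.json"))  # newly created tokenizer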

@IIEleven11
Author

OK, so they are the same: 6680.
This is the vocab pulled from the training folder after 4 epochs:
testingset_training_xttsft-july-22-2024-01-27am-b03e084.json

This is the vocab pulled from the base model xttsv2_2.0.3 folder:
vocab.json

I could be wrong about the size here though. Maybe that number of key value pairs must stay the same?

@erew123
Owner

erew123 commented Jul 22, 2024

My knowledge on this specific aspect is 50/50 at best and only what I've personally read up on, so feel free to take my answer with a pinch of salt.

My belief/understanding from https://huggingface.co/learn/nlp-course/en/chapter6/5 and https://huggingface.co/docs/transformers/en/tokenizer_summary was that the BPE tokenizer was purely a "during training" thing. From the first link:

After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.

So that should cover it: use the original vocab.json and then add the BPE tokenizer vocab for training. Looking over other code related to BPE tokenizers, I haven't seen any code that specifically alters the original tokenizer (finetuning), bar training an absolute brand new, from-the-ground-up model. So my belief/understanding was that this BPE tokenizer just extends the vocab.json during the process of finetuning the XTTS model, but shouldn't/doesn't alter the original vocab.json as that's required to maintain compatibility. As I say, this is my current understanding.
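As a toy illustration of that "learning merges" idea (this uses the Hugging Face tokenizers library directly and is not the finetune code):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a tiny BPE tokenizer from scratch; the learned merge rules end up in the
# saved JSON alongside the vocab, much like in bpe_tokenizer-vocab.json.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=100)
tokenizer.train_from_iterator(["hello world", "hello there"], trainer)
tokenizer.save("toy-vocab.json", pretty=True)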

Although I'm not saying this is 100% confirmation, based on the question I've also thrown the finetuning code and other documentation (transformers etc.) at ChatGPT to see what we get as an answer. The answer there was:


Training Only:

The BPE tokenizer defined in bpe_tokenizer-vocab.json would most likely be used only during the training process. This is especially true if it's a custom tokenizer created specifically for the fine-tuning dataset. It allows the model to handle specific vocabulary or tokenization needs of the fine-tuning data without permanently altering the base model's vocabulary.

Not Merged into Final vocab.json:

I would not expect the custom BPE tokenizer to be merged into the final vocab.json of the model. The reason is:

Maintaining Compatibility: Keeping the original vocabulary allows the fine-tuned model to remain compatible with the base model's tokenizer. This is crucial for interoperability and using the model in various downstream tasks.
Avoiding Vocabulary Explosion: Merging vocabularies can lead to an unnecessarily large vocabulary, which can increase model size and potentially impact performance.

Separate Tokenizer for Inference:

If the custom tokenizer is essential for using the fine-tuned model correctly, I would expect it to be saved separately and used in conjunction with the model during inference, rather than being merged into the model's main vocabulary.

Possible Exceptions:

There might be cases where merging vocabularies is desirable, such as when:

The fine-tuning introduces critical new tokens that are essential for the model's new capabilities.
The project specifically aims to create a new base model with an expanded vocabulary.


If there is something missing/an issue/a better way to do something, I'm all ears or happy to figure it out.

@IIEleven11
Author

Hmm, I'm not sure I can totally agree with that section on "Maintaining Compatibility". The entire point of training the new tokenizer is to add vocabulary to the base model, making the fine-tuned model more capable of the sounds within our dataset.

During inference, if the model has a different vocab.json file, it means it's tokenizing differently than what it was trained on, right? Or it lacks some of the tokenizing rules the model had during training, so in some cases it wouldn't know how to pronounce certain words or tokens.
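One way to make that mismatch concrete is to encode the same sentence with both tokenizers and compare the ids (the file paths here are placeholders):

from transformers import PreTrainedTokenizerFast

base = PreTrainedTokenizerFast(tokenizer_file="vocab.json")           # base model tokenizer
custom = PreTrainedTokenizerFast(tokenizer_file="merged-vocab.json")  # tokenizer used during training

text = "This is a test sentence."
print(base.encode(text))
print(custom.encode(text))
# If the id sequences differ, the model sees different token ids at inference
# than it saw during training, which is the mismatch described above.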

The model size, and therefore the potential impact on performance, I can agree with to some degree, but isn't that just a side effect of having a larger vocabulary? I'm not sure anything can be done about that at this level of development/use.

I am going to ask a friend of mine that's a little more informed on this specific thing and see what he says.

@erew123
Owner

erew123 commented Jul 22, 2024

Sure! Sounds good :)

Just FYI, for the last 6 weeks I've had to travel on/off for a family situation and I will be travelling again soon. This can limit my ability to review/test/change code and obviously respond in any meaningful way. So if/as/when you respond and you don't hear back from me for a bit, it's because I'm travelling, but I will respond when I can.

@IIEleven11
Author

OK, so the person I talked to is the author of the Tortoise voice cloning repo. XTTSv2 is essentially a child of the Tortoise model, which makes a lot of the code interchangeable or the same.

He had this to say: "The error you're getting is size related and is most likely from the model. Are you using the latest XTTS2 model? It should have a larger text embedding table than 3491.

It's saying that your new vocab size is too large for the text embeddings, so you're getting a shape mismatch.

For tortoise, my observation is you can train on a tokenizer with fewer tokens than the specified size of the weight, but not more, and more is what seems to be happening here.

Oh, also, using the bpe tokenizer addition for training only doesn't make sense, so if this is the implementation, it would be incorrect.
You still need some type of lookup table for those new tokens you added to be encoded and decoded."

This is essentially what I was attempting to get at. So what I think is happening is that during the merging of the vocab.jsons either nothing is happening or it's for some reason still using the original vocab.
He also shared this, which is the script to resize the base model's embedding layer to fit the new vocab.

I can attempt to do this and send a PR. It may be a fair amount of work to implement though. I was reading the commits and saw you had someone else helping with this portion. I am curious about their opinion too; could we maybe ping him if possible?

It's the script to resize the base Tortoise model: https://github.com/JarodMica/ai-voice-cloning/blob/token_expansion/src/expand_tortoise.py
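For context, the resize itself comes down to growing the three tensors named in the size-mismatch errors earlier in the thread. A rough sketch of the idea (this is not the linked script; the checkpoint layout and key names are assumed from those error messages):

import torch

def expand_text_embeddings(model_path, new_vocab_size, out_path):
    checkpoint = torch.load(model_path, map_location="cpu")
    state = checkpoint.get("model", checkpoint)

    for key in ("gpt.text_embedding.weight", "gpt.text_head.weight", "gpt.text_head.bias"):
        old = state[key]
        new_shape = (new_vocab_size,) + tuple(old.shape[1:])
        # Keep the trained rows and initialise the extra rows for the new tokens
        new = torch.zeros(new_shape, dtype=old.dtype)
        if new.dim() > 1:
            new.normal_(mean=0.0, std=0.02)
        new[: old.shape[0]] = old
        state[key] = new

    torch.save(checkpoint, out_path)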

@erew123
Owner

erew123 commented Jul 24, 2024

Hi @IIEleven11. I'm currently traveling (as mentioned earlier), so short replies from me atm. I have contacted @bmalaski on the original PR and asked if they get a chance to look over this incident and throw in any comments.

Thanks

@IIEleven11
Author

No problem, sounds good. I've been testing out a few things; I'll let you know how it goes. Travel safe.

@bmalaski

Yea, when I made that originally, I was using knowledge from text tokenizers, which is what I am familiar with. Digging into the code, it's wrong. I have a change we can test to see if it works better, where it creates a BPE tokenizer with the new words and appends it to the original. This can be loaded later without issues, but I don't have time to test how it changes training and generation atm, as I have also been traveling.

@IIEleven11
Author

(quoting @bmalaski above)

OK, I have some updates too. You're right: the original tokenization process creates a new vocab that does not follow the structure of the base model. This results in a smaller vocab, and the model ends up speaking gibberish.

Also, I successfully wrote the script that expands the base model's embedding layer according to the newly trained/custom vocab.json.

So that new script you wrote for the vocab.json, combined with mine to expand the original model, should be all we need to straighten this out. The only question is how @erew123 wants to integrate it exactly. I can send a PR that just pushes a new script, "expand_xtts.py", to the... /alltalk_tts/system/ft_tokenizer folder. I will go over the specifics of the script in the PR.

@erew123
Owner

erew123 commented Jul 31, 2024

@bmalaski @IIEleven11 Thanks to both of you on this. @bmalaski Happy to look at that code, but I can't see a recent update on your GitHub.

@IIEleven11 Happy to go over a PR and test it. I'm guessing this is something that needs to be run at the end of training OR when the model is compacted and moved to its final folder?

@IIEleven11
Author

(quoting @erew123 above)

Ah yeah, sorry, I ran into some issues during testing. I think I got them though. I also ended up writing both scripts, one to merge the vocab and one to expand the model. I am going to go through and make verbose comments for you, so you will find more details within them. Give me about an hour and you'll see the PR.

This is all prior to fine-tuning. We are making a new tokenizer/vocab during the whisper transcription phase. The new vocabulary needs to be merged with the base model's vocabulary. This of course makes it bigger, so we need to expand the embedding layer of the base model so it's capable of fine-tuning with all of that extra vocab.
So it's: merge the vocab.jsons, then expand the base xttsv2 model.pth, then we can begin the fine-tuning process.

@IIEleven11
Author

IIEleven11 commented Aug 1, 2024

OK, I just trained a model and we have a small issue. My merge script has some flawed logic. It's actually a problem I took lightly because it seemed easy enough at first, but if anyone's interested and wants to attempt to solve it, here is what needs to happen.

We have our base XTTSv2 model. I'm going to reference the 2.0.3 version here only. This model comes with a vocab.json.
Within that JSON are many keys and nested keys. Line 209 is the vocab key, and it has nested keys until line 6890, which is
"[hi]": 6680
(the size of this vocab.json is considered 6680).
This marks the end of the base model vocab. Below this are the "merges", which go on to the end of the JSON.

The entirety of that base model vocab.json needs to stay intact. This is the most important part. When we make a new bpe_tokenizer.json, it doesn't follow this exact mapping of vocab and merges, but it will have some of the same keys and values.

So what we want to do is:

  • Copy the entirety of the base model's vocab.json.
  • Compare it against the new bpe_tokenizer.json.
  • If a key in the new bpe_tokenizer.json matches a key in the base model vocab.json, ignore it.
  • If there is a key in the new bpe_tokenizer.json that is not in the base model vocab.json, strip it of its value,
  • then append the key to the base model's vocab.json with a new value exceeding 6680.

So if the key "dog": 734 were found in our new bpe_tokenizer.json and not in the base model's, we would put it after "[hi]": 6680.
In this specific instance it would look like this:

...
    "समा": 6678,
    "कारी": 6679,
    "[hi]": 6680,
    "dog": 6681
  },
  "merges": [
     "t h",
     "i n",
     ...

Next are the merges: ignore any duplicates and append the rest to the bottom.
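A minimal sketch of that merge, assuming the layout shown above (vocab and merges both living under "model" in the JSON, with merges stored as "t h"-style strings):

import json

def merge_vocab(base_path, new_path, out_path):
    with open(base_path, "r", encoding="utf-8") as f:
        base = json.load(f)
    with open(new_path, "r", encoding="utf-8") as f:
        new = json.load(f)

    base_vocab = base["model"]["vocab"]
    base_merges = base["model"]["merges"]

    # Append only tokens the base model does not already have,
    # numbering them on from the last base id (6680 -> 6681, 6682, ...)
    next_id = max(base_vocab.values()) + 1
    for token in new["model"]["vocab"]:
        if token not in base_vocab:
            base_vocab[token] = next_id
            next_id += 1

    # Same idea for the merges: skip duplicates, append the rest at the bottom
    seen = set(base_merges)
    for merge in new["model"]["merges"]:
        if merge not in seen:
            base_merges.append(merge)
            seen.add(merge)

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(base, f, indent=2, ensure_ascii=False)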

I'll be working on this too, just thought I'd share an update. I've attached the original 2.0.3 vocab.json for convenience:
vocab.json

And we can test the outcome by comparing the output of the base model vocab.json and our newly merged vocab.json:

from transformers import PreTrainedTokenizerFast

# Point this at the base vocab.json, then at the merged vocab.json, and compare the token lists
tokenizer = PreTrainedTokenizerFast(tokenizer_file="/path/to/vocab.json")

# Tokenize
sample_sentence = "This is a test sentence."
tokens = tokenizer.tokenize(sample_sentence)
print(tokens)

@IIEleven11
Author

Nvm I did it.

@erew123
Owner

erew123 commented Aug 1, 2024

@IIEleven11 I will take a look at / test the PR as soon as I have my home PC in front of me :)

@bmalaski

bmalaski commented Aug 2, 2024

For what it's worth, here is what I was testing:


# Imports added for context (the VoiceBpeTokenizer path is assumed from the Coqui TTS package);
# the finetune script already defines this_dir, base_path, out_path, whisper_words, target_language, etc.
import json
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from TTS.tts.layers.xtts.tokenizer import VoiceBpeTokenizer

if create_bpe_tokenizer:
    print("Training BPE Tokenizer")
    vocab_file = VoiceBpeTokenizer(str(this_dir / base_path / "xttsv2_2.0.3" / "vocab.json"))

    # Pre-process the whisper transcriptions the same way XTTS does at runtime
    for i in range(len(whisper_words)):
        whisper_words[i] = vocab_file.preprocess_text(whisper_words[i], target_language)

    # Train a fresh BPE tokenizer on the dataset's words
    trainer = BpeTrainer(special_tokens=['[STOP]', '[UNK]', '[SPACE]'], vocab_size=vocab_file.char_limits[target_language])
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.train_from_iterator(whisper_words, trainer, length=len(whisper_words))
    tokenizer.save(path=str(out_path / "trained-vocab.json"), pretty=True)

    # Load both the base vocab and the newly trained one
    vocab_file = json.load(open(os.path.join(str(this_dir / base_path / "xttsv2_2.0.3" / "vocab.json")), 'r', encoding='utf-8'))
    trained_file = json.load(open(os.path.join(str(out_path / "trained-vocab.json")), 'r', encoding='utf-8'))

    vocab_array = vocab_file.get('model', {}).get('vocab', {})
    vocab_array_trained = trained_file.get('model', {}).get('vocab', {})

    # Start from the base vocab, then append only the genuinely new tokens,
    # numbering them on from the end of the base vocab
    combined_vocab_array = {}
    for key, value in vocab_array.items():
        combined_vocab_array[key] = value

    next_index = int(len(vocab_array))
    for key, value in vocab_array_trained.items():
        if key not in combined_vocab_array:
            combined_vocab_array[key] = next_index
            next_index += 1

    vocab_file['model']['vocab'] = combined_vocab_array

    with open(str(out_path / "bpe_tokenizer-vocab.json"), 'w', encoding='utf-8') as file:
        json.dump(vocab_file, file, indent=4, ensure_ascii=False)

I found that merging in the new merges led to speech issues, with slurred words.

@IIEleven11
Author

(quoting @bmalaski's code and note above)

Yes, so my new merge script, while having no apparent errors, also led to slurred speech and gibberish. So... there is most certainly some nuance we're missing.

@IIEleven11
Author

It's the tokenizer.py script I think. We need to make it more aggressive with what it decides to tokenize.

@IIEleven11
Author

OK, so to update: the merge and expand scripts both work as expected, but the creation of the new vocab.json needs to be done the same way that Coqui did it. It would help if someone could locate the script Coqui used to create the vocab.json (I believe it was originally named tokenizer.json).

I thought Coqui's tokenizer.py would create the vocab.json, but this isn't the case. I could also be overlooking something. If anyone has any insight, please let me know.

I trained a new model and the slurred speech is gone, but it has an English accent for no apparent reason. Something to note is that this issue was brought up with the 2.0.3 model by a couple of people. I have trained on this dataset before, though, and this is the first time I am seeing it, which I would think points to a potential tokenizer problem.

@erew123
Owner

erew123 commented Aug 3, 2024

@IIEleven11 There is a tokenizer.json in the Tortoise scripts https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/utils/assets/tortoise/tokenizer.json

not sure if that's the one...

Or could it be this file https://github.com/coqui-ai/TTS/blob/dev/tests/inputs/xtts_vocab.json

@IIEleven11
Author

(quoting @erew123's links above)

Ahh, good find! I have been looking for this forever. I overlooked it because I figured they would at least put it in a different location. That's what I get for assuming; thank you though. With this I can hopefully get some answers.

@erew123
Owner

erew123 commented Aug 4, 2024

@IIEleven11 I'll hang back on merging anything from the PR, in case you find anything. I have just merged one small update from @bmalaski, who found a much faster whisper model, which, after testing, really speeds up the initial step one for dataset generation, so just so you are aware that got merged in.

@IIEleven11
Author

(quoting @erew123 above)

I successfully removed all non-English characters from the 2.0.3 tokenizer and trained a model that gave a very clean result without an accent. This process was a bit more complex and nuanced than I thought it would be, though. Because of this, I think the best option is to give the script a variable that uses either the 2.0.2 or 2.0.3 vocab and base model for the merge and expansion, depending on the end user's choice, while making sure to mention the potential the 2.0.3 model has for artifacting/an accent of some sort.

As for the change in whisper model: the quality of this whole process really relies on the quality of the transcription. If whisper fails to accurately transcribe the dataset, then the vocab we create from that dataset will have a very damaging effect on our fine-tuned model. So while there are definitely faster options than whisper large-v3, and I can appreciate a balance of speed and accuracy, I don't feel we have room to budge; a poor tokenizer would negate the entire fine-tuning process. Unless of course it is faster and more accurate.
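For anyone curious, the non-English clean-up can be roughly approximated like this (a sketch only: it keeps ASCII-only tokens, leaves the surviving ids untouched so they still line up with the model's embedding rows, and the real filtering rules may have differed):

import json

def strip_non_english_tokens(vocab_path, out_path):
    with open(vocab_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Keep ASCII-only tokens (special tokens like [STOP]/[UNK]/[SPACE] are ASCII too),
    # preserving their original ids
    vocab = data["model"]["vocab"]
    data["model"]["vocab"] = {tok: idx for tok, idx in vocab.items() if tok.isascii()}

    # Drop merge rules that reference tokens which no longer exist
    kept = set(data["model"]["vocab"])
    data["model"]["merges"] = [
        m for m in data["model"]["merges"]
        if all(part in kept for part in m.split(" "))
    ]

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)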

@Mixomo

Mixomo commented Aug 30, 2024

Hey, in this discussion thread I have talked about premade datasets and the BPE tokenizer:

#245 (comment)

Is it possible to train the bpe tokenizer from pre-made transcripts instead of using whisper?
Thank you.

@IIEleven11
Author

(quoting @Mixomo's question above)

Yes, it doesn't matter what you use to transcribe the audio. The script I wrote will just strip it of all formatting and look at the words before making the vocab.

@Mixomo
Mixomo commented Sep 2, 2024

(quoting the exchange above)

Hey, what I'm saying is that I already have the transcripts prepared in advance, so I don't need to use Whisper and can go directly to the training. As I understand it, your script performs the transcription in step 1 using Whisper. By skipping that step, I wouldn't be training the BPE tokenizer either.

What I wanted to know or ask for is a way to train the tokenizer in step 2, before the training, instead of in step 1 along with Whisper.

As I explained in that thread, I already have a script that formats the transcriptions into the format accepted by AllTalk, so the only thing left would be to use those CSV transcriptions to train the tokenizer.

I hope I've explained myself clearly 😅

@IIEleven11
Author

(quoting the exchange above)

None of my scripts have anything to do with transcription. This whole process would still need to be added into the webui appropriately; as of right now you would have to run each script separately.
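As a rough illustration (not part of the PR), training the BPE tokenizer straight from an existing metadata CSV rather than from the Step 1 whisper pass could look something like this; the file name, delimiter, and column layout are assumptions:

import csv
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Pull the transcript column out of a pipe-delimited metadata CSV
lines = []
with open("metadata_train.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        if len(row) >= 2:
            lines.append(row[1])

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[STOP]", "[UNK]", "[SPACE]"])
tokenizer.train_from_iterator(lines, trainer, length=len(lines))
tokenizer.save("trained-vocab.json", pretty=True)

# The resulting trained-vocab.json would still need to go through the same
# merge/expand steps discussed above before fine-tuning.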
