Unable to start xtts v2 training process. #3303
Comments
I saw this issue yesterday. Which dataset format are you using? My issue was that the ljspeech format expects a pipe-delimited metadata.csv, and mine was still comma-delimited.
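A quick way to test the delimiter point is to count how many fields each row splits into when read with `|` as the separator. The sketch below is only illustrative: the helper names are hypothetical, and it assumes the three-column LJSpeech layout (`id|transcription|normalized transcription`).

```python
import csv
import io


def count_fields_per_row(csv_text, delimiter="|"):
    """Return how many fields csv.reader finds in each non-empty row."""
    reader = csv.reader(
        io.StringIO(csv_text), delimiter=delimiter, quoting=csv.QUOTE_NONE
    )
    return [len(row) for row in reader if row]


def looks_pipe_delimited(csv_text, expected_fields=3):
    """True only if every row splits into the expected number of fields."""
    counts = count_fields_per_row(csv_text)
    return bool(counts) and all(n == expected_fields for n in counts)


# Hypothetical sample rows, not taken from the thread.
good = "clip_0001|Hello there.|Hello there.\nclip_0002|Dr. Smith|Doctor Smith\n"
bad = "clip_0001,Hello there.,Hello there.\nclip_0002,Dr. Smith,Doctor Smith\n"

print(looks_pipe_delimited(good))
print(looks_pipe_delimited(bad))
```

To check a real file, read it with `open(path, encoding="utf-8", newline="").read()` and pass the text to `looks_pipe_delimited`.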
Hey @Okohedeki, I'm using ljspeech format. I've formatted my dataset in ljspeech format.
Are you sure that the csv is pipe-delimited? Just because there are pipes in the csv doesn't make it a pipe-delimited dataset. For example, when I was saving the csv I had this line here:
The error is different because the dataset is not correct.
Yes, I can confirm that this is not the case. It is pipe-delimited ("|"). The same dataset works with other approaches such as vits and yourtts!
The only other thing: if you go to this file, TTS\TTS\tts\datasets\formatters.py, can you print out the actual path of the file in the ljspeech function? It should be the txt_file. I had to change the line to:
to stop it from appending .wav to my file that was already saved as .wav.
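The doubled extension described above can be avoided by adding `.wav` only when the clip id does not already end with it. This is a hypothetical helper, not the change the commenter actually made, and it assumes the usual LJSpeech layout of `<root>/wavs/<id>.wav`.

```python
import os


def resolve_wav_path(root_path, clip_id):
    """Build <root>/wavs/<clip_id>, appending '.wav' only when it is missing."""
    if clip_id.lower().endswith(".wav"):
        file_name = clip_id
    else:
        file_name = clip_id + ".wav"
    return os.path.join(root_path, "wavs", file_name)


# Both spellings of the id resolve to the same file name.
print(resolve_wav_path("my_dataset", "clip_0001"))
print(resolve_wav_path("my_dataset", "clip_0001.wav"))
```

Printing the resolved path for a few rows and checking it with `os.path.isfile` is a cheap way to confirm that the formatter is pointing at files that exist.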
Yes, I have the exact same format as ljspeech.
Hi @arbianqx, This message "> Total eval samples after filtering: 0" indicates that you don't have any eval samples. It can be caused by three reasons:
In all these scenarios, you need to change (or create) your eval CSV to meet the requirements for training. Alternatively, PR #3296 implements a Gradio demo for data processing plus training and inference for the XTTS model. The PR also has a Google Colab, and soon we will publish a video showing how to use the demo.
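If there is no eval CSV yet, one simple option is to hold out a fraction of the metadata rows. The sketch below is an assumption about how that could be done, not the method used by the recipe or the PR; `split_rows` and its parameters are hypothetical, and it works on plain rows with no header line.

```python
import random


def split_rows(rows, eval_fraction=0.15, seed=0):
    """Shuffle rows deterministically and hold out a fraction for evaluation."""
    rows = list(rows)
    if len(rows) < 2:
        raise ValueError("need at least two rows to build train and eval sets")
    random.Random(seed).shuffle(rows)
    # Always keep at least one eval row and at least one train row.
    n_eval = min(max(1, int(len(rows) * eval_fraction)), len(rows) - 1)
    return rows[n_eval:], rows[:n_eval]


# Hypothetical metadata rows in pipe-delimited form.
all_rows = [f"clip_{i:04d}|text {i}|text {i}" for i in range(20)]
train_rows, eval_rows = split_rows(all_rows, eval_fraction=0.25)
print(len(train_rows), len(eval_rows))
```

The two lists can then be written out with `"\n".join(...)` to separate train and eval files. Held-out rows can still be dropped later if the training code filters samples, so the eval count reported at startup is worth checking again.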
Reopen if the comment above doesn't help.
Hey @arbianqx! I would like to train XTTSv2 on my own dataset, but I've no clue on how to start. Could you provide me some resources/notebooks that will help me get started? Thanks!
I use the formatter method to process my audio files (Chinese language), but the CSV files I get contain no data, because a condition in the function below is never met. I am sure that the Whisper model outputs are fine:

(Pdb) words_list[0]
Word(start=0.0, end=0.42, word='但', probability=0.82470703125)

def format_audio_list(audio_files, target_language="en", out_path=None, buffer=0.2, eval_percentage=0.15, speaker_name="coqui", gradio_progress=None):
    audio_total_size = 0
    # make sure that the output directory exists
    os.makedirs(out_path, exist_ok=True)

    # Loading Whisper
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print("Loading Whisper Model!")
    asr_model = WhisperModel("large-v2", device=device, compute_type="float16")

    metadata = {"audio_file": [], "text": [], "speaker_name": []}

    if gradio_progress is not None:
        tqdm_object = gradio_progress.tqdm(audio_files, desc="Formatting...")
    else:
        tqdm_object = tqdm(audio_files)

    for audio_path in tqdm_object:
        wav, sr = torchaudio.load(audio_path)
        # stereo to mono if needed
        if wav.size(0) != 1:
            wav = torch.mean(wav, dim=0, keepdim=True)

        wav = wav.squeeze()
        audio_total_size += (wav.size(-1) / sr)

        segments, _ = asr_model.transcribe(audio_path, word_timestamps=True, language=target_language)
        segments = list(segments)
        i = 0
        sentence = ""
        sentence_start = None
        first_word = True
        # collect the words of all segments in a single list
        words_list = []
        for _, segment in enumerate(segments):
            words = list(segment.words)
            words_list.extend(words)

        # process each word
        for word_idx, word in enumerate(words_list):
            if first_word:
                sentence_start = word.start
                # If it is the first sentence, add buffer or get the beginning of the file
                if word_idx == 0:
                    sentence_start = max(sentence_start - buffer, 0)  # Add buffer to the sentence start
                else:
                    # get previous sentence end
                    previous_word_end = words_list[word_idx - 1].end
                    # add buffer or get the middle of the silence between the previous sentence and the current one
                    sentence_start = max(sentence_start - buffer, (previous_word_end + sentence_start)/2)

                sentence = word.word
                first_word = False
            else:
                sentence += word.word

            if word.word[-1] in ["!", ".", "?"]:
                sentence = sentence[1:]
                # Expand number and abbreviations plus normalization
                sentence = multilingual_cleaners(sentence, target_language)
                audio_file_name, _ = os.path.splitext(os.path.basename(audio_path))

                audio_file = f"wavs/{audio_file_name}_{str(i).zfill(8)}.wav"

                # Check for the next word's existence
                if word_idx + 1 < len(words_list):
                    next_word_start = words_list[word_idx + 1].start
                else:
                    # If there are no more words, this is the last sentence, so use the audio length as the next word start
                    next_word_start = (wav.shape[0] - 1) / sr

                # Average the current word end and next word start
                word_end = min((word.end + next_word_start) / 2, word.end + buffer)

                absoulte_path = os.path.join(out_path, audio_file)
                os.makedirs(os.path.dirname(absoulte_path), exist_ok=True)
                i += 1
                first_word = True

                audio = wav[int(sr*sentence_start):int(sr*word_end)].unsqueeze(0)
                # if the audio is too short ignore it (i.e. < 0.33 seconds)
                if audio.size(-1) >= sr/3:
                    torchaudio.save(absoulte_path,
                        audio,
                        sr
                    )
                else:
                    continue

                metadata["audio_file"].append(audio_file)
                metadata["text"].append(sentence)
                metadata["speaker_name"].append(speaker_name)

    df = pandas.DataFrame(metadata)
    df = df.sample(frac=1)
    num_val_samples = int(len(df)*eval_percentage)

    df_eval = df[:num_val_samples]
    df_train = df[num_val_samples:]

    df_train = df_train.sort_values('audio_file')
    train_metadata_path = os.path.join(out_path, "metadata_train.csv")
    df_train.to_csv(train_metadata_path, sep="|", index=False)

    eval_metadata_path = os.path.join(out_path, "metadata_eval.csv")
    df_eval = df_eval.sort_values('audio_file')
    df_eval.to_csv(eval_metadata_path, sep="|", index=False)

    # deallocate VRAM and RAM
    del asr_model, df_train, df_eval, df, metadata
    gc.collect()

    return train_metadata_path, eval_metadata_path, audio_total_size
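One possible reading of the empty CSVs, offered as an assumption rather than a confirmed diagnosis, is that the formatter only closes a sentence when a word ends in an ASCII `!`, `.` or `?`, while Chinese transcripts typically use full-width marks such as `。` or carry no punctuation at all. The sketch below isolates that check; the word list and the extended terminator set are hypothetical and not part of the library.

```python
# Terminators checked by the function quoted above.
ASCII_TERMINATORS = ("!", ".", "?")
# Full-width marks common in Chinese text; adding them is an assumption.
CJK_TERMINATORS = ("。", "！", "？")


def ends_sentence(word_text, terminators=ASCII_TERMINATORS):
    """True if the last character of the word is a sentence terminator."""
    return bool(word_text) and word_text[-1] in terminators


def count_sentences(words, terminators):
    """Count how many words would close a sentence."""
    return sum(ends_sentence(w, terminators) for w in words)


# Hypothetical word texts shaped like Whisper word-level output.
words = ["但", "是", "我", "来", "了。"]

print(count_sentences(words, ASCII_TERMINATORS))
print(count_sentences(words, ASCII_TERMINATORS + CJK_TERMINATORS))
```

With the ASCII-only set no word closes a sentence, so nothing would be appended to the metadata. Whether the commenter's transcripts contain full-width punctuation is not shown in the thread, so printing the last few entries of `words_list` would be the first thing to verify.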
So can we use a dataset that contains multiple speakers, all in the same language, to train xtts v2?
Describe the bug
I have prepared my own dataset in LJSpeech format. I tried starting the training process based on the recipe, but was unable to do so. I think this happens because the dataset is not in the supported list provided by xtts v2. I get the following error:
AssertionError: ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.
The same dataset can be used with other training scripts/approaches, such as vits or yourtts.
To Reproduce
Run training script with another language dataset!
Expected behavior
Training should start.
Logs
Environment
Additional context
No response