Unable to start xtts v2 training process. #3303

arbianqx · 2023-11-24T17:20:37Z

Describe the bug

I prepared my own dataset in LJSpeech format and tried to start training based on the recipe, but the run fails. I think it behaves this way because the dataset is not in the supported list provided by xtts v2. I get the following error:
AssertionError: ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.
The same dataset can be used with other training scripts/approaches, such as vits or yourtts.

To Reproduce

Run the training script with a dataset in another language.

Expected behavior

Training should start.

Logs

> EPOCH: 0/1000
 --> /TTS/run/training/GPT_XTTS_v2.0_LJSpeech_FT-November-24-2023_05+18PM-990b209
 > Filtering invalid eval samples!!
[!] Warning: The text length exceeds the character limit of 250 for language 'sq', this might cause truncated audio.
[!] Warning: The text length exceeds the character limit of 250 for language 'sq', this might cause truncated audio.
 > Total eval samples after filtering: 0
 ! Run is removed from /TTS/run/training/GPT_XTTS_v2.0_LJSpeech_FT-November-24-2023_05+18PM-990b209
Traceback (most recent call last):
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 1826, in fit
    self._fit()
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 1780, in _fit
    self.eval_epoch()
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 1628, in eval_epoch
    self.get_eval_dataloader(
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 990, in get_eval_dataloader
    return self._get_loader(
  File "tts/lib/python3.10/site-packages/trainer/trainer.py", line 914, in _get_loader
    len(loader) > 0
AssertionError:  ❗ len(DataLoader) returns 0. Make sure your dataset is not empty or len(dataset) > 0.

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB",
            "NVIDIA A100-PCIE-40GB"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "TTS": "0.21.1",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.8",
        "version": "#202212290932~1674066459~20.04~3cd2bf3-Ubuntu SMP PREEMPT_DYNAMI"
    }
}

Additional context

No response

Okohedeki · 2023-11-24T20:23:06Z

I saw this issue yesterday. Which dataset format are you using? In my case the cause was that the ljspeech format expects a pipe-delimited metadata.csv, and mine was still comma-delimited.

arbianqx · 2023-11-24T20:25:06Z

Hey @Okohedeki, I'm using ljspeech format. I've formatted my dataset in ljspeech format.

Okohedeki · 2023-11-24T20:27:39Z

Are you sure that the csv is pipe-delimited? Just because there are pipes in the csv doesn't make it a pipe-delimited dataset. For example, when I was saving the csv I had this line here:
csv_writer = csv.writer(csv_file)
and I had to switch it to

The error is most likely because the dataset is not formatted correctly.
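
The replacement line is missing from the comment above, so the following is not a reconstruction of it. It is only a minimal sketch of the general idea: Python's standard csv.writer accepts a delimiter argument, and passing "|" produces pipe-delimited rows. The helper name, column layout, and sample rows are made up for illustration.

```python
import csv
import io


def write_pipe_delimited(rows, handle):
    """Write rows as pipe-delimited lines (hypothetical helper, not part of TTS)."""
    # delimiter="|" is what makes the output pipe-delimited; the default is ",".
    writer = csv.writer(handle, delimiter="|", lineterminator="\n")
    writer.writerows(rows)


# Made-up rows in an LJSpeech-like layout: file id | text | normalized text.
rows = [
    ["audio_0001", "Hello, world.", "Hello, world."],
    ["audio_0002", "Second sample.", "Second sample."],
]

buffer = io.StringIO()
write_pipe_delimited(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of an in-memory buffer, the csv module documentation recommends opening it with newline="" so that line endings are not translated twice.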

arbianqx · 2023-11-24T20:32:12Z

Yes, I can confirm that this is not the problem. The file is pipe-delimited ("|"). The same dataset works with other approaches such as vits and yourtts!

Okohedeki · 2023-11-24T20:54:43Z

The only other thing I can suggest is to look at this file:

TTS\TTS\tts\datasets\formatters.py

In the ljspeech function, can you print out the actual path of the file? It should be the txt_file. I had to change the line to:

            # wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            wav_file = os.path.join(root_path, "wavs", cols[0])

to stop it from appending .wav to my file that was already saved as .wav
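
Instead of editing the formatter by hand, the same effect could be achieved with a small path helper that appends ".wav" only when the file id does not already end with it. This is a sketch under that assumption; resolve_wav_path is a hypothetical name and is not a function in the TTS code base.

```python
import os


def resolve_wav_path(root_path, file_id):
    """Build the wav path, adding ".wav" only when the id lacks it (hypothetical helper)."""
    # Compare case-insensitively so ids such as "clip.WAV" are also left alone.
    name = file_id if file_id.lower().endswith(".wav") else file_id + ".wav"
    return os.path.join(root_path, "wavs", name)


# Both spellings of the id resolve to the same file on disk.
print(resolve_wav_path("dataset", "audio_0001"))
print(resolve_wav_path("dataset", "audio_0001.wav"))
```

Printing the resolved path and checking it with os.path.exists is a quick way to see whether the metadata ids and the files on disk actually match.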

arbianqx · 2023-11-25T10:29:22Z

Yes, I have the exact same format as ljspeech.

Edresson · 2023-11-27T18:23:05Z

Hi @arbianqx,

This message "> Total eval samples after filtering: 0" indicates that you don't have any eval samples. There are three possible causes:

The eval CSV that you provided is empty;
The samples in the eval CSV that you provided are longer than the max_wav_length and max_text_length defined in the recipe (https://github.com/coqui-ai/TTS/blob/dev/recipes/ljspeech/xtts_v2/train_gpt_xtts.py#L86C1-L87C29). Note that we do not recommend changing these values for fine-tuning;
You did not provide an eval CSV and all the automatically selected samples are longer than max_wav_length and max_text_length.

In all these scenarios, you need to change (or create) your eval CSV to meet the requirements for training.
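
To see which of these cases applies, it can help to count how many eval rows would survive the length limits before starting a run. The sketch below is an approximation, not the trainer's actual filter: it measures text in characters, whereas the trainer may measure tokenized length, and the function name, the sample data, and the limit values passed in are illustrative placeholders. Take the real limits from your recipe.

```python
import csv
import io


def rows_within_limits(csv_text, num_samples_by_id, max_text_length, max_wav_length):
    """Return the ids of rows whose text and audio both fit the given limits."""
    reader = csv.reader(io.StringIO(csv_text), delimiter="|", quoting=csv.QUOTE_NONE)
    kept = []
    for cols in reader:
        if len(cols) < 2:
            continue  # blank or malformed line
        file_id, text = cols[0], cols[-1]
        num_samples = num_samples_by_id.get(file_id)
        if num_samples is None:
            continue  # no audio length known for this row
        # Character count is only a rough stand-in for the trainer's text length.
        if len(text) <= max_text_length and num_samples <= max_wav_length:
            kept.append(file_id)
    return kept


# Made-up data: row "b" has audio that is too long, row "c" has text that is too long.
csv_text = "a|short text\nb|short text\nc|" + "x" * 300 + "\n"
num_samples = {"a": 100_000, "b": 400_000, "c": 100_000}
kept = rows_within_limits(csv_text, num_samples, max_text_length=200, max_wav_length=255_995)
print(kept)
```

If the returned list is empty for your eval CSV, the eval set needs shorter samples.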

Alternatively, PR #3296 implements a Gradio demo for data processing, training, and inference with the XTTS model. The PR also includes a Google Colab, and we will soon publish a video showing how to use the demo.

erogol · 2023-11-28T10:36:19Z

Reopen if the comment above doesn't help.

rumbleFTW · 2023-11-28T22:07:25Z

Hey @arbianqx! I would like to train XTTSv2 on my own dataset, but I have no idea how to start. Could you point me to some resources/notebooks that will help me get started? Thanks!

dorbodwolf · 2023-12-31T13:57:38Z

I used the formatter method to process my audio files (Chinese), but the resulting csv files contain no data, because the condition if word.word[-1] in ["!", ".", "?"]: is never met.

I am sure that the whisper model outputs are fine:

(Pdb) words_list[0]
Word(start=0.0, end=0.42, word='但', probability=0.82470703125)

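import gc
import os

import pandas
import torch
import torchaudio
from faster_whisper import WhisperModel
from tqdm import tqdm

from TTS.tts.layers.xtts.tokenizer import multilingual_cleaners

def format_audio_list(audio_files, target_language="en", out_path=None, buffer=0.2, eval_percentage=0.15, speaker_name="coqui", gradio_progress=None):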
    audio_total_size = 0
    # make sure that the output directory exists
    os.makedirs(out_path, exist_ok=True)

    # Loading Whisper
    device = "cuda" if torch.cuda.is_available() else "cpu" 

    print("Loading Whisper Model!")
    asr_model = WhisperModel("large-v2", device=device, compute_type="float16")

    metadata = {"audio_file": [], "text": [], "speaker_name": []}

    if gradio_progress is not None:
        tqdm_object = gradio_progress.tqdm(audio_files, desc="Formatting...")
    else:
        tqdm_object = tqdm(audio_files)

    for audio_path in tqdm_object:
        wav, sr = torchaudio.load(audio_path)
        # stereo to mono if needed
        if wav.size(0) != 1:
            wav = torch.mean(wav, dim=0, keepdim=True)

        wav = wav.squeeze()
        audio_total_size += (wav.size(-1) / sr)

        segments, _ = asr_model.transcribe(audio_path, word_timestamps=True, language=target_language)
        segments = list(segments)
        i = 0
        sentence = ""
        sentence_start = None
        first_word = True
        # collect the words of all segments into a single list
        words_list = []
        for _, segment in enumerate(segments):
            words = list(segment.words)
            words_list.extend(words)

        # process each word
        for word_idx, word in enumerate(words_list):
            if first_word:
                sentence_start = word.start
                # If it is the first sentence, add buffer or get the beginning of the file
                if word_idx == 0:
                    sentence_start = max(sentence_start - buffer, 0)  # Add buffer to the sentence start
                else:
                    # get previous sentence end
                    previous_word_end = words_list[word_idx - 1].end
                    # add buffer or get the middle of the silence between the previous sentence and the current one
                    sentence_start = max(sentence_start - buffer, (previous_word_end + sentence_start)/2)

                sentence = word.word
                first_word = False
            else:
                sentence += word.word

            if word.word[-1] in ["!", ".", "?"]:
                sentence = sentence[1:]
                # Expand number and abbreviations plus normalization
                sentence = multilingual_cleaners(sentence, target_language)
                audio_file_name, _ = os.path.splitext(os.path.basename(audio_path))

                audio_file = f"wavs/{audio_file_name}_{str(i).zfill(8)}.wav"

                # Check for the next word's existence
                if word_idx + 1 < len(words_list):
                    next_word_start = words_list[word_idx + 1].start
                else:
                    # If there are no more words, this is the last sentence, so use the audio length as the next word start
                    next_word_start = (wav.shape[0] - 1) / sr

                # Average the current word end and next word start
                word_end = min((word.end + next_word_start) / 2, word.end + buffer)
                
                absoulte_path = os.path.join(out_path, audio_file)
                os.makedirs(os.path.dirname(absoulte_path), exist_ok=True)
                i += 1
                first_word = True

                audio = wav[int(sr*sentence_start):int(sr*word_end)].unsqueeze(0)
                # if the audio is too short, ignore it (i.e. < 0.33 seconds)
                if audio.size(-1) >= sr/3:
                    torchaudio.save(absoulte_path,
                        audio,
                        sr
                    )
                else:
                    continue

                metadata["audio_file"].append(audio_file)
                metadata["text"].append(sentence)
                metadata["speaker_name"].append(speaker_name)

    df = pandas.DataFrame(metadata)
    df = df.sample(frac=1)
    num_val_samples = int(len(df)*eval_percentage)

    df_eval = df[:num_val_samples]
    df_train = df[num_val_samples:]

    df_train = df_train.sort_values('audio_file')
    train_metadata_path = os.path.join(out_path, "metadata_train.csv")
    df_train.to_csv(train_metadata_path, sep="|", index=False)

    eval_metadata_path = os.path.join(out_path, "metadata_eval.csv")
    df_eval = df_eval.sort_values('audio_file')
    df_eval.to_csv(eval_metadata_path, sep="|", index=False)

    # deallocate VRAM and RAM
    del asr_model, df_train, df_eval, df, metadata
    gc.collect()

    return train_metadata_path, eval_metadata_path, audio_total_size
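
The condition quoted above only matches ASCII sentence terminators, so a transcript that uses full-width punctuation would never close a sentence. One possible workaround, offered as a sketch and not as the project's fix, is to test against a wider set of terminators. The SENTENCE_END tuple and the ends_sentence helper are hypothetical names, and the approach only helps if the transcription actually contains such punctuation.

```python
# ASCII terminators from the quoted condition plus full-width CJK ones (an assumption).
SENTENCE_END = (".", "!", "?", "。", "！", "？")


def ends_sentence(token, terminators=SENTENCE_END):
    """Return True if the token ends with one of the sentence terminators."""
    # str.endswith accepts a tuple and, unlike token[-1], is safe on empty strings.
    return token.strip().endswith(terminators)


# A bare word does not end a sentence; a word followed by a terminator does.
for token in ["但", "好。", " world.", ""]:
    print(repr(token), ends_sentence(token))
```

In the loop above, such a helper would take the place of the word.word[-1] in ["!", ".", "?"] test. Note that it strips surrounding whitespace first, which the original check does not do.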

OswaldoBornemann · 2024-04-18T08:30:09Z

So can we use a dataset that contains multiple speakers in the same language to train xtts v2?

arbianqx added the bug label Nov 24, 2023

Edresson self-assigned this Nov 27, 2023

erogol closed this as completed Nov 28, 2023

dorbodwolf mentioned this issue Dec 31, 2023

xtts ft demo: empty csv files with the format_audio_list #3480

Closed
