
coqui_stt_training.train never finishes #2195

Open
FrontierDK opened this issue Apr 19, 2022 · 27 comments

Labels
bug Something isn't working

Comments

@FrontierDK

Hi all.

I am having an issue where coqui_stt_training.train never finishes. Even if I try using --epochs 3, it just waits forever for "something".

[screenshot: Python1]

This is where it is just waiting. CPU usage has dropped to 0% - and stays there.
[screenshot: Python2]

Is there any way of seeing what it is waiting for?

FrontierDK added the bug label Apr 19, 2022
@HarikalarKutusu

We replicated the issue. I can confirm it occurs on Colab, and @wasertech reproduced it on a local machine.

@wasertech (Collaborator) commented Apr 20, 2022

Yep, check out the full logs here: https://gist.github.com/wasertech/1cbde2c1399148cf9954fc42d272265b.
The process hangs at the end without returning to the shell, making automation really difficult.

EDIT: I've also tried without the testing interface and this time it gets stuck after closing the session: https://gist.github.com/wasertech/4684fabe991718190a5c3245f0c0c187

@HarikalarKutusu commented Apr 20, 2022

I checked with pip (v1.3.0), git clone v1.3.0, and git clone main; all failed.
I could not get past another bug with v1.2.0 ( #2110 ),
but it worked OK with git clone v1.1.0.

@reuben (Collaborator) commented Apr 21, 2022

Can you provide a script to reproduce locally? Are you using GPUs/CUDA? Does this only happen on Colab?

@wasertech (Collaborator) commented Apr 21, 2022

I can confirm it hangs on GPU with Docker and inside Colab too. I think @FrontierDK was using CPU but I'm not sure.
Run 3 epochs on any dataset with any batch size using python -m coqui_stt_training.train; whether you use only the training interface or combine it with the testing one, the process never returns an exit code after it is done.
This behavior has only been noticed on 1.3.0 and above so far.
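
For reference, a minimal reproduction along those lines would look roughly like this (the CSV paths are placeholders, not from an actual setup):

python -m coqui_stt_training.train \
  --train_files /data/train.csv \
  --dev_files /data/dev.csv \
  --test_files /data/test.csv \
  --epochs 3
# training (and testing, when --test_files is passed) completes,
# but the process never returns to the shell with an exit code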

@FrontierDK (Author)

I think @FrontierDK was using CPU but I'm not sure.

I was/am using CPU - true :)

@wasertech (Collaborator) commented Apr 21, 2022

So I tried the old train interface to run the same test, and it still doesn't return any exit code. https://gist.github.com/wasertech/a7bd3ae2606e143bf70a540972c3314b

@wasertech (Collaborator)

It also hangs after doing only one epoch: https://gist.github.com/wasertech/e6a460532c0c8ee9f9ea4ed06073194f

@wasertech (Collaborator)

@reuben #2198 is a script to reproduce locally. Just run the script with some arguments like:

path_to_mailabs="/mnt/extracted/data/M-AILABS/"
mailabs_lang="fr_FR"
path_to_alphabet="/mnt/models/alphabet.txt"
path_to_scorer="/mnt/lm/kenlm.scorer"
./bin/run-ci-mailabs_time.sh ${path_to_mailabs} ${mailabs_lang} ${path_to_alphabet} ${path_to_scorer}

As you can see in my logs, it never reaches the lines below line 39 to print the execution time.

@wasertech (Collaborator)

I managed to fix it in this branch by commenting out the memory test.

@wasertech (Collaborator)

So with #2205, you'll be able to skip the memory test that is causing this issue by running coqui_stt_training.train with the flag --skip_batch_test set to true.
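
Once that's in, the invocation would look something like this (the checkpoint and CSV paths here are placeholders):

python -m coqui_stt_training.train \
  --checkpoint_dir ~/checkpoints \
  --train_files ~/speech/train.csv \
  --dev_files ~/speech/dev.csv \
  --epochs 3 \
  --skip_batch_test true
# skipping the batch-size memory test avoids the hang described above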

@FrontierDK (Author)

When I try using --skip_batch_test=true I get train.py: error: unrecognized arguments: --skip_batch_test=true

@wasertech (Collaborator) commented Apr 26, 2022

@FrontierDK Again, #2205 is not merged into main. As long as #2205 is open, the code has not been merged into Coqui yet. Wait for @reuben to accept my proposal, or check out my branch directly: https://github.com/wasertech/STT/tree/workaround_mem_test
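
If you want to try it before the PR lands, something like this should work (the install step may differ depending on your setup):

git clone --branch workaround_mem_test https://github.com/wasertech/STT
cd STT
pip install -e .  # install the training code from the checkout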

@FrontierDK (Author) commented Apr 26, 2022

My command line:
python3 -m coqui_stt_training.train --checkpoint_dir ~/coqui-stt-1.3.0-checkpoint --train_files ~/speech/sample.csv --dev_files ~/speech/dev.csv --test_files ~/speech/test.csv --n_hidden 2048 --load_cudnn false --epochs 3 --skip_batch_size true

Then I get train.py: error: unrecognized arguments: --skip_batch_size true

wasertech, I tried your fix yesterday and it worked :)

@wasertech (Collaborator)

So with #2205 merged, you can now use the --skip_batch_test flag when running the coqui_stt_training.train module.

@FrontierDK (Author)

I don't think it has been merged yet... I just tried installing a new VM and I still get this error:
train.py: error: unrecognized arguments: --skip_batch_size

This should have gotten the latest changes into the VM:
git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/coqui-ai/STT

@HarikalarKutusu

Hmmm. Common Voice v9.0 is out and I want to start training some languages. It seems I will be working from a patched fork :(

@wasertech (Collaborator) commented Apr 28, 2022

@FrontierDK It is merged. I've just rebuilt a docker container from main and tested the following:

❯ docker build --rm -f Dockerfile.train -t stt-train .
...
Successfully tagged stt-train:latest

❯ docker run -it --entrypoint 'bash' stt-train
root@3412bf80716e:/code# cat VERSION
1.4.0-alpha.1
root@3412bf80716e:/code# python -m coqui_stt_training.train --help
...
--skip_batch_test true/false
                        Coqpit Field: skip batch size memory test before
                        training
...

As expected. Full logs

@HarikalarKutusu Why? Doesn't it work for you either?

@HarikalarKutusu

@wasertech 👍 Thanks for the heads-up, I haven't tried it yet; I just assumed it had not been merged.

@FrontierDK (Author) commented Apr 28, 2022

I just installed a new/fresh VM, still seeing the error here.

As yesterday, Coqui was installed with git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/coqui-ai/STT

Update: I needed to copy/paste from the right place; I just used --skip_batch_test and it's not throwing an error now =)

@hammondm commented Apr 30, 2022

Has this change made it into the docker image yet?

this: ghcr.io/coqui-ai/stt-train:latest

Update: the pre-built image is not yet updated but, like the person below, I was able to rebuild it from Dockerfile.train in the latest repo.

@HarikalarKutusu

I could get it by cloning the main repo...

@wasertech (Collaborator)

Has this change made it into the docker image yet?

this: ghcr.io/coqui-ai/stt-train:latest

Update: the pre-built image is not yet updated but, like the person below, I was able to rebuild it from Dockerfile.train in the latest repo.

Well, I found this image, which should contain the latest fixes, but it doesn't bear the 'latest' tag.

@reuben (Collaborator) commented May 15, 2022

latest means latest stable release. We tag builds from main with the main tag as well: https://github.com/coqui-ai/STT/pkgs/container/stt-train/20747534?tag=main
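
So to get the latest fixes from main without waiting for a stable release, you can pull that tag directly:

docker pull ghcr.io/coqui-ai/stt-train:main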

wasertech added a commit to wasertech/STT that referenced this issue Jun 4, 2022
We don't want to test batch size here. It causes the process to never die (see coqui-ai#2195) and is not the intent of the test anyway.
@bernardohenz

I'm facing the same issue: training just hangs at the end.

I've tested with the following docker images:

  • ghcr.io/coqui-ai/stt-train:latest
  • ghcr.io/coqui-ai/stt-train:597e6ebb5c1a398c10b3a84810f93c1ce96d7926
  • ghcr.io/coqui-ai/stt-train:e7d2af96bd16556c7ef828a660acc881dd51c4b
  • docker building from master

All of them finished training but kept hanging there :(

@wasertech (Collaborator)

Yes, @bernardohenz. But as I've said in my comments above:
Since STT 1.4.0, ...

..., you can now use the --skip_batch_test flag when running the coqui_stt_training.train module.

This will skip the memory test that creates this issue.

@bernardohenz

Thanks @wasertech, I didn't get it at first, sorry 👍
