
coqui_stt_training.train never finishes #2195

Open
FrontierDK opened this issue Apr 19, 2022 · 27 comments

Labels
bug Something isn't working

Comments

@FrontierDK

Hi all.

I am having an issue where coqui_stt_training.train never finishes. Even if I try using --epochs 3, it just waits forever for "something".

[screenshot: Python1]

This is where it is just waiting. CPU usage has dropped to 0% - and stays there.
[screenshot: Python2]

Is there any way of seeing what it is waiting for?

FrontierDK added the bug label Apr 19, 2022
@HarikalarKutusu

We replicated the issue. I can confirm it occurs on Colab, and @wasertech reproduced it on a local machine.

@wasertech (Collaborator) commented Apr 20, 2022

Yep, check out the full logs here: https://gist.github.com/wasertech/1cbde2c1399148cf9954fc42d272265b.
The process hangs at the end without returning to the shell, making automation really difficult.

EDIT: I've also tried without the testing interface and this time it gets stuck after closing the session: https://gist.github.com/wasertech/4684fabe991718190a5c3245f0c0c187

@HarikalarKutusu commented Apr 20, 2022

I checked with pip (v1.3.0), git clone v1.3.0, and git clone main; all failed.
I could not get past another bug with v1.2.0 ( #2110 ),
but it worked OK with git clone v1.1.0.

@reuben (Collaborator) commented Apr 21, 2022

Can you provide a script to reproduce locally? Are you using GPUs/CUDA? Does this only happen on Colab?

@wasertech (Collaborator) commented Apr 21, 2022

I can confirm it hangs on GPU with Docker and inside Colab too. I think @FrontierDK was using CPU but I'm not sure.
Run 3 epochs on any dataset with any batch size using python -m coqui_stt_training.train; whether you use only the training interface or combine it with the testing one, the process never returns an exit code after it is done.
This behavior has only been noticed on 1.3.0 and above so far.
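
For reference, a minimal reproduction along those lines would look roughly like this (the CSV paths are placeholders, not from an actual setup):

python -m coqui_stt_training.train \
  --train_files /data/train.csv \
  --dev_files /data/dev.csv \
  --test_files /data/test.csv \
  --epochs 3
# training (and testing, when --test_files is passed) completes,
# but the process never returns to the shell with an exit code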

@FrontierDK (Author)

I think @FrontierDK was using CPU but I'm not sure.

I was/am using CPU - true :)

@wasertech (Collaborator) commented Apr 21, 2022

So I tried the old train interface to run the same test, and it still doesn't return any exit code. https://gist.github.com/wasertech/a7bd3ae2606e143bf70a540972c3314b

@wasertech (Collaborator)

It also hangs after doing only one epoch: https://gist.github.com/wasertech/e6a460532c0c8ee9f9ea4ed06073194f

@wasertech (Collaborator)

@reuben #2198 is a script to reproduce locally. Just run the script with some arguments like:

path_to_mailabs="/mnt/extracted/data/M-AILABS/"
mailabs_lang="fr_FR"
path_to_alphabet="/mnt/models/alphabet.txt"
path_to_scorer="/mnt/lm/kenlm.scorer"
./bin/run-ci-mailabs_time.sh ${path_to_mailabs} ${mailabs_lang} ${path_to_alphabet} ${path_to_scorer}

As you can see in my logs, it never reaches the lines below line 39 to print the execution time.

@wasertech (Collaborator)

I managed to fix it in this branch by commenting out the memory test.

@wasertech (Collaborator)

So with #2205, you'll be able to skip the memory test that is causing this issue by running coqui_stt_training.train with the flag --skip_batch_test set to true.
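
Once that's in, the invocation would look something like this (the checkpoint and CSV paths here are placeholders):

python -m coqui_stt_training.train \
  --checkpoint_dir ~/checkpoints \
  --train_files ~/speech/train.csv \
  --dev_files ~/speech/dev.csv \
  --epochs 3 \
  --skip_batch_test true
# skipping the batch-size memory test avoids the hang described above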

@FrontierDK (Author)

When I try using --skip_batch_test=true I get train.py: error: unrecognized arguments: --skip_batch_test=true

@wasertech (Collaborator) commented Apr 26, 2022

@FrontierDK Again, #2205 is not merged into main. As long as #2205 is open, the code has not been merged into Coqui yet. Wait for @reuben to accept my proposal, or check out my branch directly: https://github.com/wasertech/STT/tree/workaround_mem_test
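
If you want to try it before the PR lands, something like this should work (the install step may differ depending on your setup):

git clone --branch workaround_mem_test https://github.com/wasertech/STT
cd STT
pip install -e .  # install the training code from the checkout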

@FrontierDK (Author) commented Apr 26, 2022

My command line:
python3 -m coqui_stt_training.train --checkpoint_dir ~/coqui-stt-1.3.0-checkpoint --train_files ~/speech/sample.csv --dev_files ~/speech/dev.csv --test_files ~/speech/test.csv --n_hidden 2048 --load_cudnn false --epochs 3 --skip_batch_size true

Then I get train.py: error: unrecognized arguments: --skip_batch_size true

wasertech, I tried your fix yesterday and it worked :)

@wasertech (Collaborator)

So with #2205 merged, you can now use the --skip_batch_test flag when running the coqui_stt_training.train module.

@FrontierDK (Author)

I don't think it has been merged yet... I just tried installing a new VM and I still get this error:
train.py: error: unrecognized arguments: --skip_batch_size

This should have gotten the latest changes into the VM:
git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/coqui-ai/STT

@HarikalarKutusu

Hmmm. Common Voice v9.0 is out and I want to start training some languages. It seems I will be working from a patched fork :(

@wasertech (Collaborator) commented Apr 28, 2022

@FrontierDK It is merged. I've just rebuilt a docker container from main and tested the following:

❯ docker build --rm -f Dockerfile.train -t stt-train .
...
Successfully tagged stt-train:latest

❯ docker run -it --entrypoint 'bash' stt-train
root@3412bf80716e:/code# cat VERSION
1.4.0-alpha.1
root@3412bf80716e:/code# python -m coqui_stt_training.train --help
...
--skip_batch_test true/false
                        Coqpit Field: skip batch size memory test before
                        training
...

As expected. Full logs

@HarikalarKutusu Why? Doesn't it work for you either?

@HarikalarKutusu

@wasertech 👍 Thanks for the heads-up, I haven't tried it yet; I just assumed it had not been merged.

@FrontierDK (Author) commented Apr 28, 2022

I just installed a new/fresh VM, still seeing the error here.

As yesterday, Coqui was installed with git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/coqui-ai/STT

Update: I needed to copy/paste from the right place; I just used --skip_batch_test and it's not throwing an error now =)

@hammondm commented Apr 30, 2022

Has this change made it into the docker image yet?

this: ghcr.io/coqui-ai/stt-train:latest

Update: the pre-built image is not yet updated but, like the person below, I was able to rebuild it from Dockerfile.train in the latest repo.

@HarikalarKutusu

I could get it by cloning the main repo...

@wasertech (Collaborator)

Has this change made it into the docker image yet?

this: ghcr.io/coqui-ai/stt-train:latest

Update: the pre-built image is not yet updated but, like the person below, I was able to rebuild it from Dockerfile.train in the latest repo.

Well, I found this image, which should contain the latest fixes, but it doesn't bear the 'latest' tag.

@reuben (Collaborator) commented May 15, 2022

latest means latest stable release. We tag builds from main with the main tag as well: https://github.com/coqui-ai/STT/pkgs/container/stt-train/20747534?tag=main
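
So to get the latest fixes from main without waiting for a stable release, you can pull that tag directly:

docker pull ghcr.io/coqui-ai/stt-train:main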

wasertech added a commit to wasertech/STT that referenced this issue Jun 4, 2022
We don't want to test batch size here. It causes the process to never die (see coqui-ai#2195) and is not the intent of the test anyway.
@bernardohenz

I'm facing the same issue: training just hangs at the end.

I've tested with the following docker images:

  • ghcr.io/coqui-ai/stt-train:latest
  • ghcr.io/coqui-ai/stt-train:597e6ebb5c1a398c10b3a84810f93c1ce96d7926
  • ghcr.io/coqui-ai/stt-train:e7d2af96bd16556c7ef828a660acc881dd51c4b
  • docker building from master

All of them finished training but kept hanging there :(

@wasertech (Collaborator)

Yes, @bernardohenz. But as I've said in my comments above:
Since STT 1.4.0, ...

..., you can now use the --skip_batch_test flag when running the coqui_stt_training.train module.

This will skip the memory test that creates this issue.

@bernardohenz

Thanks @wasertech, I didn't get it at first, sorry 👍
