coqui_stt_training.train never finishes #2195
Comments
We replicated the issue. I confirm the issue exists on Colab, and so does @wasertech (who ran it on a local machine). |
Yep, check out the full logs here: https://gist.github.com/wasertech/1cbde2c1399148cf9954fc42d272265b. EDIT: I've also tried without the testing interface, and this time it gets stuck after closing the session: https://gist.github.com/wasertech/4684fabe991718190a5c3245f0c0c187 |
I checked with pip (v1.3.0), git clone v1.3.0, and git clone main; all failed. |
Can you provide a script to reproduce locally? Are you using GPUs/CUDA? Does this only happen on Colab? |
I can confirm it hangs on GPU with Docker and inside Colab too. I think @FrontierDK was using CPU but I'm not sure. |
I was/am using CPU - true :) |
So I tried using the old train interface to run the same test, and it still doesn't return any exit code. https://gist.github.com/wasertech/a7bd3ae2606e143bf70a540972c3314b |
It also hangs after doing only one epoch: https://gist.github.com/wasertech/e6a460532c0c8ee9f9ea4ed06073194f |
@reuben #2198 is a script to reproduce locally. Just run the script with some arguments like:

```bash
path_to_mailabs="/mnt/extracted/data/M-AILABS/"
mailabs_lang="fr_FR"
path_to_alphabet="/mnt/models/alphabet.txt"
path_to_scorer="/mnt/lm/kenlm.scorer"
./bin/run-ci-mailabs_time.sh ${path_to_mailabs} ${mailabs_lang} ${path_to_alphabet} ${path_to_scorer}
```

As you can see in my logs, it never reaches the lines below line 39 to print the exec time. |
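A hedged aside: one way to make the hang visible as an exit code, instead of an indefinite wait, is to bound the reproduction run with GNU timeout. A small sketch reusing the variables above; the two-hour limit is arbitrary:

```bash
# Sketch: give the reproduction script an upper time limit so a hang becomes
# a non-zero exit code instead of an indefinite wait.
timeout --kill-after=30s 2h \
  ./bin/run-ci-mailabs_time.sh ${path_to_mailabs} ${mailabs_lang} ${path_to_alphabet} ${path_to_scorer}
echo "exit code: $?"   # 124 (or 137 after the KILL signal) means the run never finished on its own
```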
I managed to fix it in this branch by commenting out the memory test. |
So with #2205 you'll be able to skip the memory test that is causing this issue by using the --skip_batch_test flag. |
When I try using --skip_batch_test=true, I get: train.py: error: unrecognized arguments: --skip_batch_test=true |
@FrontierDK Again, #2205 is not merged into main yet. |
With my command line I get: train.py: error: unrecognized arguments: --skip_batch_size true. wasertech, I tried your fix yesterday and it worked :) |
So with #2205 merged, you can now use the --skip_batch_test flag. |
I don't think it has been merged yet... I just tried installing a new VM and I still get this error. This should have gotten the latest changes into the VM: |
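For reference, once #2205 is merged the flag would just be appended to the normal training invocation. A sketch with placeholder dataset paths; only --skip_batch_test is the relevant part:

```bash
# Sketch: same training command as before, with the batch-size memory test skipped.
# train.csv/dev.csv/test.csv are placeholders for your own dataset CSVs.
python -m coqui_stt_training.train \
  --train_files train.csv --dev_files dev.csv --test_files test.csv \
  --epochs 3 \
  --skip_batch_test true
```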
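If you want to check whether the build you have actually contains the new option before starting a run, the help output is enough. A small sketch:

```bash
# Sketch: confirm the installed training package knows about --skip_batch_test.
# If this prints nothing, you are still on a build without #2205.
python -m coqui_stt_training.train --help | grep -A 2 skip_batch_test
```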
Hmmm. Common Voice v9.0 is out and I want to start training some languages. It seems I will be working from a patched fork :( |
@FrontierDK It is merged. I've just rebuilt a docker container from main and tested the following:

```
❯ docker build --rm -f Dockerfile.train -t stt-train .
...
Successfully tagged stt-train:latest
❯ docker run -it --entrypoint 'bash' stt-train
root@3412bf80716e:/code# cat VERSION
1.4.0-alpha.1
root@3412bf80716e:/code# python -m coqui_stt_training.train --help
...
  --skip_batch_test true/false
                        Coqpit Field: skip batch size memory test before
                        training
...
```

As expected. @HarikalarKutusu Why? Doesn't it work for you either? |
@wasertech 👍 Thanks for the heads up. I haven't tried yet; I just assumed it was not merged. |
I just installed a new/fresh VM and I'm still seeing the error here. As yesterday, Coqui was installed with: git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/coqui-ai/STT Update: I needed to copy/paste from the right places; using skip_batch_test, it's no longer throwing an error =) |
Has this change made it into the Docker image yet? This one: ghcr.io/coqui-ai/stt-train:latest Update: the pre-built image is not yet updated, but, like the person below, I was able to rebuild it from main. |
I could get it with a git clone of the main repo... |
Well, I found this image which should contain the latest fixes, but it doesn't bear the ‘latest’ tag. |
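As an aside, a sketch of that from-source route end to end; the editable pip install from the repo root is an assumption based on the usual training setup, so adapt it to your environment and the project's training docs if they differ:

```bash
# Sketch: fresh clone of main plus an editable install of the training package.
# The pip commands are assumptions; follow the project's training docs if they differ.
git clone --depth 1 --recurse-submodules --shallow-submodules https://github.com/coqui-ai/STT
cd STT
python -m pip install --upgrade pip wheel setuptools
python -m pip install --upgrade -e .
```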
latest means latest stable release. We tag builds from main with the |
We don't want to test batch size here. It causes the process to never die (see coqui-ai#2195) and is not the intent of the test anyway.
I'm facing the same issue: training just hangs at the end. I've tested with the following docker images:
All of them finished training but then kept hanging there :( |
Yes, @bernardohenz. But as I've said in my comments above:
This will skip the memory test that creates this issue. |
Thanks @wasertech, I didn't get it at first, sorry 👍 |
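For the Docker setups mentioned above, the same flag applies. A sketch assuming a locally built stt-train image (as earlier in the thread) and training data mounted under /mnt; the paths, image name, and GPU options are assumptions:

```bash
# Sketch: run training inside a locally built stt-train image with the memory test skipped.
# --gpus all assumes the NVIDIA container toolkit is installed; drop it for CPU-only runs.
docker run -it --gpus all -v /mnt:/mnt --entrypoint bash stt-train -c \
  "python -m coqui_stt_training.train \
     --train_files /mnt/train.csv --dev_files /mnt/dev.csv --test_files /mnt/test.csv \
     --skip_batch_test true"
```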
Hi all.
I am having an issue where coqui_stt_training.train never finishes. Even if I try using --epochs 3, it just waits forever for "something".
This is where it just waits: CPU usage has dropped to 0% and stays there.
Is there any way of seeing what it is waiting for?
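An aside for anyone landing here with the same question: on Linux you can usually see what a hung Python process is blocked on by dumping its stack. A sketch assuming you can install py-spy and know the PID of the training process:

```bash
# Sketch: dump the Python stacks of the hung training process to see what it is waiting for.
# Replace <train_pid> with the PID of the training process (e.g. found via: pgrep -f coqui_stt_training).
python -m pip install py-spy
sudo py-spy dump --pid <train_pid>
```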