
Demanding CPU Utilization? #5

Open

JeromeNi opened this issue Jan 10, 2022 · 3 comments

@JeromeNi
I created my own dataset based on the provided templates. The training set consists of around 100 hours of audio (about 18,000 utterances), while the dev evaluation set consists of around 2,000 utterances. I also extracted all mel-spectrograms beforehand instead of computing them on the fly. I trained with 1 GPU (V100).

First, I found that even loading the dev set takes around an hour. During training, the code sometimes just hangs at a single step, and during that time I see 100% CPU utilization for all the workers while GPU utilization in nvidia-smi is 0%. I tried setting num_workers to 0, 8, and 80 (the total number of CPUs), and this happens in all three cases. With 80 workers, I only managed to complete an initial validation check and two training epochs in around 10 hours.

Is this normal, and is there any way to speed it up?

Thanks for your help!

@dhchoi99 (Owner)

Yes, that also happened to me. I found that the Parselmouth/Praat augmentation hangs for some unknown reason. It seems the problem occurs when an inappropriate voice segment is passed to Parselmouth.

Switching the order in the dataset code from (1) crop the audio, (2) augment the audio to (1) augment the audio, (2) crop the audio worked fine for me.
I fixed it in 2a234ba, so please try the most recent master branch.
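
For reference, a minimal sketch of that reordering is below (function and argument names are hypothetical; the actual change is in the repo's dataset code). The point is that the Praat-based perturbation always sees the full utterance instead of a possibly too-short crop:

```python
# Hypothetical sketch of the reordering described above. `augment_fn`
# stands in for the Parselmouth/Praat-based perturbation.
import random
from typing import Callable

import numpy as np


def old_order(wav: np.ndarray, crop_len: int,
              augment_fn: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """1. crop, 2. augment -- Praat may hang on a short or degenerate crop."""
    start = random.randint(0, max(0, len(wav) - crop_len))
    return augment_fn(wav[start:start + crop_len])


def new_order(wav: np.ndarray, crop_len: int,
              augment_fn: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """1. augment, 2. crop -- the perturbation always sees the full utterance."""
    augmented = augment_fn(wav)
    start = random.randint(0, max(0, len(augmented) - crop_len))
    return augmented[start:start + crop_len]
```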

@dhchoi99 (Owner) commented Feb 9, 2022

After fixing the issue from YannickJadoul/Parselmouth#68 (f7bddba),
dataloader speed seems to be a problem related to hardware specs.

For me, when using

  • 1 Tesla V100 GPU + 32 Intel(R) Xeon(R) Silver 4110 CPUs,
  • batch_size=32, num_workers=16,
  • no pre-extracted mel-spectrograms,

I averaged about 1.4 s/it while looping over the first epoch.

Although it may depend on the length of each utterance, your case sounds somewhat odd.
Could you share your hardware specs?
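
For context, the loader configuration behind those numbers is roughly the sketch below (the dataset class is a dummy placeholder, not the repo's actual Dataset, and the real settings live in the config files):

```python
# Rough, self-contained sketch of the DataLoader settings used for the
# timing above; only batch_size and num_workers come from the comment.
import torch
from torch.utils.data import DataLoader, Dataset


class DummyWavDataset(Dataset):
    """Placeholder standing in for the repo's actual training dataset."""

    def __len__(self) -> int:
        return 18000

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.zeros(16000)  # stand-in for one second of audio


loader = DataLoader(
    DummyWavDataset(),
    batch_size=32,    # matches the benchmark above
    num_workers=16,   # matches the benchmark above
    shuffle=True,
)
```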

dhchoi99 reopened this Feb 9, 2022
@JeromeNi (Author) commented Feb 12, 2022

I was previously using an IBM Power9 (ppc64le) node with 80 CPUs and 1 of the 4 Tesla V100s on that node, as multi-GPU training got stuck during initialization. I tried num_workers values from 0 to 80, and more workers was always faster, but never faster than 14 s/it, and it was prone to getting stuck on some iterations for much, much longer. However, that might be because I was adapting NANSY to 16 kHz LibriSpeech utterances, which are very long.

I haven't tried the newest commit here yet; I will check whether the issue is resolved once some server bandwidth opens up between my current projects.
