Fast loading of cut audio in dataloader #955

desh2608 · 2023-01-25T20:00:36Z

desh2608
Jan 25, 2023
Collaborator

Recently, I have heard from several folks using the clusters at JHU about the following issue. Consider a situation where you are fine-tuning one of the SSL models (e.g. HuBERT) on your own data. Suppose the input data is represented as cuts of long recordings: for example, the recordings are ~30 min long and the cuts are ~10 s. Since the models work with raw audio, we don't precompute any features, but just load the audio on-the-fly in the dataloader.

The problem seems to be that repeatedly fetching the full recording and extracting the cut segment from it creates an I/O bottleneck, since our clusters have a slow inter-node network. This leads to low GPU utilization. What would be the best strategy to overcome this? I was wondering whether Lhotse Shar archives might help here.

Since some folks are new to Lhotse, it would be great to have a small example of using Lhotse Shar.


desh2608 · 2023-01-25T20:01:33Z

desh2608
Jan 25, 2023
Collaborator Author

Tagging @m-wiesner and @efrathason.

0 replies

pzelasko · 2023-01-25T22:12:27Z

pzelasko
Jan 25, 2023
Maintainer

Cuts are already implemented this way, i.e. they load only the relevant subset of audio data from disk, not the full recording*. But that's often not nearly enough on slow clusters with magnetic disks and slow interconnects. Usually you end up getting bottlenecked by random access reads, which can be even 100x slower than sequential reads, because the recording/other data is fragmented all over a magnetic disk and it takes quite a while to physically find it.
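To make "load only the relevant subset" concrete, here is a minimal sketch of a partial read using only Python's standard `wave` module. It is not Lhotse's implementation; the helper names `write_test_wav` and `read_segment`, the 8 kHz sample rate, and the counter-valued test signal are all assumptions made up for illustration.

```python
import os
import struct
import tempfile
import wave


def write_test_wav(path, num_seconds, sample_rate=8000):
    """Write a mono 16-bit WAV whose samples are a simple counter (made-up test data)."""
    num_samples = num_seconds * sample_rate
    samples = [i % 32768 for i in range(num_samples)]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{num_samples}h", *samples))


def read_segment(path, offset_seconds, duration_seconds):
    """Seek to the start of the segment and read only its frames."""
    with wave.open(path, "rb") as f:
        sample_rate = f.getframerate()
        start_frame = int(round(offset_seconds * sample_rate))
        num_frames = int(round(duration_seconds * sample_rate))
        f.setpos(start_frame)  # jump past everything before the segment
        raw = f.readframes(num_frames)  # read just the requested frames
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return samples, sample_rate


# Stand-in for a "long recording" (20 s here) and a 2 s "cut" starting at 5 s.
with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = os.path.join(tmp_dir, "recording.wav")
    write_test_wav(wav_path, num_seconds=20)
    segment, sr = read_segment(wav_path, offset_seconds=5.0, duration_seconds=2.0)

print(len(segment), sr)
```

Even with a partial read like this, each cut still costs a file open plus a seek, which is where the random-access penalty described above comes from.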

Lhotse Shar is definitely an answer to that, but I currently can't find a spare moment to write up the tutorial. In many cases, though, it will be sufficient to use WebDataset, which offers pretty much the same I/O speed-up advantages. Please check out the Lhotse+WebDataset tutorial to get started; it will definitely help with reading speeds on the CLSP cluster (note: the WebDataset export will be slow, but it's a one-time cost).
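The idea behind sharded tar archives can be sketched with the standard library alone: pack many small items into a few tar files once, then stream each file front to back. This is only an illustration of the principle, not the WebDataset or Lhotse Shar API; `write_shards`, `iter_shards`, the shard naming scheme, and the dummy payloads are all hypothetical.

```python
import io
import os
import tarfile
import tempfile


def write_shards(items, out_dir, shard_size):
    """Pack (key, bytes) pairs into tar shards of up to `shard_size` items each."""
    paths = []
    tar = None
    for idx, (key, payload) in enumerate(items):
        if idx % shard_size == 0:
            # Start a new shard every `shard_size` items.
            if tar is not None:
                tar.close()
            path = os.path.join(out_dir, f"shard-{idx // shard_size:06d}.tar")
            paths.append(path)
            tar = tarfile.open(path, "w")
        info = tarfile.TarInfo(name=key)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return paths


def iter_shards(paths):
    """Stream every shard front to back, yielding (key, bytes) in stored order."""
    for path in paths:
        # Mode "r|" treats the tar as a forward-only stream, so no seeking happens.
        with tarfile.open(path, "r|") as tar:
            for member in tar:
                yield member.name, tar.extractfile(member).read()


# Ten dummy "cuts" of 100 bytes each, packed four per shard.
items = [(f"cut-{i:04d}.raw", bytes([i]) * 100) for i in range(10)]

with tempfile.TemporaryDirectory() as tmp_dir:
    shard_paths = write_shards(items, tmp_dir, shard_size=4)
    loaded = list(iter_shards(shard_paths))

print(len(shard_paths), len(loaded))
```

The trade-off is the one mentioned above: writing the shards is a one-time cost, and afterwards the items come back in stored order, so any shuffling has to happen at the shard level or in a buffer.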

* Disclaimer: unless you use the "command" AudioSource type (e.g. when importing data from Kaldi where wav.scp used complex pipes as inputs) in which case it is impossible to do a partial read. The code would emit a warning to keep the user informed in such cases.

3 replies

desh2608 Jan 26, 2023
Collaborator Author

Using precomputed features does not create this bottleneck, even if accessed randomly. I suppose the issue may be that the original wav files are stored on slower nodes.

pzelasko Jan 26, 2023
Maintainer

Precomputed features compressed with lilcom are usually 70% smaller and may be faster to decode than e.g. FLAC. Generally, you can measure this by iterating over the data loader and looking at the timestamps between iterations.
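A minimal sketch of that measurement, assuming only that the data loader is a plain Python iterable: `iteration_gaps` and the `slow_loader` stand-in are hypothetical names, and in practice you would pass your real data loader instead.

```python
import time


def iteration_gaps(loader, max_batches=None):
    """Return the wall-clock seconds spent waiting for each batch."""
    gaps = []
    iterator = iter(loader)
    prev = time.perf_counter()
    while max_batches is None or len(gaps) < max_batches:
        try:
            next(iterator)  # no training step here, so this is pure loading time
        except StopIteration:
            break
        now = time.perf_counter()
        gaps.append(now - prev)
        prev = now
    return gaps


def slow_loader(num_batches, delay):
    """Stand-in for a data loader whose batches each take `delay` seconds to fetch."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


gaps = iteration_gaps(slow_loader(num_batches=5, delay=0.01))
print(f"mean wait per batch: {sum(gaps) / len(gaps):.4f} s")
```

Running the same loop once over raw audio cuts and once over precomputed features gives a direct comparison of the per-batch wait.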

pzelasko Jan 26, 2023
Maintainer

Also, reading precomputed features generally doesn't require opening new files all the time, since the file handles are cached.
