[experimental] Lazy manifest loading for huge datasets #286
Conversation
Interesting. I wonder how easy pyarrow is to install and how big a requirement it is.
It is as simple as pip install pyarrow. We can make it optional.
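For context, making pyarrow optional typically boils down to a guarded import. A minimal sketch (the function name and error message are illustrative, not the actual guard used in Lhotse):

```python
def require_pyarrow():
    """Import pyarrow lazily and give a helpful error if it is missing."""
    try:
        import pyarrow as pa
    except ImportError as e:
        raise ImportError(
            "Lazy manifest loading requires pyarrow; "
            "install it with: pip install pyarrow"
        ) from e
    return pa
```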
I found out that the JSONL reading code did not use mmap, but read everything into memory instead (although into a memory-efficient layout). 11M utterances take about 10-15 GB of memory in that scenario, which is still not bad. I added support for storing the Arrow table that we get from reading JSONL in something called the "Arrow streaming binary format", which does support mmap. It is ~5x larger on disk than the compressed JSONL (1.5 GB .jsonl.gz vs 5.3 GB .arrow), but I verified that the program uses only ~300 MB of memory to read and iterate it. I don't support creating the binary format files directly (without going through JSONL serialization first), as that would have required me to explicitly add schemas for all manifests, and I don't want to do that yet (the JSON reading code auto-discovers the schema, but that is not exposed in pyarrow's Python API). I also tweaked the
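To make the JSONL-to-streaming-format round trip concrete, here is a rough sketch using only public pyarrow APIs. File names are placeholders and the gzipped .jsonl.gz case is omitted for brevity, so treat this as illustrative rather than the exact code in this PR:

```python
import pyarrow as pa
import pyarrow.json as paj

# Read newline-delimited JSON into an Arrow table (schema is auto-discovered).
table = paj.read_json("supervisions.jsonl")

# Write it out in the Arrow streaming binary format, which can be mmap-ed later.
with pa.OSFile("supervisions.arrow", "wb") as sink:
    writer = pa.ipc.new_stream(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Read it back through a memory map: only the pages that are actually touched
# get loaded, so resident memory stays small even for multi-GB files.
with pa.memory_map("supervisions.arrow", "r") as source:
    reader = pa.ipc.open_stream(source)
    mmapped_table = reader.read_all()
```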
I got a further 10x speedup for sequential access by using a dict instead of a pandas query for searching within chunks. I'll merge this since it should not affect any existing workflow, and I will likely keep improving it in other PRs.
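The dict-over-pandas-query idea, sketched in a simplified form (the helper below is illustrative; the actual chunk handling in the PR is more involved):

```python
def index_chunk(batch):
    """Build an id -> row mapping for one Arrow record batch.

    Repeated lookups within the chunk then become O(1) dict accesses
    instead of a pandas query per access.
    """
    return {row["id"]: row for row in batch.to_pylist()}
```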
If I wanted to load a Huggingface Arrow-format dataset directly (such as peoples_speech), is that also possible, or would I need to export it as a manifest and then load it into Lhotse?
I removed the Arrow support from Lhotse a while ago, but it was incompatible with what huggingface/datasets does anyway. If you want to leverage huggingface datasets, the easiest way would be to implement a wrapper class (similar to https://github.com/lhotse-speech/lhotse/blob/master/lhotse/lazy.py#L196) that iterates over the HF dataset and converts it to a Lhotse Cut object on the fly (e.g.
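Not part of this PR, but the suggested wrapper pattern could look roughly like this; hf_example_to_cut is a hypothetical conversion function whose body depends on the columns of the particular HF dataset:

```python
class LazyHFDatasetIterator:
    """Iterates a huggingface/datasets dataset and yields Lhotse cuts lazily."""

    def __init__(self, hf_dataset):
        self.hf_dataset = hf_dataset

    def __iter__(self):
        for example in self.hf_dataset:
            # Hypothetical: map the HF example's fields (audio, text, ...)
            # to a Lhotse Cut; the mapping is dataset-specific.
            yield hf_example_to_cut(example)

    def __len__(self):
        return len(self.hf_dataset)
```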
This is an early draft and not a complete solution (but possibly not that far from it).
The issue being solved is that huge corpora, such as the 40k-hour English portion of MLS, cannot currently be handled well by Lhotse, as they consume a lot of time and memory to be read from disk (and to be created, but that will be addressed separately).
Fortunately, there seems to be an elegant way to solve this with not too many changes to the library. Apache Arrow makes it possible to mmap (compressed or uncompressed) JSONL files and iterate over them surprisingly quickly, avoiding both the loading speed and the memory issues. I've tried to explain how it works in the code comments; here I'll show some measurements instead.
The supervision set JSONL is 1 GB (even though it's compressed with gzip) and contains ~11M supervision segments. The "loading" time is ~13s of wall time, which is very low for this size. Checking the length costs nothing, but retrieving a single item is very slow (~5s):
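The original measurement snippet isn't reproduced here; a sketch of how such timings could be taken follows. The lazy-loading entry point and the by-id access below are hypothetical placeholders, not necessarily the API introduced by this PR:

```python
import time

from lhotse import SupervisionSet

t0 = time.time()
# Hypothetical constructor name; the PR's actual entry point may differ.
sups = SupervisionSet.from_jsonl_lazy("supervisions.jsonl.gz")
print(f"load: {time.time() - t0:.1f}s")   # ~13s reported above

print(len(sups))                          # effectively free

t0 = time.time()
seg = sups["some-supervision-id"]         # illustrative single random access, ~5s
print(f"getitem: {time.time() - t0:.1f}s")
```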
On the other hand, iteration is fairly fast, with dicts being deserialized into SupervisionSegments on the fly:
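Roughly what that on-the-fly deserialization means, assuming the mmapped Arrow table from the sketch above (SupervisionSegment.from_dict is Lhotse's standard deserializer; the row dicts are assumed to carry the usual supervision fields):

```python
from lhotse import SupervisionSegment

def iter_segments(mmapped_table):
    # Walk the table batch by batch so only small slices are materialized
    # as Python dicts at any given time.
    for batch in mmapped_table.to_batches():
        for row in batch.to_pylist():
            yield SupervisionSegment.from_dict(row)
```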
The next steps involve adjusting this further to work well with the samplers so that they can leverage batched loads; the overhead coming from mmap shouldn't affect training, as it will happen in the dataloader processes. I think I'll also need to add a utility that speeds up creating the manifests and writes them to disk in a streaming way. Then we should be able to fully support very large corpora.
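As an illustration of the streaming-write idea (not the utility this PR will add), appending manifests to a gzipped JSONL one item at a time avoids holding the whole set in memory:

```python
import gzip
import json

def write_manifests_streaming(items, path):
    """Write an iterable of Lhotse manifest items (anything with .to_dict())
    to a gzipped JSONL file without building the full set in memory."""
    with gzip.open(path, "wt") as f:
        for item in items:
            print(json.dumps(item.to_dict()), file=f)
```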