
[experimental] Lazy manifest loading for huge datasets #286

Merged: 13 commits into master on Apr 28, 2021

Conversation

pzelasko
Collaborator

This is an early draft and not a complete solution (but possibly not that far from it).

The issue being solved is that huge corpora, such as the 40k-hour English portion of MLS, currently cannot be handled well by Lhotse, since reading them from disk consumes a lot of time and memory (creating them is also costly, but that will be addressed separately).

Fortunately, there seems to be an elegant way to solve this with relatively few changes to the library. Apache Arrow allows us to mmap (compressed or uncompressed) JSONL files and iterate over them surprisingly quickly, avoiding both the loading-speed and the memory issues. I've tried to explain how it works in the code comments; here I'll show some measurements instead.
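
A minimal sketch of what this looks like with pyarrow (the file name is illustrative, and this is not the exact Lhotse code):

```python
import pyarrow.json as paj

# Read a JSONL manifest into an Arrow Table; the schema is auto-discovered.
table = paj.read_json("supervisions.jsonl")
print(table.num_rows)  # length is known without materializing Python objects

# Iterate in chunks ("record batches"); each row comes back as a plain dict
# that can be deserialized into a SupervisionSegment on the fly.
for batch in table.to_batches():
    for row in batch.to_pylist():
        ...  # e.g. SupervisionSegment.from_dict(row)
```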

The supervision set JSONL is 1GB on disk (even though it's gzip-compressed) and contains ~11M supervision segments. "Loading" takes ~13s wall time, which is very low for this size. Checking the length costs nothing, but retrieving a single item is very slow (~5s):
[screenshot: timings for loading, len(), and single-item retrieval]

On the other hand, iteration is fairly fast, with dicts being deserialized into SupervisionSegments on the fly:
[screenshot: timings for iteration]

The next steps involve adjusting this further to work well with the samplers so that they can leverage the batched reads; the overhead coming from mmap shouldn't affect training, as it will happen in the dataloader worker processes. I'll also need to add a utility that speeds up creating the manifests and writes them to disk in a streaming way; a rough sketch of that idea is shown below. Then we should be able to fully support very large corpora.
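
For illustration, streaming manifest creation could look roughly like this (a hypothetical helper, not the actual Lhotse utility; generate_supervisions() stands in for whatever produces the manifests):

```python
import gzip
import json
from contextlib import contextmanager

@contextmanager
def jsonl_gz_writer(path):
    # Append manifests to a gzipped JSONL file one at a time, so the full
    # manifest never has to be materialized in memory before writing.
    with gzip.open(path, "wt") as f:
        def write(manifest_dict):
            f.write(json.dumps(manifest_dict) + "\n")
        yield write

# with jsonl_gz_writer("supervisions.jsonl.gz") as write:
#     for sup in generate_supervisions():
#         write(sup.to_dict())
```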

@pzelasko pzelasko added this to the v0.7 milestone Apr 27, 2021
@pzelasko pzelasko linked an issue Apr 27, 2021 that may be closed by this pull request
@danpovey
Collaborator

Interesting. I wonder how easy pyarrow is to install and how heavy a dependency it is.
I'm concerned about making the dependencies too big, and I'm wondering whether it could be optional.

@pzelasko
Collaborator Author

It is as simple as pip install pyarrow. We can make it optional.
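
For reference, making it optional could be as simple as guarding the import (a minimal sketch with illustrative names, not the actual Lhotse code):

```python
# Lhotse keeps working without pyarrow; only the lazy-loading path needs it.
try:
    import pyarrow as pa
    HAVE_ARROW = True
except ImportError:
    HAVE_ARROW = False

def load_manifest_lazy(path):
    if not HAVE_ARROW:
        raise ImportError("Lazy manifest loading requires pyarrow: pip install pyarrow")
    ...
```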

@pzelasko
Collaborator Author

I found out that the JSONL reading code did not use mmap, but instead read everything into memory (although into a memory-efficient layout). 11M utterances take about 10-15GB of memory in that scenario, which is still not bad.

I added support for storing the Arrow table that we get from reading JSONL in the "Arrow streaming binary format", which does support mmap. It is ~3.5x larger on disk than the compressed JSONL (1.5G .jsonl.gz vs 5.3G .arrow), but I verified that the program uses only ~300MB of memory to read and iterate over it. I don't support creating the binary-format files directly (without going through JSONL serialization first), as that would have required explicitly adding schemas for all the manifests, and I don't want to do that yet (the JSON reading code auto-discovers the schema, but that isn't exposed in pyarrow's Python API).
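
A rough sketch of the conversion and the mmap read-back (file names illustrative; the actual Lhotse code differs in the details):

```python
import pyarrow as pa
import pyarrow.json as paj

# Convert the JSONL manifest into the Arrow streaming binary format.
table = paj.read_json("supervisions.jsonl")
with pa.OSFile("supervisions.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        for batch in table.to_batches():
            writer.write_batch(batch)

# mmap the binary file; only the pages actually touched while iterating are
# pulled into memory, hence the small (~300MB) resident footprint.
with pa.memory_map("supervisions.arrow", "rb") as source:
    reader = pa.ipc.open_stream(source)
    for batch in reader:
        ...  # process one record batch (chunk) at a time
```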

I also tweaked the LazyDict implementation to make it faster for sequential reads of single items: we cache one chunk of manifests in memory and, if the key is not found there, we rotate a deque that holds pointers to the chunks (called "batches" in Arrow), read a new chunk, and try again. This lets me collect 10-15 batches per second from a sampler with max_duration=600, which is slow compared to in-memory manifests, but maybe good enough to try training something this way and see whether it needs more optimization.
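
The lookup logic is roughly the following (an illustrative sketch, not the exact LazyDict implementation):

```python
from collections import deque

class ChunkCachedLookup:
    """Keep one Arrow record batch deserialized as a dict and rotate through
    the batches on a miss; sequential reads mostly hit the cached chunk."""

    def __init__(self, batches):
        self.batches = deque(batches)  # pointers to Arrow record batches
        self.current = {}              # id -> manifest dict for the cached chunk

    def _advance(self):
        # Rotate the deque and deserialize the next chunk into a dict.
        self.batches.rotate(-1)
        self.current = {row["id"]: row for row in self.batches[0].to_pylist()}

    def __getitem__(self, key):
        if not self.batches:
            raise KeyError(key)
        # Scan each chunk at most once before giving up.
        for _ in range(len(self.batches) + 1):
            if key in self.current:
                return self.current[key]
            self._advance()
        raise KeyError(key)
```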

@pzelasko
Collaborator Author

I got a further 10x speedup for sequential access by using a dict instead of a pandas query to search within the chunks. I'll merge this since it should not affect any existing workflow, and I will likely keep improving it in other PRs.

@pzelasko pzelasko merged commit 26fa279 into master Apr 28, 2021
@ngoel17

ngoel17 commented Dec 20, 2022

If I wanted to load a Huggingface Arrow-format dataset directly (such as peoples_speech), is that also possible, or will I need to export it as a manifest and then load it into lhotse?

@pzelasko
Collaborator Author

pzelasko commented Jan 2, 2023

> If I wanted to load a Huggingface Arrow-format dataset directly (such as peoples_speech), is that also possible, or will I need to export it as a manifest and then load it into lhotse?

I removed the Arrow support from Lhotse a while ago, but it was incompatible with what huggingface/datasets does anyway. If you want to leverage HuggingFace datasets, the easiest way would be to implement a wrapper class (similar to https://github.com/lhotse-speech/lhotse/blob/master/lhotse/lazy.py#L196) that iterates the HF dataset and converts each item to a Lhotse Cut object on the fly (e.g. HuggingfaceDatasetIterator), and then create a CutSet like cuts = CutSet(HuggingfaceDatasetIterator(args, ...)).
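
For example, a rough sketch of such a wrapper (the field names "audio", "text", and "id" are assumptions about the HF dataset schema and would need to be adapted):

```python
from lhotse import CutSet, Recording, SupervisionSegment
from lhotse.cut import MonoCut

class HuggingfaceDatasetIterator:
    # Wraps an HF dataset and lazily yields Lhotse cuts, one item at a time.
    def __init__(self, hf_dataset):
        self.hf_dataset = hf_dataset

    def __iter__(self):
        for item in self.hf_dataset:
            # Assumed schema: item["audio"]["path"], item["text"], item["id"].
            recording = Recording.from_file(item["audio"]["path"])
            yield MonoCut(
                id=item["id"],
                start=0.0,
                duration=recording.duration,
                channel=0,
                recording=recording,
                supervisions=[
                    SupervisionSegment(
                        id=item["id"],
                        recording_id=recording.id,
                        start=0.0,
                        duration=recording.duration,
                        text=item["text"],
                    )
                ],
            )

# cuts = CutSet(HuggingfaceDatasetIterator(hf_dataset))
```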
