[experimental] Lazy manifest loading for huge datasets #286
Conversation
Interesting. I wonder how easy pyarrow is to install and how big a requirement it is.
It is as simple as pip install pyarrow. We can make it optional.
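For context, making pyarrow optional typically boils down to a guarded import. A minimal sketch (the function name and error message are illustrative, not the actual guard used in Lhotse):

```python
def require_pyarrow():
    """Import pyarrow lazily and give a helpful error if it is missing."""
    try:
        import pyarrow as pa
    except ImportError as e:
        raise ImportError(
            "Lazy manifest loading requires pyarrow; "
            "install it with: pip install pyarrow"
        ) from e
    return pa
```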
I found out that the JSONL reading code did not use mmap, but read everything into memory instead (although into a memory-efficient layout). 11M utterances take about 10-15 GB of memory in that scenario, which is still not bad. I added support for storing the Arrow table that we get from reading JSONL in something called the "Arrow streaming binary format", which does support mmap. It is ~5x larger on disk than the compressed JSONL (1.5 GB .jsonl.gz vs 5.3 GB .arrow), but I verified that the program uses only ~300 MB of memory to read and iterate it. I don't support creating the binary format files directly (without going through JSONL serialization first), as that would have required me to explicitly add schemas for all manifests, and I don't want to do that yet (the JSON reading code auto-discovers the schema, but that is not exposed in pyarrow's Python API). I also tweaked the
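To make the JSONL-to-streaming-format round trip concrete, here is a rough sketch using only public pyarrow APIs. File names are placeholders and the gzipped .jsonl.gz case is omitted for brevity, so treat this as illustrative rather than the exact code in this PR:

```python
import pyarrow as pa
import pyarrow.json as paj

# Read newline-delimited JSON into an Arrow table (schema is auto-discovered).
table = paj.read_json("supervisions.jsonl")

# Write it out in the Arrow streaming binary format, which can be mmap-ed later.
with pa.OSFile("supervisions.arrow", "wb") as sink:
    writer = pa.ipc.new_stream(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Read it back through a memory map: only the pages that are actually touched
# get loaded, so resident memory stays small even for multi-GB files.
with pa.memory_map("supervisions.arrow", "r") as source:
    reader = pa.ipc.open_stream(source)
    mmapped_table = reader.read_all()
```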
I got a further 10x speedup for sequential access by using a dict instead of a pandas query for searching within chunks. I'll merge this since it should not affect any existing workflow, and I will likely keep improving it in other PRs.
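The dict-over-pandas-query idea, sketched in a simplified form (the helper below is illustrative; the actual chunk handling in the PR is more involved):

```python
def index_chunk(batch):
    """Build an id -> row mapping for one Arrow record batch.

    Repeated lookups within the chunk then become O(1) dict accesses
    instead of a pandas query per access.
    """
    return {row["id"]: row for row in batch.to_pylist()}
```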
If I wanted to load a Huggingface Arrow-format dataset directly (such as peoples_speech), is that also possible, or would I need to export it as a manifest and then load it into Lhotse?
I removed the Arrow support from Lhotse a while ago, but it was incompatible with what huggingface/datasets does anyway. If you want to leverage huggingface datasets, the easiest way would be to implement a wrapper class (similar to https://github.com/lhotse-speech/lhotse/blob/master/lhotse/lazy.py#L196) that iterates over the HF dataset and converts it to a Lhotse Cut object on the fly (e.g.
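Not part of this PR, but the suggested wrapper pattern could look roughly like this; hf_example_to_cut is a hypothetical conversion function whose body depends on the columns of the particular HF dataset:

```python
class LazyHFDatasetIterator:
    """Iterates a huggingface/datasets dataset and yields Lhotse cuts lazily."""

    def __init__(self, hf_dataset):
        self.hf_dataset = hf_dataset

    def __iter__(self):
        for example in self.hf_dataset:
            # Hypothetical: map the HF example's fields (audio, text, ...)
            # to a Lhotse Cut; the mapping is dataset-specific.
            yield hf_example_to_cut(example)

    def __len__(self):
        return len(self.hf_dataset)
```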
This is an early draft and not a complete solution (but possibly not that far from it).
The issue being solved is that huge corpora, such as the 40k-hour English portion of MLS, cannot currently be handled well by Lhotse, as they consume a lot of time and memory to be read from disk (and to be created, but that will be addressed separately).
Fortunately, there seems to be an elegant way to solve this with not too many changes to the library. Apache Arrow makes it possible to mmap (compressed or uncompressed) JSONL files and iterate over them surprisingly quickly, avoiding both the loading speed and the memory issues. I've tried to explain how it works in the code comments; here I'll show some measurements instead.
The supervision set JSONL is 1 GB (even though it's compressed with gzip) and contains ~11M supervision segments. The "loading" time is ~13s of wall time, which is very low for this size. Checking the length costs nothing, but retrieving a single item is very slow (~5s):
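The original measurement snippet isn't reproduced here; a sketch of how such timings could be taken follows. The lazy-loading entry point and the by-id access below are hypothetical placeholders, not necessarily the API introduced by this PR:

```python
import time

from lhotse import SupervisionSet

t0 = time.time()
# Hypothetical constructor name; the PR's actual entry point may differ.
sups = SupervisionSet.from_jsonl_lazy("supervisions.jsonl.gz")
print(f"load: {time.time() - t0:.1f}s")   # ~13s reported above

print(len(sups))                          # effectively free

t0 = time.time()
seg = sups["some-supervision-id"]         # illustrative single random access, ~5s
print(f"getitem: {time.time() - t0:.1f}s")
```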
On the other hand, iteration is fairly fast, with dicts being deserialized into SupervisionSegments on the fly:
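Roughly what that on-the-fly deserialization means, assuming the mmapped Arrow table from the sketch above (SupervisionSegment.from_dict is Lhotse's standard deserializer; the row dicts are assumed to carry the usual supervision fields):

```python
from lhotse import SupervisionSegment

def iter_segments(mmapped_table):
    # Walk the table batch by batch so only small slices are materialized
    # as Python dicts at any given time.
    for batch in mmapped_table.to_batches():
        for row in batch.to_pylist():
            yield SupervisionSegment.from_dict(row)
```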
The next steps involve adjusting this further to work well with the samplers so that they can leverage batched loads; the overhead coming from mmap shouldn't affect training, as it will happen in the dataloader processes. I think I'll also need to add a utility that speeds up creating the manifests and writes them to disk in a streaming way. Then we should be able to fully support very large corpora.
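As an illustration of the streaming-write idea (not the utility this PR will add), appending manifests to a gzipped JSONL one item at a time avoids holding the whole set in memory:

```python
import gzip
import json

def write_manifests_streaming(items, path):
    """Write an iterable of Lhotse manifest items (anything with .to_dict())
    to a gzipped JSONL file without building the full set in memory."""
    with gzip.open(path, "wt") as f:
        for item in items:
            print(json.dumps(item.to_dict()), file=f)
```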