
Custom binary feature storage format #522

Merged: 13 commits into master on Jan 11, 2022

Conversation

@pzelasko (Collaborator) commented Dec 22, 2021

Since HDF5 might not be the best choice for this project (we'll see how the discussion proceeds in #518), here is an alternative that doesn't use it.

It's a copy of ChunkedLilcomHdf5Writer that uses a plain binary file and stores the compressed array chunks next to each other. For each array, the storage_key keeps its global file offset followed by a list of the relative offsets of its chunks.
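
Roughly, the layout works like this (a minimal sketch with hypothetical names; zlib stands in for the lilcom compression actually used, and this is not the real Lhotse API):

```python
import zlib
import numpy as np

def write_array(f, arr: np.ndarray, chunk_size: int = 100):
    """Append `arr` (assumed C-contiguous float32) to the open binary file `f`
    in compressed chunks, written back-to-back. Returns the "storage key":
    the array's global offset in the file plus the relative offset of each
    chunk boundary within the array's region."""
    global_offset = f.tell()
    chunk_offsets = [0]
    for start in range(0, arr.shape[0], chunk_size):
        blob = zlib.compress(arr[start:start + chunk_size].tobytes())
        f.write(blob)
        chunk_offsets.append(chunk_offsets[-1] + len(blob))
    return global_offset, chunk_offsets

def read_array(f, global_offset, chunk_offsets, num_features: int):
    """Read the full array back. With the per-chunk offsets we could equally
    seek to and decode only a sub-range of chunks."""
    f.seek(global_offset)
    raw = f.read(chunk_offsets[-1])  # chunk_offsets[-1] == total byte length
    chunks = [
        np.frombuffer(zlib.decompress(raw[b:e]), dtype=np.float32)
        for b, e in zip(chunk_offsets, chunk_offsets[1:])
    ]
    return np.concatenate(chunks).reshape(-1, num_features)
```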

It seems to work on par with the HDF5 storage we had so far: practically identical file sizes, and roughly the same iteration speed when reading shuffled feats.

@danpovey you have experience building formats like these in Kaldi (ark); any thoughts here?

@danpovey (Collaborator)

Oh, this seems pretty cool.
So it's just about removing the HDF5 dependency?
Do we use HDF5 much otherwise?

@pzelasko (Collaborator, Author)

It’s about providing the same functionality without incurring a potentially large memory cost due to HDF5's opaque caching schemes. So far at least two groups have noticed CPU RAM growing steadily during each epoch, and I pinpointed the cause as HDF5-related.

I’m going to keep the h5py dependency for now to provide backward compatibility, so users can keep reading the features they have already precomputed. I may deprecate the HDF5 writers once I’ve convinced myself that the new storage type is stable.

@danpovey (Collaborator) commented Dec 24, 2021 via email

@pzelasko mentioned this pull request on Dec 29, 2021
@danpovey (Collaborator) commented Jan 9, 2022

Piotr, do you think it's possible that, as a quick fix for the constantly growing memory in HDF5, we could periodically just close all filehandles, e.g. every 100 batches? I'm not sure how easy that would be to do.
It might fix the memory-growth issue without requiring users to re-dump their data.

@pzelasko (Collaborator, Author) commented Jan 9, 2022

> Piotr, do you think it's possible that, as a quick fix for the constantly growing memory in HDF5, we could periodically just close all filehandles, e.g. every 100 batches? I'm not sure how easy that would be to do.
> It might fix the memory-growth issue without requiring users to re-dump their data.

@danpovey I wrote an example showing how to do it, though I'm not sure it makes sense to merge it; for now I've marked it as not-for-merge:
#527
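
Roughly, the idea there has the following shape (a hypothetical sketch, not the actual #527 code; class and method names are made up):

```python
import h5py

class CachingHdf5Reader:
    """Cache open HDF5 files for speed, but close them all every
    `reset_interval` reads so HDF5's internal caches cannot grow
    unboundedly over an epoch."""

    def __init__(self, reset_interval: int = 100):
        self.reset_interval = reset_interval
        self._files = {}
        self._num_reads = 0

    def read(self, path: str, key: str):
        self._num_reads += 1
        if self._num_reads % self.reset_interval == 0:
            self.close_all()
        if path not in self._files:
            self._files[path] = h5py.File(path, "r")
        # Indexing with [()] materializes the whole dataset as a numpy array.
        return self._files[path][key][()]

    def close_all(self):
        for f in self._files.values():
            f.close()
        self._files.clear()
```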

@danpovey (Collaborator) commented Jan 9, 2022

There is a conflict.

I suppose we could test this (but I don't know whether we track memory usage, so the only clear signal would be a training process not getting killed, which is unpredictable anyway).

IMO there wouldn't be much downside to periodically closing filehandles, as long as the interval is long enough. I doubt re-opening files once every 100 batches will really impact performance much; in any case, the OS can do its own caching in memory.

@pzelasko (Collaborator, Author) commented Jan 9, 2022

I fixed the conflict on this branch, but the fix for HDF5 memory usage is in another PR, here: #527

When I was debugging these issues, I tested this solution with a small script that just iterates the dataloader and does nothing with the data. The more often the files are closed, the less memory is used. The worst-case I/O slowdown was 50%, when closing the files after every mini-batch; with larger intervals, performance didn't seem to degrade drastically.
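
Roughly, the test loop looked like this (a sketch assuming a psutil-based RSS readout; not the exact script):

```python
import os
import psutil

def probe_memory(dataloader, log_every: int = 100):
    """Iterate the dataloader, discard the batches, and log resident memory
    so we can watch whether RSS keeps growing across the epoch."""
    proc = psutil.Process(os.getpid())
    for step, _batch in enumerate(dataloader):
        if step % log_every == 0:
            rss_mib = proc.memory_info().rss / 2**20
            print(f"step {step}: RSS = {rss_mib:.1f} MiB")
```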

@danpovey (Collaborator) commented Jan 9, 2022

Ah. So it seems to me that #527 could be merged with essentially no downside??
I know it's an ugly thing, but it seems to me that it resolves the problem and doesn't have negative effects.

@pzelasko (Collaborator, Author) commented Jan 9, 2022

When you put it like that, it makes sense 😉 I'll add a comment explaining why this code is there, and then merge.

@pzelasko changed the title from "[draft] Custom binary feature storage format" to "Custom binary feature storage format" on Jan 11, 2022
@pzelasko (Collaborator, Author)

This thing seems to work with no memory/size/speed issues, which has now been validated by someone other than me; merging.

@pzelasko added this to the v1.0 milestone on Jan 11, 2022
@pzelasko merged commit 84f3d44 into master on Jan 11, 2022