Custom binary feature storage format #522
Conversation
Oh, this seems pretty cool. |
It’s about providing the same functionality without incurring a potentially large memory cost due to HDF5’s opaque caching schemes. So far at least two groups have noticed constantly growing CPU RAM during each epoch, and I pinpointed that to be HDF5-related. I’m going to keep the h5py dependency for now to provide backward compatibility, so users will be able to use the features they have already precomputed. I might deprecate the HDF5 writers once I’ve convinced myself that the new storage type is stable. |
Great! |
Piotr, do you think it's possible that as a quick fix for the constantly growing memory in hdf5, we could periodically just close all filehandles, e.g. every 100 batches? I'm not sure how easy this might be to do. |
@danpovey I wrote an example showing how to do it; I'm not sure it makes sense to merge this, though, so for now I've marked it as not-for-merge. |
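For context, a minimal sketch of that idea, assuming a hypothetical module-level handle cache (this is not the code from the linked example, whose details aren't shown here):

```python
import h5py

# Hypothetical cache: path -> open h5py.File handle (read-only).
_open_files: dict = {}

def get_file(path: str) -> h5py.File:
    """Return a cached handle, opening the file on first use."""
    if path not in _open_files:
        _open_files[path] = h5py.File(path, "r")
    return _open_files[path]

def close_all_files() -> None:
    """Close every cached handle so HDF5 frees its internal caches."""
    for f in _open_files.values():
        f.close()
    _open_files.clear()

def train(dataloader, train_step, close_every: int = 100) -> None:
    """Run training, flushing HDF5 handles every `close_every` batches."""
    for step, batch in enumerate(dataloader):
        train_step(batch)
        if step > 0 and step % close_every == 0:
            # Re-opening on the next read is cheap; the OS page cache
            # keeps recently read data warm anyway.
            close_all_files()
```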
There is a conflict. I suppose we could test this (though I don't know if we track memory usage, so the only clear signal would be a training process not getting killed, which is unpredictable anyway). IMO there wouldn't be much downside to periodically closing file handles, as long as the period is long enough; I doubt re-opening files once every 100 batches will really impact performance much, and the OS can still do its caching in memory. |
I fixed the conflict on this branch, but the fix for HDF5 memory is in another PR, here: #527. When I was debugging these issues, I tested this solution with a small script that just iterates the dataloader and does nothing with the data. The more often the files are closed, the less memory seems to be used. The worst-case I/O slowdown was 50%, when closing the files after every single mini-batch; with larger intervals it didn't seem to degrade drastically. |
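A rough reconstruction of that kind of probe script (the original isn't shown here; `psutil` and the dataloader argument are assumptions):

```python
import os
import psutil  # assumed available, for reading resident set size

def probe_memory(dataloader, report_every: int = 100) -> None:
    """Iterate the dataloader, discard the data, and print RSS growth."""
    proc = psutil.Process(os.getpid())
    for step, _batch in enumerate(dataloader):
        if step % report_every == 0:
            print(f"step={step} rss={proc.memory_info().rss / 1e6:.1f} MB")
```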
Ah. So it seems to me that #527 could be merged with essentially no downside?? |
When you put it like that, it makes sense 😉 I'll add a comment explaining why this code is here, then merge. |
This thing seems to work with no memory/size/speed issues, which has now been validated by someone other than me; merging. |
Since HDF5 might not be the best choice for this project (we'll see how the discussion proceeds in #518), here is an alternative that doesn't use it.
It's a copy of `ChunkedLilcomHdf5Writer` that uses a plain binary file and stores the compressed array chunks next to each other. The `storage_key` keeps the global file offset for each array, followed by a list of chunk-relative offsets. Seems to work on par with the HDF5 storage we had so far: practically identical file sizes, and roughly the same iteration speed when reading shuffled feats.
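A minimal sketch of the layout described above (hypothetical names, not the actual writer/reader API; assumes lilcom's `compress`/`decompress` bytes interface):

```python
import lilcom  # compresses a numpy array to bytes and back
import numpy as np

def write_array(f, arr: np.ndarray, chunk_frames: int = 100) -> str:
    """Append `arr` in compressed chunks; return the offsets-based key."""
    global_offset = f.tell()
    chunk_offsets = [0]  # cumulative byte offsets, relative to the array start
    for begin in range(0, arr.shape[0], chunk_frames):
        data = lilcom.compress(arr[begin : begin + chunk_frames])
        f.write(data)
        chunk_offsets.append(chunk_offsets[-1] + len(data))
    # e.g. "1024,0,350,712" -> array starts at byte 1024 of the file,
    # chunks occupy relative byte ranges [0, 350) and [350, 712).
    return ",".join(map(str, [global_offset] + chunk_offsets))

def read_array(f, storage_key: str) -> np.ndarray:
    """Seek to each chunk and decompress them back into one array."""
    offsets = list(map(int, storage_key.split(",")))
    global_offset, rel = offsets[0], offsets[1:]
    chunks = []
    for begin, end in zip(rel[:-1], rel[1:]):
        f.seek(global_offset + begin)
        chunks.append(lilcom.decompress(f.read(end - begin)))
    return np.concatenate(chunks, axis=0)
```

Keeping all the offsets in the key means the data file itself needs no index: reading an array is just one seek and one read per chunk.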
@danpovey you have experience with building formats like these in Kaldi (`ark`), any thoughts here?