
Custom binary feature storage format #522

Merged: 13 commits into master on Jan 11, 2022

Conversation

@pzelasko (Collaborator) commented Dec 22, 2021

Since HDF5 might not be the best choice for this project (we'll see how the discussion proceeds in #518), here is an alternative that doesn't use it.

It's a copy of ChunkedLilcomHdf5Writer that uses a plain binary file and stores the compressed array chunks next to each other. For each array, the storage_key keeps its global file offset followed by a list of the relative offsets of its chunks.
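
Roughly, the layout works like this (a minimal sketch with hypothetical names; zlib stands in for the lilcom compression actually used, and this is not the real Lhotse API):

```python
import zlib
import numpy as np

def write_array(f, arr: np.ndarray, chunk_size: int = 100):
    """Append `arr` (assumed C-contiguous float32) to the open binary file `f`
    in compressed chunks, written back-to-back. Returns the "storage key":
    the array's global offset in the file plus the relative offset of each
    chunk boundary within the array's region."""
    global_offset = f.tell()
    chunk_offsets = [0]
    for start in range(0, arr.shape[0], chunk_size):
        blob = zlib.compress(arr[start:start + chunk_size].tobytes())
        f.write(blob)
        chunk_offsets.append(chunk_offsets[-1] + len(blob))
    return global_offset, chunk_offsets

def read_array(f, global_offset, chunk_offsets, num_features: int):
    """Read the full array back. With the per-chunk offsets we could equally
    seek to and decode only a sub-range of chunks."""
    f.seek(global_offset)
    raw = f.read(chunk_offsets[-1])  # chunk_offsets[-1] == total byte length
    chunks = [
        np.frombuffer(zlib.decompress(raw[b:e]), dtype=np.float32)
        for b, e in zip(chunk_offsets, chunk_offsets[1:])
    ]
    return np.concatenate(chunks).reshape(-1, num_features)
```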

It seems to work on par with the HDF5 storage we had so far: practically identical file sizes, and roughly the same iteration speed when reading shuffled feats.

@danpovey you have experience building formats like these in Kaldi (ark); any thoughts here?

@danpovey (Collaborator)

Oh, this seems pretty cool.
So it's just about removing the HDF5 dependency?
Do we use HDF5 much otherwise?

@pzelasko (Collaborator, Author)

It’s about providing the same functionality without incurring a potentially large memory cost due to HDF5's opaque caching schemes. So far at least two groups have noticed CPU RAM growing steadily during each epoch, and I pinpointed the cause as HDF5-related.

I’m going to keep the h5py dependency for now to provide backward compatibility, so users can keep reading the features they have already precomputed. I may deprecate the HDF5 writers once I’ve convinced myself that the new storage type is stable.

@danpovey (Collaborator) commented Dec 24, 2021 via email

@pzelasko mentioned this pull request on Dec 29, 2021
@danpovey (Collaborator) commented Jan 9, 2022

Piotr, do you think it's possible that, as a quick fix for the constantly growing memory in HDF5, we could periodically just close all filehandles, e.g. every 100 batches? I'm not sure how easy that would be to do.
It might fix the memory-growth issue without requiring users to re-dump their data.

@pzelasko (Collaborator, Author) commented Jan 9, 2022

> Piotr, do you think it's possible that, as a quick fix for the constantly growing memory in HDF5, we could periodically just close all filehandles, e.g. every 100 batches? I'm not sure how easy that would be to do.
> It might fix the memory-growth issue without requiring users to re-dump their data.

@danpovey I wrote an example showing how to do it, though I'm not sure it makes sense to merge it; for now I've marked it as not-for-merge:
#527
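
Roughly, the idea there has the following shape (a hypothetical sketch, not the actual #527 code; class and method names are made up):

```python
import h5py

class CachingHdf5Reader:
    """Cache open HDF5 files for speed, but close them all every
    `reset_interval` reads so HDF5's internal caches cannot grow
    unboundedly over an epoch."""

    def __init__(self, reset_interval: int = 100):
        self.reset_interval = reset_interval
        self._files = {}
        self._num_reads = 0

    def read(self, path: str, key: str):
        self._num_reads += 1
        if self._num_reads % self.reset_interval == 0:
            self.close_all()
        if path not in self._files:
            self._files[path] = h5py.File(path, "r")
        # Indexing with [()] materializes the whole dataset as a numpy array.
        return self._files[path][key][()]

    def close_all(self):
        for f in self._files.values():
            f.close()
        self._files.clear()
```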

@danpovey (Collaborator) commented Jan 9, 2022

There is a conflict.

I suppose we could test this (but I don't know whether we track memory usage, so the only clear signal would be a training process not getting killed, which is unpredictable anyway).

IMO there wouldn't be much downside to periodically closing filehandles, as long as the interval is long enough. I doubt re-opening files once every 100 batches will really impact performance much; in any case, the OS can do its own caching in memory.

@pzelasko (Collaborator, Author) commented Jan 9, 2022

I fixed the conflict on this branch, but the fix for HDF5 memory usage is in another PR, here: #527

When I was debugging these issues, I tested this solution with a small script that just iterates the dataloader and does nothing with the data. The more often the files are closed, the less memory is used. The worst-case I/O slowdown was 50%, when closing the files after every mini-batch; with larger intervals, performance didn't seem to degrade drastically.
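
Roughly, the test loop looked like this (a sketch assuming a psutil-based RSS readout; not the exact script):

```python
import os
import psutil

def probe_memory(dataloader, log_every: int = 100):
    """Iterate the dataloader, discard the batches, and log resident memory
    so we can watch whether RSS keeps growing across the epoch."""
    proc = psutil.Process(os.getpid())
    for step, _batch in enumerate(dataloader):
        if step % log_every == 0:
            rss_mib = proc.memory_info().rss / 2**20
            print(f"step {step}: RSS = {rss_mib:.1f} MiB")
```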

@danpovey (Collaborator) commented Jan 9, 2022

Ah. So it seems to me that #527 could be merged with essentially no downside??
I know it's an ugly thing, but it seems to me that it resolves the problem and doesn't have negative effects.

@pzelasko (Collaborator, Author) commented Jan 9, 2022

When you put it like that, it makes sense 😉 I'll add a comment explaining why this code is there, and then merge.

@pzelasko changed the title from "[draft] Custom binary feature storage format" to "Custom binary feature storage format" on Jan 11, 2022
@pzelasko (Collaborator, Author)

This thing seems to work with no memory/size/speed issues, which has now been validated by someone other than me; merging.

@pzelasko added this to the v1.0 milestone on Jan 11, 2022
@pzelasko merged commit 84f3d44 into master on Jan 11, 2022