Possible workaround for HDF5 growing memory issue #527

Merged: 5 commits merged into master on Jan 10, 2022

Conversation

@pzelasko (Collaborator) commented Jan 9, 2022

Intended as a workaround for growing memory issues when using HDF5 with larger datasets.

@@ -92,6 +93,10 @@ def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]
         """
         validate_for_asr(cuts)
 
+        if self.batch_counter > 0 and self.batch_counter % 100 == 0:
+            close_cached_file_handles()
Contributor (review comment on the added lines):

Shall we reset self.batch_counter in case it overflows if training runs long enough?

Collaborator (reply):

Python uses arbitrary-precision integers that can never overflow, I believe.

@pzelasko changed the title from "[do-not-merge] Example showing how to close file handles every N batches" to "Possible workaround for HDF5 growing memory issue" on Jan 10, 2022
@pzelasko (Collaborator, Author) commented:

I cleaned up the code and added some docs. I'd appreciate it if somebody could test it out on a large-scale training run and confirm that it does help; I was only able to test on small-to-medium-sized problems so far.
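
For anyone who wants to see the full shape of the workaround without opening the diff, here is a minimal, self-contained sketch of the pattern. The dataset class, the close_every_n_batches parameter, where the counter is incremented, and the import path for close_cached_file_handles are my assumptions for illustration, not code copied from this PR:

    from typing import Dict, List, Union

    import torch
    from lhotse import CutSet
    # Assumed import path; check where close_cached_file_handles actually
    # lives in your Lhotse version.
    from lhotse.caching import close_cached_file_handles


    class IllustrativeAsrDataset(torch.utils.data.Dataset):
        """Hypothetical dataset showing the periodic handle-closing workaround."""

        def __init__(self, close_every_n_batches: int = 100):
            self.batch_counter = 0
            self.close_every_n_batches = close_every_n_batches

        def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]:
            # Every N batches, close the cached HDF5 file handles so that the
            # memory held by their internal caches can be released.
            if self.batch_counter > 0 and self.batch_counter % self.close_every_n_batches == 0:
                close_cached_file_handles()
            self.batch_counter += 1
            ...  # load features/audio for `cuts` and assemble the batch as usual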

@pzelasko merged commit 79867a5 into master on Jan 10, 2022
@pzelasko added this to the v1.0 milestone on Jan 14, 2022
@pzelasko (Collaborator, Author) commented:

In case anybody reads this: this workaround doesn't completely solve the issue; the memory keeps growing, just more slowly. There also seem to be random spikes in memory usage around the time the HDF5 files are opened and closed.

I generally don't advise using LilcomHdf5Writer for data larger than ~1k hours, as it will likely blow up the memory at some point.

@desh2608 (Collaborator) commented:

@pzelasko what is the recommended storage type for large data (5-10k hours)?

@pzelasko (Collaborator, Author) commented:

The default one (LilcomChunkyWriter) should work as well as HDF5 but without memory leaks.
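
As a rough sketch of what switching to it can look like when extracting features (the paths and the Fbank extractor are placeholders, and the keyword arguments should be double-checked against your Lhotse version):

    from lhotse import CutSet, Fbank, LilcomChunkyWriter

    cuts = CutSet.from_file("data/cuts.jsonl.gz")  # placeholder manifest path
    cuts = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage_path="data/fbank",        # features stored as chunked lilcom archives
        storage_type=LilcomChunkyWriter,  # the default writer; no HDF5 involved
        num_jobs=4,
    )
    cuts.to_file("data/cuts_with_feats.jsonl.gz")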

@cyrta commented Feb 25, 2022:

And what about cloud storage?

@pzelasko (Collaborator, Author) commented:

The recent PRs with WebDataset integration work really well with that; please refer to #582, #599, and #602. You'll need to read the documentation in the code because we don't have a high-level tutorial for these workflows yet.
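
A rough sketch of that workflow, assuming the API introduced in those PRs looks roughly like export_to_webdataset plus CutSet.from_webdataset (both names and their arguments are assumptions here; verify against the PRs and the in-code documentation):

    from lhotse import CutSet
    # Assumed import path and signature; see #582, #599, and #602 for the actual API.
    from lhotse.dataset.webdataset import export_to_webdataset

    cuts = CutSet.from_file("data/cuts_with_feats.jsonl.gz")

    # Export the cuts (with their features/audio) into sharded tar archives,
    # which can be uploaded to object storage and streamed sequentially.
    export_to_webdataset(cuts, output_path="shards/cuts-%06d.tar", shard_size=1000)

    # Later, read the shards back lazily (from local disk or URLs).
    cuts = CutSet.from_webdataset("shards/cuts-%06d.tar")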

@pzelasko (Collaborator, Author) commented:

... Alternatively, there is also LilcomUrlWriter for features and AudioSource(type="url", ...) for audio, which you can use with any cloud storage supported by the smart_open library. It will be much less I/O-efficient because it spawns a new connection for each item, but for some types of use cases it might be viable. A WebDataset-based flow will be roughly 50-100x faster, though.
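
A minimal sketch of that URL-based route (the writer's name is taken from the comment above, but its exact spelling, import path, and arguments are assumptions to verify against your Lhotse version; the S3 bucket is a placeholder):

    from lhotse import CutSet, Fbank
    # Assumed import path/spelling for the URL-backed writer mentioned above.
    from lhotse.features.io import LilcomUrlWriter

    cuts = CutSet.from_file("data/cuts.jsonl.gz")
    cuts = cuts.compute_and_store_features(
        extractor=Fbank(),
        storage_path="s3://my-bucket/features",  # any URL scheme that smart_open supports
        storage_type=LilcomUrlWriter,
    )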
