
More efficient get_seq_order_for_epoch() #568

Merged
merged 8 commits into rwth-i6:master on Dec 22, 2021

Conversation

patrick-wilken
Contributor

When training on huge datasets (e.g. 100M sequences), storing a full list of indices for the sequence order uses a significant amount of memory, especially when using a plain Python list rather than a numpy array. With default sequence ordering this can be avoided by using a range instead. The full list is now only created after applying the partition epoch, which usually means orders of magnitude fewer sequences.
I also made creating the sequence orderings a bit faster by using numpy. Again, with huge datasets that can make a difference. The random number sequence will be different, so to speak, before and after this commit; not sure whether this is a problem.
Unfortunately, the code now sometimes goes back and forth between lists and arrays, but emulating list.sort(key=...) with numpy wouldn't be too nice either.
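The idea can be sketched roughly like this (a minimal illustration with made-up names, not the actual RETURNN code): keep "default" and "reverse" orderings as lazy range objects with O(1) memory, and only build a real numpy array when shuffling is actually requested.

```python
import numpy as np


def get_seq_order(num_seqs, ordering="default", seed=1):
    """Return the sequence order as a cheap typing.Sequence[int].

    Illustrative sketch only: function and parameter names are hypothetical.
    """
    if ordering == "default":
        return range(num_seqs)  # O(1) memory instead of an O(n) list
    if ordering == "reverse":
        return range(num_seqs - 1, -1, -1)  # also O(1) memory
    if ordering == "random":
        rng = np.random.RandomState(seed)
        # numpy permutation is much faster than shuffling a Python list
        return rng.permutation(num_seqs)
    raise ValueError("unknown ordering %r" % (ordering,))
```

All three return values support `len()`, indexing and iteration, so callers that only use the `typing.Sequence` interface need not care which representation they got.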

@patrick-wilken patrick-wilken requested review from albertz and a team as code owners August 9, 2021 14:37
@patrick-wilken
Contributor Author

The reason why I'm working on this is that I started using multi-GPU training. In our current implementation the dataset is loaded once per GPU, which makes reducing the memory footprint even more important.

@patrick-wilken patrick-wilken marked this pull request as draft September 3, 2021 13:11
@patrick-wilken
Contributor Author

Converted to draft because I overlooked that what I actually need is for get_seq_order_for_epoch() not to create the whole sequence list at all when a range object would do. It works for me when I remove the seq_index = list(seq_index) line, but that will not work in general.

My use case is: I have several huge HDF files; I create one HDFDataset per file and then combine them via CombinedDataset with sampling. I cannot afford to store sequence orderings for all the full datasets, as the total number of sequences is so high that it consumes too much memory. So I want to use "default" sequence ordering for the HDFDatasets, which can be represented by a Python range, sample a constant number of sequences from each of the HDFDatasets per epoch, and only at the CombinedDataset level create a sequence ordering list with e.g. 1M sequences.

I have thought about writing a SequenceOrdering class which would have the interface of a list, but internally would be more memory efficient where possible, mainly by using range for "default" and "reverse" sequence orderings. But somehow this seems like too much code overhead for such a simple thing. Maybe allowing get_seq_order_for_epoch() to return a range is easier after all...

@albertz
Member

albertz commented Sep 3, 2021

I haven't really looked into this too much yet. I just wanted to point out two ways you could do shuffling on huge data:

  • The way it is usually done in TensorFlow, on-the-fly. You have one (infinite) stream of data, and then a buffer, and keep shuffling the buffer. See our discussion on the TF dataset for this. It is an approximation but it works on infinitely large datasets.
  • You could just do random sampling on the fly. So whenever some new seq is requested, you randomly generate one index. So one epoch will not guarantee that you visit all seqs. But after iterating a couple of epochs, this approximation should also be fine.
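The first bullet can be sketched as a small generator (an illustrative approximation in the spirit of tf.data's shuffle buffer, not code from this PR): items stream into a fixed-size buffer, and each yielded item is drawn at random from that buffer.

```python
import random


def buffer_shuffle(stream, buffer_size=1000, seed=42):
    """Approximately shuffle an (arbitrarily long) iterable using a
    fixed-size buffer. Illustrative sketch; names are hypothetical."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # swap a random element to the end and yield it
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # drain the remaining buffer at end of stream
    rng.shuffle(buf)
    yield from buf
```

Memory use is bounded by `buffer_size` regardless of the stream length; the trade-off is that items can only move at most roughly `buffer_size` positions, so it is an approximation of a true shuffle.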

@patrick-wilken
Contributor Author

patrick-wilken commented Sep 3, 2021

Yes, what I'm doing is basically the first way, except that HDFDataset is not iterator-like. However, it has no overhead in representing an "almost infinite" map-like dataset, because nothing is loaded on initialization.
With a few exceptions... 😄: the sequence ordering and a list of sequence start indices (which I was also able to get rid of by storing them inside the file instead of the sequence lengths; topic for a different PR...). So I now have an HDFDataset implementation where the memory consumption does not depend in any way on the total number of sequences in the file*, so there is no limit to the dataset size. 😎
* assuming default ordering

@patrick-wilken
Contributor Author

Actually, I think at no point in the code do we use anything other than len() and __getitem__ on the return value of get_seq_order_for_epoch(), so returning a range should already work. However, we would have to change the type docs to collections.abc.Sequence in many places...
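For illustration, Python's built-in range already satisfies that interface: it is registered as a collections.abc.Sequence and supports len(), indexing and iteration in constant memory, so it can stand in for a list wherever only those operations are used.

```python
from collections.abc import Sequence

seq_order = range(1000000)  # constant memory, regardless of length
assert isinstance(seq_order, Sequence)  # range is a registered Sequence
assert len(seq_order) == 1000000        # supports len()
assert seq_order[12345] == 12345        # supports __getitem__
assert next(iter(seq_order)) == 0       # supports iteration
```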

@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from 5ac3d99 to 161a871 Compare October 14, 2021 17:24
@patrick-wilken
Contributor Author

This should be acceptable now. The output of get_seq_order_for_epoch() is now typed as typing.Sequence[int]; it can currently be a list, a numpy array or a range, depending on which is the most efficient representation. In all our code we only use the typing.Sequence interface (__getitem__, __len__ and iteration) on the return value, except for one place in CachedDataset where the equality operator was used, which did not work for numpy arrays.
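The equality pitfall mentioned above can be demonstrated in a few lines (illustrative, not the actual CachedDataset code): == on numpy arrays is element-wise and returns an array, so code expecting a single boolean breaks; np.array_equal is the usual replacement.

```python
import numpy as np

seq_order = np.array([2, 0, 1])

# With lists, == compares contents and yields one bool:
assert [2, 0, 1] == [2, 0, 1]

# With numpy arrays, == is element-wise and yields an array of bools:
cmp = seq_order == np.array([2, 0, 1])
assert cmp.tolist() == [True, True, True]

# bool(cmp) would raise "ValueError: The truth value of an array ... is
# ambiguous"; np.array_equal gives the single-bool comparison instead:
assert np.array_equal(seq_order, [2, 0, 1])
assert not np.array_equal(seq_order, [0, 1, 2])
```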

@patrick-wilken patrick-wilken marked this pull request as ready for review October 14, 2021 17:33
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from 161a871 to d398f9e Compare October 14, 2021 17:40
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from d398f9e to bf69c1c Compare October 14, 2021 18:42
Review comment on returnn/datasets/basic.py (outdated, resolved)
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from bf69c1c to 39d5c6f Compare October 20, 2021 13:50
@patrick-wilken
Contributor Author

I kind of forgot about this PR. @JackTemaki, I made the requested changes, so please update the status.

Four review comments on returnn/datasets/basic.py (outdated, resolved)
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from c6b863a to 1c33c22 Compare December 14, 2021 11:42
Member

@albertz albertz left a comment


Sorry for the delay. I think it's fine now.

@albertz
Copy link
Member

albertz commented Dec 22, 2021

Merging now, as I think @JackTemaki's concerns were also addressed (and he is currently on vacation).

@albertz albertz merged commit ed0b381 into rwth-i6:master Dec 22, 2021