
More efficient get_seq_order_for_epoch() #568

Merged
merged 8 commits into rwth-i6:master on Dec 22, 2021

Conversation

patrick-wilken
Contributor

When training on huge datasets (e.g. 100M sequences), storing a full list of indices for the sequence order uses a significant amount of memory, especially when using a plain Python list rather than a numpy array. With default sequence ordering this can be avoided by using a range instead. The full list is now only created after applying the partition epoch, which usually means orders of magnitude fewer sequences.
I also made creating the sequence orderings a bit faster by using numpy. Again, with huge datasets that can make a difference. The random number sequence will be different, so to speak, before and after this commit; not sure whether this is a problem.
Unfortunately, the code now sometimes goes back and forth between lists and arrays, but emulating list.sort(key=...) with numpy wouldn't be too nice either.
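The idea can be sketched roughly like this (a minimal illustration with made-up names, not the actual RETURNN code): keep "default" and "reverse" orderings as lazy range objects with O(1) memory, and only build a real numpy array when shuffling is actually requested.

```python
import numpy as np


def get_seq_order(num_seqs, ordering="default", seed=1):
    """Return the sequence order as a cheap typing.Sequence[int].

    Illustrative sketch only: function and parameter names are hypothetical.
    """
    if ordering == "default":
        return range(num_seqs)  # O(1) memory instead of an O(n) list
    if ordering == "reverse":
        return range(num_seqs - 1, -1, -1)  # also O(1) memory
    if ordering == "random":
        rng = np.random.RandomState(seed)
        # numpy permutation is much faster than shuffling a Python list
        return rng.permutation(num_seqs)
    raise ValueError("unknown ordering %r" % (ordering,))
```

All three return values support `len()`, indexing and iteration, so callers that only use the `typing.Sequence` interface need not care which representation they got.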

@patrick-wilken patrick-wilken requested review from albertz and a team as code owners August 9, 2021 14:37
@patrick-wilken
Contributor Author

The reason why I'm working on this is that I started using multi-GPU training. In our current implementation the dataset is loaded once per GPU, which makes reducing the memory footprint even more important.

@patrick-wilken patrick-wilken marked this pull request as draft September 3, 2021 13:11
@patrick-wilken
Contributor Author

Converted to draft because I overlooked that what I actually need is for get_seq_order_for_epoch() not to create the whole sequence list at all when a range object would do. It works for me when I remove the seq_index = list(seq_index) line, but that will not work in general.

My use case is: I have several huge HDF files; I create one HDFDataset per file and then combine them via CombinedDataset with sampling. I cannot afford to store sequence orderings for all the full datasets, as the total number of sequences is so high that it consumes too much memory. So I want to use "default" sequence ordering for the HDFDatasets, which can be represented by a Python range, sample a constant number of sequences from each of the HDFDatasets per epoch, and only at the CombinedDataset level create a sequence ordering list with e.g. 1M sequences.

I have thought about writing a SequenceOrdering class which would have the interface of a list, but internally would be more memory efficient where possible, mainly by using range for "default" and "reverse" sequence orderings. But somehow this seems like too much code overhead for such a simple thing. Maybe allowing get_seq_order_for_epoch() to return a range is easier after all...

@albertz
Member

albertz commented Sep 3, 2021

I haven't really looked into this too much yet. I just wanted to point out two ways you could do shuffling on huge data:

  • The way it is usually done in TensorFlow, on-the-fly. You have one (infinite) stream of data, and then a buffer, and keep shuffling the buffer. See our discussion on the TF dataset for this. It is an approximation but it works on infinitely large datasets.
  • You could just do random sampling on the fly. So whenever some new seq is requested, you randomly generate one index. So one epoch will not guarantee that you visit all seqs. But after iterating a couple of epochs, this approximation should also be fine.
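The first bullet can be sketched as a small generator (an illustrative approximation in the spirit of tf.data's shuffle buffer, not code from this PR): items stream into a fixed-size buffer, and each yielded item is drawn at random from that buffer.

```python
import random


def buffer_shuffle(stream, buffer_size=1000, seed=42):
    """Approximately shuffle an (arbitrarily long) iterable using a
    fixed-size buffer. Illustrative sketch; names are hypothetical."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # swap a random element to the end and yield it
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # drain the remaining buffer at end of stream
    rng.shuffle(buf)
    yield from buf
```

Memory use is bounded by `buffer_size` regardless of the stream length; the trade-off is that items can only move at most roughly `buffer_size` positions, so it is an approximation of a true shuffle.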

@patrick-wilken
Contributor Author

patrick-wilken commented Sep 3, 2021

Yes, what I'm doing is basically the first way, except that HDFDataset is not iterator-like. However, it has no overhead in representing an "almost infinite" map-like dataset, because nothing is loaded on initialization.
With a few exceptions... 😄: the sequence ordering and a list of sequence start indices (which I was also able to get rid of by storing them inside the file instead of the sequence lengths; topic for a different PR...). So I now have an HDFDataset implementation where the memory consumption does not depend in any way on the total number of sequences in the file*, so there is no limit to the dataset size. 😎
* assuming default ordering

@patrick-wilken
Contributor Author

Actually, I think at no point in the code do we use anything other than len() and __getitem__ on the return value of get_seq_order_for_epoch(), so returning a range should already work. However, we would have to change the type docs to collections.abc.Sequence in many places...
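For illustration, Python's built-in range already satisfies that interface: it is registered as a collections.abc.Sequence and supports len(), indexing and iteration in constant memory, so it can stand in for a list wherever only those operations are used.

```python
from collections.abc import Sequence

seq_order = range(1000000)  # constant memory, regardless of length
assert isinstance(seq_order, Sequence)  # range is a registered Sequence
assert len(seq_order) == 1000000        # supports len()
assert seq_order[12345] == 12345        # supports __getitem__
assert next(iter(seq_order)) == 0       # supports iteration
```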

@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from 5ac3d99 to 161a871 Compare October 14, 2021 17:24
@patrick-wilken
Contributor Author

This should be acceptable now. The output of get_seq_order_for_epoch() is now typed as typing.Sequence[int]; it can currently be a list, a numpy array or a range, depending on which is the most efficient representation. In all our code we only use the typing.Sequence interface (__getitem__, __len__ and iteration) on the return value, except for one place in CachedDataset where the equality operator was used, which did not work for numpy arrays.
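The equality pitfall mentioned above can be demonstrated in a few lines (illustrative, not the actual CachedDataset code): == on numpy arrays is element-wise and returns an array, so code expecting a single boolean breaks; np.array_equal is the usual replacement.

```python
import numpy as np

seq_order = np.array([2, 0, 1])

# With lists, == compares contents and yields one bool:
assert [2, 0, 1] == [2, 0, 1]

# With numpy arrays, == is element-wise and yields an array of bools:
cmp = seq_order == np.array([2, 0, 1])
assert cmp.tolist() == [True, True, True]

# bool(cmp) would raise "ValueError: The truth value of an array ... is
# ambiguous"; np.array_equal gives the single-bool comparison instead:
assert np.array_equal(seq_order, [2, 0, 1])
assert not np.array_equal(seq_order, [0, 1, 2])
```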

@patrick-wilken patrick-wilken marked this pull request as ready for review October 14, 2021 17:33
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from 161a871 to d398f9e Compare October 14, 2021 17:40
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from d398f9e to bf69c1c Compare October 14, 2021 18:42
Review comment on returnn/datasets/basic.py (outdated, resolved)
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from bf69c1c to 39d5c6f Compare October 20, 2021 13:50
@patrick-wilken
Contributor Author

I kind of forgot about this PR. @JackTemaki, I made the requested changes, so please update the status.

Four review comments on returnn/datasets/basic.py (outdated, resolved)
@patrick-wilken patrick-wilken force-pushed the feature/efficient_seq_order branch from c6b863a to 1c33c22 Compare December 14, 2021 11:42
Member

@albertz albertz left a comment


Sorry for the delay. I think it's fine now.

@albertz
Copy link
Member

albertz commented Dec 22, 2021

Merging now, as I think @JackTemaki's concerns were also addressed (and he is currently on vacation).

@albertz albertz merged commit ed0b381 into rwth-i6:master Dec 22, 2021