More efficient get_seq_order_for_epoch() #568
Conversation
The reason why I'm working on this is that I started using multi-GPU training. In our current implementation, the dataset is loaded once per GPU, which makes reducing the memory footprint even more important.
Converted to draft because I overlooked that what I actually needed was [...]. My use case is: I have several huge HDF files, I create one HDFDataset per file and then combine them via [...]. I have thought about writing a [...].
I haven't really looked into this too much yet. I just wanted to comment on two ways you could do shuffling on huge data: [...]
Yes, what I'm doing is basically the first way, except that HDFDataset is not iterator-like. However, it has no overhead in representing an "almost infinite" map-like dataset, because nothing is loaded on initialization.
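For illustration only (a hypothetical `LazyMapDataset`, not HDFDataset itself): the point is that a map-like dataset can defer all loading to item access, so even a huge corpus costs nothing at initialization and shuffling only touches the index order.

```python
import numpy

class LazyMapDataset:
    """Hypothetical map-like dataset: nothing is loaded at initialization."""
    def __init__(self, num_seqs, load_seq):
        self.num_seqs = num_seqs   # only metadata is kept in memory
        self._load_seq = load_seq  # actual data is read on demand

    def __getitem__(self, seq_idx):
        return self._load_seq(seq_idx)

# Shuffling only needs an index permutation, never the data itself.
dataset = LazyMapDataset(num_seqs=10**6, load_seq=lambda i: {"seq_idx": i})
order = numpy.random.RandomState(42).permutation(dataset.num_seqs)
first_items = [dataset[i] for i in order[:4]]
```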
Actually, I think at no point in the code do we use anything other than [...]
This should be acceptable now. The output of [...]
I kind of forgot about this PR. @JackTemaki I did the requested changes, so please update the status.
Sorry for the delay. I think it's fine now.
Merging now, as I think @JackTemaki's concerns were also addressed (and he is on vacation currently).
If training on huge datasets (e.g. 100M sequences), storing a full list of indices for the sequence order uses a significant amount of memory, especially when using a plain Python list and not a numpy array. With `default` sequence ordering this can be avoided by using `arange` instead. The full list is now only created after applying the partition epoch, which usually means orders of magnitude fewer sequences. I also made creating the sequence orderings a bit faster by using numpy. Again, with huge datasets that can make a difference. The random seed will effectively be different before and after this commit; not sure whether this is a problem.
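For illustration, here is a minimal sketch of the idea (hypothetical function and parameter names, not the actual `get_seq_order_for_epoch()` code): with `default` ordering, `numpy.arange` stands in for a full Python list, and only the slice for the current sub-epoch is kept after applying the partition epoch.

```python
import numpy

def seq_order_for_epoch(num_seqs, epoch, partition_epoch=1, ordering="default"):
    """Illustrative sketch only, not the actual implementation."""
    if ordering == "default":
        # Identity order: a compact numpy array instead of a Python list of ints.
        full_order = numpy.arange(num_seqs)
    elif ordering == "random":
        # Seeding by epoch keeps the order reproducible per epoch.
        full_order = numpy.random.RandomState(epoch).permutation(num_seqs)
    else:
        raise ValueError("unsupported ordering %r" % ordering)
    if partition_epoch > 1:
        # Keep only the current sub-epoch's slice; only this (much smaller)
        # part ever needs to be materialized as a Python list, if at all.
        bounds = numpy.linspace(0, num_seqs, partition_epoch + 1, dtype=int)
        part = (epoch - 1) % partition_epoch
        full_order = full_order[bounds[part]:bounds[part + 1]]
    return full_order
```

With, say, 100M sequences and a partition epoch of 100, the returned order holds only about one million indices per sub-epoch instead of the full 100M.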
Unfortunately, the code now sometimes goes back and forth between lists and arrays, but emulating `list.sort(key=...)` with numpy also wouldn't be too nice.
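On that last point: one way to approximate `list.sort(key=...)` with numpy is to precompute the keys and index with `numpy.argsort`; a small sketch with a hypothetical key function:

```python
import numpy

def sort_by_key(seq_index, key):
    """Roughly emulate list.sort(key=...) with numpy (illustrative only)."""
    keys = numpy.array([key(i) for i in seq_index])
    # list.sort is stable, so request a stable sort for a faithful emulation;
    # the result is a numpy array rather than a Python list.
    return numpy.asarray(seq_index)[numpy.argsort(keys, kind="stable")]

# Toy usage: sort sequence indices by a (made-up) sequence length.
order = sort_by_key(numpy.arange(6), key=lambda i: (7 * i + 3) % 5)
```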