Add oversampling strategy iterable datasets interleave #5036
Conversation
…for IterableDatasets
…gMultiSourcesExamplesIterable to avoid code redundancy
The documentation is not available anymore as the PR was closed or merged.
Awesome thanks ! Good idea to have HasNextIterator :)
I just have one comment:
Remove resetting of empty iterators Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
LGTM ! Thank you :D
Hello everyone,
Following issue #4893 and PR #4831, I propose here an oversampling strategy for an `IterableDataset` list.

The `all_exhausted` strategy stops building the new dataset as soon as all samples in each dataset have been added at least once.

It follows roughly the same logic as #4831, namely:
- if `probabilities` is `None` and the strategy is `all_exhausted`, it simply performs a round robin interleaving that stops when the longest dataset is out of samples. Here the new dataset length will be the length of the longest dataset multiplied by the number of datasets.
- if `probabilities` is not `None` and the strategy is `all_exhausted`, it keeps track of the datasets which are out of samples but continues to add them to the new dataset, and stops as soon as every dataset has run out of samples at least once. A sketch of both modes is shown below.
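As a rough illustration of the two modes through the public `interleave_datasets` API (a minimal sketch, assuming a `datasets` version where `IterableDataset.from_generator` is available; the toy datasets and probabilities below are illustrative, not taken from this PR):

```python
from datasets import IterableDataset, interleave_datasets

def gen(values):
    # Toy generator: one column "a" per example (illustrative only)
    for v in values:
        yield {"a": v}

d1 = IterableDataset.from_generator(gen, gen_kwargs={"values": [0, 1, 2]})
d2 = IterableDataset.from_generator(gen, gen_kwargs={"values": [10, 11, 12, 13, 14]})

# Round robin oversampling (no probabilities): stops when the longest dataset is
# exhausted, so the result holds len(longest dataset) * number_of_datasets examples.
ds = interleave_datasets([d1, d2], stopping_strategy="all_exhausted")
print([x["a"] for x in ds])

# Probability-based oversampling: exhausted datasets are restarted until every
# dataset has run out of samples at least once.
ds = interleave_datasets(
    [d1, d2], probabilities=[0.5, 0.5], seed=42, stopping_strategy="all_exhausted"
)
print([x["a"] for x in ds])
```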
In order to be consistent and also to align with the `Dataset` behavior, please note that the behavior of the default strategy (`first_exhausted`) has been changed. Namely, it now really stops when a dataset is out of samples, whereas it used to stop when receiving the `StopIteration` error.

To give an example of the last note, consider the following snippet:
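(The original snippet was not captured in this excerpt; a hypothetical reconstruction consistent with the reported results — three toy datasets whose column `"a"` holds 0–2, 10–13 and 20–24, interleaved with assumed sampling probabilities and a fixed seed — could look like the following.)

```python
from datasets import IterableDataset, interleave_datasets

def gen(values):
    # Toy generator: one column "a" per example (reconstruction, not the PR's exact code)
    for v in values:
        yield {"a": v}

d1 = IterableDataset.from_generator(gen, gen_kwargs={"values": [0, 1, 2]})
d2 = IterableDataset.from_generator(gen, gen_kwargs={"values": [10, 11, 12, 13]})
d3 = IterableDataset.from_generator(gen, gen_kwargs={"values": [20, 21, 22, 23, 24]})

# Default strategy (first_exhausted): stop as soon as one dataset is out of samples.
dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
print([x["a"] for x in dataset])
```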
The result here will then be `[10, 0, 11, 1, 2]` instead of `[10, 0, 11, 1, 2, 20, 12, 13]`.

I modified the behavior because I found it to be consistent with the under/oversampling approach and because it unifies the undersampling and oversampling code, but I remain open to any suggestions.