
Add oversampling strategy iterable datasets interleave #5036

Conversation

ylacombe
Contributor

Hello everyone,
Following issue #4893 and PR #4831, I propose here an oversampling strategy for an IterableDataset list.
The all_exhausted strategy stops building the new dataset as soon as every sample of every dataset has been added at least once.
It roughly follows the same logic as #4831, namely:

  • if probabilities is None and the strategy is all_exhausted, it simply performs a round-robin interleaving in which exhausted datasets are restarted, and it stops when the longest dataset runs out of samples. In that case the length of the new dataset is $maxLengthDataset \times nbDataset$.
  • if probabilities is not None and the strategy is all_exhausted, it keeps track of the datasets that have run out of samples but continues to draw from them, and it stops as soon as every dataset has run out of samples at least once (see the sketch after this list).

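To make this logic more concrete, here is a minimal, hypothetical sketch of the all_exhausted interleaving. This is not the code of this PR: _HasNextIterator and interleave_all_exhausted are illustrative names (loosely inspired by the HasNextIterator helper mentioned later in the review), and empty datasets are not handled.

import random


class _HasNextIterator:
    """Iterator wrapper that knows whether another element is available."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._advance()

    def _advance(self):
        try:
            self._next_value = next(self._iterator)
            self._has_next = True
        except StopIteration:
            self._next_value = None
            self._has_next = False

    def has_next(self):
        return self._has_next

    def __next__(self):
        if not self._has_next:
            raise StopIteration
        value = self._next_value
        self._advance()
        return value


def interleave_all_exhausted(iterables, probabilities=None, seed=None):
    """Interleave finite iterables, restarting exhausted ones, and stop once
    every iterable has been exhausted at least once (assumes non-empty inputs)."""
    rng = random.Random(seed)
    n = len(iterables)
    iterators = [_HasNextIterator(it) for it in iterables]
    ever_exhausted = [False] * n
    step = 0
    while True:
        # Round-robin order without probabilities, random source otherwise.
        if probabilities is None:
            source = step % n
        else:
            source = rng.choices(range(n), weights=probabilities, k=1)[0]
        step += 1
        yield next(iterators[source])
        if not iterators[source].has_next():
            ever_exhausted[source] = True
            if all(ever_exhausted):
                return  # every dataset has been seen in full at least once
            # Restart the exhausted source so it keeps contributing samples.
            iterators[source] = _HasNextIterator(iterables[source])

With a round-robin order this produces exactly $maxLengthDataset \times nbDataset$ samples, as noted above.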
In order to be consistent with the Dataset behavior, please note that the behavior of the default strategy (first_exhausted) has also changed: it now stops as soon as a dataset runs out of samples, whereas it used to stop only once the StopIteration error was actually raised.
To give an example of this last point, consider the following snippet:

>>> from tests.test_iterable_dataset import *
>>> d1 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [0, 1, 2]])), {}))
>>> d2 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [10, 11, 12, 13]])), {}))
>>> d3 = IterableDataset(ExamplesIterable((lambda: (yield from [(i, {"a": i}) for i in [20, 21, 22, 23, 24]])), {}))
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> [x["a"] for x in dataset]

The result here will then be [10, 0, 11, 1, 2] instead of [10, 0, 11, 1, 2, 20, 12, 13].
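For comparison, assuming the stopping_strategy argument introduced for Dataset in #4831 is exposed the same way for iterable datasets, the oversampling strategy on the same three datasets (without probabilities) would follow the round-robin logic described above:

>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> [x["a"] for x in dataset]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]

that is, $5 \times 3 = 15$ samples, with the shorter d1 and d2 being restarted until d3, the longest dataset, is exhausted.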

I modified the behavior because I found it more consistent with the under/oversampling approach and because it unifies the undersampling and oversampling code, but I remain open to any suggestions.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Sep 28, 2022

The documentation is not available anymore as the PR was closed or merged.

Member

@lhoestq lhoestq left a comment


Awesome thanks ! Good idea to have HasNextIterator :)

I just have one comment:

src/datasets/iterable_dataset.py (review comment, resolved)
Remove resetting of empty iterators

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Member

@lhoestq lhoestq left a comment


LGTM ! Thank you :D

@lhoestq lhoestq merged commit 1529bdc into huggingface:main Sep 30, 2022
@ylacombe ylacombe deleted the add-oversampling-strategy-iterable-datasets-interleave branch September 30, 2022 12:30