Cannot shuffle interleaved IterableDataset with "all_exhausted" stopping strategy #5812

Closed
offchan42 opened this issue May 2, 2023 · 0 comments · Fixed by #5816
Labels
bug Something isn't working streaming

offchan42 commented May 2, 2023

Describe the bug

Shuffling an interleaved IterableDataset that was created with the "all_exhausted" stopping strategy stops iteration early: some examples from the source datasets are never yielded.

Steps to reproduce the bug

from datasets import IterableDataset, interleave_datasets

# Yield `length` examples whose values start at `bias`
def gen(bias, length):
  for i in range(length):
    yield dict(a=bias+i)

seed = 42
probabilities = [0.2, 0.6, 0.2]
# Three sources with disjoint value ranges: 0..2, 10..13, 20..22
d1 = IterableDataset.from_generator(lambda: gen(0, 3))
d2 = IterableDataset.from_generator(lambda: gen(10, 4))
d3 = IterableDataset.from_generator(lambda: gen(20, 3))
ds = interleave_datasets([d1, d2, d3], probabilities=probabilities, seed=seed, stopping_strategy='all_exhausted')
# Adding this shuffle is what triggers the early stop
ds = ds.shuffle(buffer_size=1000)
for x in ds:
  print(x)

This code prints only six examples:

{'a': 0}
{'a': 22}
{'a': 20}
{'a': 21}
{'a': 10}
{'a': 1}

Expected behavior

It should produce a longer list of examples, one that exhausts all three datasets.
If you comment out the shuffle line, all the datasets are exhausted as expected.
Here is the output with the shuffle line commented out:

{'a': 10}
{'a': 11}
{'a': 20}
{'a': 12}
{'a': 0}
{'a': 21}
{'a': 13}
{'a': 10}
{'a': 1}
{'a': 11}
{'a': 12}
{'a': 22}
{'a': 13}
{'a': 20}
{'a': 10}
{'a': 11}
{'a': 12}
{'a': 2}
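
For reference, the behavior I expect from "all_exhausted" can be modelled in plain Python roughly as below. This is a simplified sketch of my understanding, not the library's implementation: `interleave_all_exhausted` and `make_gen` are hypothetical names, the random draws will not match what `datasets` produces for the same seed, and it assumes every source is non-empty and every probability is positive.

```python
import random

def interleave_all_exhausted(factories, probabilities, seed):
    """Sample sources by probability, restarting a source when it runs out,
    and stop only once every source has run out at least once."""
    rng = random.Random(seed)
    iterators = [make() for make in factories]
    exhausted = [False] * len(factories)
    while True:
        # Pick the index of the source to read from next
        i = rng.choices(range(len(factories)), weights=probabilities)[0]
        try:
            yield next(iterators[i])
        except StopIteration:
            exhausted[i] = True
            if all(exhausted):
                return
            # Oversample: restart the finished source and keep going
            iterators[i] = factories[i]()
            yield next(iterators[i])

def make_gen(bias, length):
    # Factory returning a fresh iterator over bias .. bias+length-1
    return lambda: iter({"a": bias + k} for k in range(length))

factories = [make_gen(0, 3), make_gen(10, 4), make_gen(20, 3)]
rows = list(interleave_all_exhausted(factories, [0.2, 0.6, 0.2], seed=42))
values = {row["a"] for row in rows}
print(values == {0, 1, 2, 10, 11, 12, 13, 20, 21, 22})  # True
```

Under this model every value from every source appears at least once, which the unshuffled output above does and the shuffled output does not.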

Environment info

  • datasets version: 2.12.0
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.10.11
  • Huggingface_hub version: 0.14.1
  • PyArrow version: 9.0.0
  • Pandas version: 1.5.3

This was run on Google Colab.
