bookcorpusopen no longer works #3167

Closed
lucadiliello opened this issue Oct 26, 2021 · 3 comments · Fixed by #3280
Labels
bug Something isn't working

Comments

@lucadiliello
Contributor

Describe the bug

With the latest version of datasets (1.14.0), I can no longer load the bookcorpusopen dataset. While preparing the dataset, the process always stalls at around 9924 examples [00:06, 1439.61 examples/s]. After about half an hour the process is killed automatically because of its RAM usage (the machine has 1 TB of RAM).

This did not happen with 1.4.1.
I also tried rm -rf ~/.cache/huggingface, but that did not help.
Switching the Python version between 3.7, 3.8 and 3.9 did not help either.

Steps to reproduce the bug

import datasets
d = datasets.load_dataset('bookcorpusopen')

Expected results

The dataset is downloaded and prepared without stalling, as it was with 1.4.1.

Actual results

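Dataset preparation stalls at around 9924 examples, and the process is eventually killed because of its RAM usage. No traceback is available.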

Environment info

  • datasets version: 1.14.0
  • Platform: Linux-5.4.0-1054-aws-x86_64-with-glibc2.27
  • Python version: 3.9.7
  • PyArrow version: 4.0.1
lucadiliello added the bug label on Oct 26, 2021
@lhoestq
Member

lhoestq commented Nov 16, 2021

Hi! Thanks for reporting :) I think #3280 should fix this.

lhoestq self-assigned this on Nov 16, 2021
@lhoestq
Member

lhoestq commented Nov 16, 2021

I tried the latest changes from #3280 on Google Colab and it worked fine :)
We'll do a new release soon. In the meantime, you can use the updated version with:

load_dataset("bookcorpusopen", revision="master")

@albertvillanova
Member

Fixed by #3280.
