dataset.shuffle(keep_in_memory=True) is never allowed #514
Comments
Maybe I'm a bit tired but I fail to see the issue here. Since |
I failed to realise that this only applies to Thus, Easily reproducible:
>>> import nlp
>>> data = nlp.load_dataset("cosmos_qa", split="train")
Using custom data configuration default
>>> data.shuffle(keep_in_memory=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vegarab/.conda/envs/torch/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1398, in shuffle
    verbose=verbose,
  File "/home/vegarab/.conda/envs/torch/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1178, in select
    ), "Please use either `keep_in_memory` or `cache_file_name` but not both."
AssertionError: Please use either `keep_in_memory` or `cache_file_name` but not both.
>>> data.select([0], keep_in_memory=True)
# No error |
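One way to picture why `shuffle` can fail while a direct `select` call succeeds is the toy model below. It is not the library's code: `toy_select`, `toy_shuffle`, and the unconditional default cache file name are assumptions made purely for illustration. Only the assertion message is taken from the traceback above.

```python
import random


def toy_select(indices, keep_in_memory=False, cache_file_name=None):
    # Same check as the one reported in the traceback.
    assert not (keep_in_memory and cache_file_name is not None), (
        "Please use either `keep_in_memory` or `cache_file_name` but not both."
    )
    return list(indices)


def toy_shuffle(n, seed=0, keep_in_memory=False, cache_file_name=None):
    # ASSUMPTION: the caller fills in a cache file name unconditionally,
    # even when the user asked for keep_in_memory=True.
    if cache_file_name is None:
        cache_file_name = f"cache-shuffle-{seed}.arrow"  # hypothetical name
    permutation = list(range(n))
    random.Random(seed).shuffle(permutation)
    # Both arguments are now set, so the check in toy_select trips.
    return toy_select(
        permutation,
        keep_in_memory=keep_in_memory,
        cache_file_name=cache_file_name,
    )


def shuffle_fails_with_keep_in_memory():
    # Returns True if the toy shuffle raises the assertion.
    try:
        toy_shuffle(5, keep_in_memory=True)
    except AssertionError:
        return True
    return False
```

Under that assumption, calling `toy_select([0], keep_in_memory=True)` directly passes the check, while `toy_shuffle(5, keep_in_memory=True)` always trips it, which matches the behaviour reported here.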
Oh yes ok got it thanks. Should be fixed if we are happy with #513 indeed. |
My bad. This is actually not fixed in #513. Sorry about that... The buffer and caching mechanisms used in the |
Ok I'll investigate and add a series of tests on the |
Hey, still seeing this issue with the latest version. |
The same :( |
These are the steps needed to fix this issue:
if keep_in_memory and indices_cache_file_name is not None:
    raise ValueError("Please use either `keep_in_memory` or `indices_cache_file_name` but not both.")
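A minimal, self-contained sketch of how that check might sit next to the logic that picks a cache file is shown here. The helper names `check_cache_args` and `resolve_indices_cache_file_name`, and the `default_name` parameter, are hypothetical and not part of the library. The idea is simply that a default file name is only derived when the user did not ask for `keep_in_memory`.

```python
def check_cache_args(keep_in_memory, indices_cache_file_name):
    # Reject only the genuinely contradictory combination.
    if keep_in_memory and indices_cache_file_name is not None:
        raise ValueError(
            "Please use either `keep_in_memory` or `indices_cache_file_name` but not both."
        )


def resolve_indices_cache_file_name(keep_in_memory, indices_cache_file_name, default_name):
    # Hypothetical helper: validate first, then decide where indices go.
    check_cache_args(keep_in_memory, indices_cache_file_name)
    if keep_in_memory:
        # In-memory mode: never fall back to a default cache file.
        return None
    if indices_cache_file_name is not None:
        return indices_cache_file_name
    return default_name
```

With this ordering, `keep_in_memory=True` on its own resolves to no cache file, and the error is raised only when the caller explicitly passes both arguments.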
|
Hi @mariosasko, I have opened PR #5082 |
As of commit ef4aac2, the usage of the parameter `keep_in_memory=True` is never possible: `dataset.select(keep_in_memory=True)`
The commit added the lines
This affects both `shuffle()`, as `select()` is a sub-routine, and `map()`, which has the same check.
I'd love to fix this myself, but I'm unsure what the intention of the assert is, given the rest of the logic in the function concerning `cache_file_name` and `keep_in_memory`.