Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset.shuffle(keep_in_memory=True) is never allowed #514

Closed
vegarab opened this issue Aug 18, 2020 · 10 comments
Closed

dataset.shuffle(keep_in_memory=True) is never allowed #514

vegarab opened this issue Aug 18, 2020 · 10 comments
Labels

Comments

@vegarab
Copy link
Contributor

vegarab commented Aug 18, 2020

As of commit ef4aac2, the usage of the parameter keep_in_memory=True is never possible: dataset.select(keep_in_memory=True)

The commit added the lines

# lines 994-996 in src/nlp/arrow_dataset.py
       assert (
            not keep_in_memory or cache_file_name is None
        ), "Please use either `keep_in_memory` or `cache_file_name` but not both."

This affects both shuffle() as select() is a sub-routine, and map() that has the same check.

I'd love to fix this myself, but unsure what the intention of the assert is given the rest of the logic in the function concerning ccache_file_name and keep_in_memory.

@vegarab vegarab closed this as completed Aug 18, 2020
@vegarab vegarab reopened this Aug 18, 2020
@vegarab
Copy link
Contributor Author

vegarab commented Aug 18, 2020

This seems to be fixed in #513 for the filter function, replacing cache_file_name with indices_cache_file_name in the assert. Although not for the map() function @thomwolf

@thomwolf
Copy link
Member

Maybe I'm a bit tired but I fail to see the issue here.

Since cache_file_name is None by default, if you set keep_in_memory to True, the assert should pass, no?

@vegarab
Copy link
Contributor Author

vegarab commented Aug 18, 2020

I failed to realise that this only applies to shuffle(). Whenever keep_in_memory is set to True, this is passed on to the select() function. However, if cache_file_name is None, it will be defined in the shuffle() function before it is passed on to select().

Thus, select() is called with keep_in_memory=True and a not None value for cache_file_name.
This is essentially fixed in #513

Easily reproducible:

>>> import nlp
>>> data = nlp.load_dataset("cosmos_qa", split="train")
Using custom data configuration default
>>> data.shuffle(keep_in_memory=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vegarab/.conda/envs/torch/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1398, in shuffle
    verbose=verbose,
  File "/home/vegarab/.conda/envs/torch/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1178, in select
    ), "Please use either `keep_in_memory` or `cache_file_name` but not both."
AssertionError: Please use either `keep_in_memory` or `cache_file_name` but not both.
>>>data.select([0], keep_in_memory=True)
# No error

@thomwolf
Copy link
Member

Oh yes ok got it thanks. Should be fixed if we are happy with #513 indeed.

@vegarab
Copy link
Contributor Author

vegarab commented Aug 19, 2020

My bad. This is actually not fixed in #513. Sorry about that...
The new indices_cache_file_name is set to a non-None value in the new shuffle() as well.

The buffer and caching mechanisms used in the select() function are too intricate for me to understand why the check is there at all. I've removed it in my local build and it seems to be working fine for my project, without really considering other implications of the change.

@vegarab vegarab changed the title dataset.select(keep_in_memory=True) is never allowed dataset.shuffle(keep_in_memory=True) is never allowed Aug 19, 2020
@thomwolf
Copy link
Member

Ok I'll investigate and add a series of tests on the keep_in_memory=True settings which is under-tested atm

@epwalsh
Copy link

epwalsh commented Jul 23, 2021

Hey, still seeing this issue with the latest version.

@koren-v
Copy link

koren-v commented Sep 19, 2022

The same :(

@mariosasko
Copy link
Collaborator

These are the steps needed to fix this issue:

  1. add the following check to Dataset.shuffle:
if keep_in_memory and indices_cache_file_name is not None:
    raise ValueError("Please use either `keep_in_memory` or `indices_cache_file_name` but not both.")
  1. set indices_cache_file_name to None if keep_in_memory is True in the call to select
  2. add a test with shuffle(keep_in_memory=True)

@Mustapha-AJEGHRIR
Copy link
Contributor

Hi @mariosasko , I have opened this PR #5082

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants