dataset.shuffle(keep_in_memory=True) is never allowed #514

vegarab · 2020-08-18T18:47:40Z

As of commit ef4aac2, it is never possible to use the parameter keep_in_memory=True, e.g. dataset.select(keep_in_memory=True)

The commit added the lines

# lines 994-996 in src/nlp/arrow_dataset.py
assert (
    not keep_in_memory or cache_file_name is None
), "Please use either `keep_in_memory` or `cache_file_name` but not both."

This affects both shuffle(), since select() is a sub-routine of it, and map(), which has the same check.

I'd love to fix this myself, but I'm unsure what the intention of the assert is, given the rest of the logic in the function concerning cache_file_name and keep_in_memory.

vegarab · 2020-08-18T19:08:26Z

This seems to be fixed in #513 for the filter function, replacing cache_file_name with indices_cache_file_name in the assert. Although not for the map() function @thomwolf

thomwolf · 2020-08-18T19:13:14Z

Maybe I'm a bit tired but I fail to see the issue here.

Since cache_file_name is None by default, if you set keep_in_memory to True, the assert should pass, no?

vegarab · 2020-08-18T19:21:40Z

I failed to realise that this only applies to shuffle(). Whenever keep_in_memory is set to True, this is passed on to the select() function. However, if cache_file_name is None, it will be defined in the shuffle() function before it is passed on to select().

Thus, select() is called with keep_in_memory=True and a not None value for cache_file_name.
This is essentially fixed in #513

Easily reproducible:

>>> import nlp
>>> data = nlp.load_dataset("cosmos_qa", split="train")
Using custom data configuration default
>>> data.shuffle(keep_in_memory=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vegarab/.conda/envs/torch/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1398, in shuffle
    verbose=verbose,
  File "/home/vegarab/.conda/envs/torch/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1178, in select
    ), "Please use either `keep_in_memory` or `cache_file_name` but not both."
AssertionError: Please use either `keep_in_memory` or `cache_file_name` but not both.
>>> data.select([0], keep_in_memory=True)
# No error
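
For anyone who wants to see the mechanism without installing the library, here is a minimal stand-alone sketch. It is not the actual nlp code: the two functions are simplified stand-ins, and the default file name "cache-shuffled.arrow" is a made-up placeholder. It only assumes what is described above, namely that shuffle() fills in cache_file_name before delegating to select(), which carries the assert.

```python
def select(indices, keep_in_memory=False, cache_file_name=None):
    # Same check as the assert quoted in this issue.
    assert (
        not keep_in_memory or cache_file_name is None
    ), "Please use either `keep_in_memory` or `cache_file_name` but not both."
    return list(indices)


def shuffle(n, keep_in_memory=False, cache_file_name=None):
    # Simplified stand-in: a default cache file name is filled in
    # before delegating to select(). The name is a hypothetical placeholder.
    if cache_file_name is None:
        cache_file_name = "cache-shuffled.arrow"
    return select(
        range(n),
        keep_in_memory=keep_in_memory,
        cache_file_name=cache_file_name,
    )


# Direct call: cache_file_name stays None, so the assert passes.
print(select([0], keep_in_memory=True))

# Via shuffle(): cache_file_name is no longer None, so the assert fails.
try:
    shuffle(3, keep_in_memory=True)
except AssertionError as err:
    print("AssertionError:", err)
```

In this sketch the user never passes cache_file_name, yet select() still sees both arguments set, which matches the traceback above.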

thomwolf · 2020-08-18T19:44:29Z

Oh yes ok got it thanks. Should be fixed if we are happy with #513 indeed.

vegarab · 2020-08-19T11:36:02Z

My bad. This is actually not fixed in #513. Sorry about that...
The new indices_cache_file_name is set to a non-None value in the new shuffle() as well.

The buffer and caching mechanisms used in the select() function are too intricate for me to understand why the check is there at all. I've removed it in my local build and it seems to be working fine for my project, without really considering other implications of the change.

thomwolf · 2020-08-19T12:18:46Z

Ok, I'll investigate and add a series of tests on the keep_in_memory=True setting, which is under-tested atm

epwalsh · 2021-07-23T18:07:10Z

Hey, still seeing this issue with the latest version.

koren-v · 2022-09-19T03:18:40Z

The same :(

mariosasko · 2022-10-04T17:43:13Z

These are the steps needed to fix this issue:

add the following check to Dataset.shuffle:

if keep_in_memory and indices_cache_file_name is not None:
    raise ValueError("Please use either `keep_in_memory` or `indices_cache_file_name` but not both.")

set indices_cache_file_name to None if keep_in_memory is True in the call to select
add a test with shuffle(keep_in_memory=True)
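
The three steps above could look roughly like the following stand-alone sketch. This is an illustration under assumptions, not the code merged in the PR: select() and shuffle() are simplified free functions rather than Dataset methods, the default name "cache-shuffled.arrow" is a hypothetical placeholder, and the seed parameter is only there to make the permutation reproducible.

```python
import random

_ERR = "Please use either `keep_in_memory` or `indices_cache_file_name` but not both."


def select(indices, keep_in_memory=False, indices_cache_file_name=None):
    # Stand-in for select(): rejects the conflicting combination.
    if keep_in_memory and indices_cache_file_name is not None:
        raise ValueError(_ERR)
    return list(indices)


def shuffle(n, seed=None, keep_in_memory=False, indices_cache_file_name=None):
    # Step 1: validate what the caller passed, before any default is filled in.
    if keep_in_memory and indices_cache_file_name is not None:
        raise ValueError(_ERR)

    # Default cache file name (hypothetical placeholder).
    if indices_cache_file_name is None:
        indices_cache_file_name = "cache-shuffled.arrow"

    permutation = random.Random(seed).sample(range(n), n)

    # Step 2: pass None to select() when keeping the indices in memory.
    return select(
        permutation,
        keep_in_memory=keep_in_memory,
        indices_cache_file_name=None if keep_in_memory else indices_cache_file_name,
    )


# Step 3: a small test for shuffle(keep_in_memory=True).
assert sorted(shuffle(5, seed=0, keep_in_memory=True)) == [0, 1, 2, 3, 4]
```

Doing the check in shuffle() before the default is assigned means the error is raised only when the caller explicitly asks for both options.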

Mustapha-AJEGHRIR · 2022-10-06T11:33:03Z

Hi @mariosasko , I have opened this PR #5082

vegarab closed this as completed Aug 18, 2020

vegarab reopened this Aug 18, 2020

vegarab changed the title ~~dataset.select(keep_in_memory=True) is never allowed~~ dataset.shuffle(keep_in_memory=True) is never allowed Aug 19, 2020

mariosasko added the good first issue and hacktoberfest labels Oct 4, 2022

Mustapha-AJEGHRIR mentioned this issue Oct 6, 2022

adding keep in memory #5082

Merged

mariosasko closed this as completed Oct 10, 2022
