
datasets 1.6 ignores cache #2387

Closed
stas00 opened this issue May 21, 2021 · 13 comments · Fixed by #2399
Labels: bug (Something isn't working)

@stas00 (Contributor) commented on May 21, 2021:

Moving from huggingface/transformers#11801 (comment)

Quoting @VictorSanh:

I downgraded datasets to 1.5.0 and printed tokenized_datasets.cache_files (L335):

{'train': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-c6aefe81ca4e5152.arrow'}], 'validation': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-97cf4c813e6469c6.arrow'}]}

while the same command with the latest version of datasets (actually starting at 1.6.0) gives:

{'train': [], 'validation': []}

I also confirm that downgrading to datasets==1.5.0 makes things fast again - i.e. cache is used.

to reproduce:

USE_TF=0 python  examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name "stas/openwebtext-10k" \
    --output_dir output_dir \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --max_train_samples 1000 \
    --max_eval_samples 200 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_train_epochs 1 \
    --warmup_steps 8 \
    --block_size 64 \
    --fp16 \
    --report_to none

The first time, startup is slow and shows about 5 tqdm bars. It shouldn't do that on subsequent runs, but with datasets>1.5.0 it rebuilds everything on every run.

@lhoestq
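
For reference, the symptom can also be checked directly with a few lines of Python outside of run_clm.py. This is a minimal sketch; the dataset and tokenizer are just the ones used in this report:

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("stas/openwebtext-10k")   # small dataset -> loaded in RAM on datasets 1.6.0+
tok = AutoTokenizer.from_pretrained("gpt2")

tokenized = ds.map(lambda ex: tok(ex["text"]), batched=True)

# On datasets 1.5.0 this prints the .arrow cache paths for each split;
# on 1.6.0 it prints empty lists and the map() above is recomputed on every run.
print(tokenized.cache_files)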

@bhavitvyamalik (Contributor) commented:

Looks like there are multiple issues regarding this (#2386, #2322), and a fix is in progress in #2329. Currently these datasets are being loaded in-memory, which is causing this issue. Quoting @mariosasko here for a quick fix:

set keep_in_memory to False when loading a dataset (sst = load_dataset("sst", keep_in_memory=False)) to prevent it from loading in-memory. Currently, in-memory datasets fail to find cached files due to this check (always False for them)
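
A minimal sketch of that workaround, using the sst example from the quote (any dataset name works the same way):

from datasets import load_dataset

# Memory-map the dataset from disk instead of loading it in RAM,
# so that subsequent map() results can be found in the on-disk cache.
sst = load_dataset("sst", keep_in_memory=False)
print(sst.cache_files)  # should now list the .arrow files backing each split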

@lhoestq (Member) commented on May 21, 2021:

Hi! Since datasets 1.6.0, we no longer keep small datasets (<250MB) on disk; we load them in RAM instead by default. This makes data processing and iterating on the data faster. However, datasets in RAM currently have no way to reload previous results from the cache (since nothing is written to disk). We are working on making caching work for datasets in RAM.

Until then, I'd recommend passing keep_in_memory=False to the calls to load_dataset like here:

https://github.com/huggingface/transformers/blob/223943872e8c9c3fc11db3c6e93da07f5177423f/examples/pytorch/language-modeling/run_clm.py#L233

This way you say explicitly that you want your dataset to stay on the disk, and it will be able to recover previously computed results from the cache.
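
Concretely, the change in run_clm.py would look roughly like this (a sketch; the exact variable names come from the script linked above and may differ slightly between transformers versions):

# in run_clm.py, where the raw dataset is loaded:
datasets = load_dataset(
    data_args.dataset_name,
    data_args.dataset_config_name,
    keep_in_memory=False,  # stay on disk so cached map() results are reused
)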

@VictorSanh (Contributor) commented:

gotcha! thanks Quentin

stas00 reopened this on May 21, 2021
@stas00 (Contributor, Author) commented on May 21, 2021:

OK, it doesn't look like we can use the proposed workaround - see huggingface/transformers#11801

Could you please add an env var so that we can turn off this behavior, which is unwanted in our situation? It is really problematic for dev work, when one needs to restart training very often and needs a quick startup time. Manually editing the standard scripts is not a practical option when one uses the examples.

This could also be a problem for tests, which will be slower because of the lack of cache, although we usually use tiny datasets there. I think we want caching for tests.

Thank you.

Mehrad0711 added a commit to stanford-oval/genienlp that referenced this issue May 21, 2021
@albertvillanova (Member) commented on May 24, 2021:

Hi @stas00,

You are right: an env variable is needed to turn off this behavior. I am adding it.

For the moment there is a config parameter to turn off this behavior: datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = None

You can find this info in the docs:

The default in 🤗Datasets is to memory-map the dataset on drive if its size is larger than datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES (default 250 MiB); otherwise, the dataset is copied in-memory. This behavior can be disabled by setting datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = None, and in this case the dataset is not loaded in memory.
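
In code, with the datasets version discussed here, that looks roughly like this (a sketch):

import datasets

# Disable the "small datasets go to RAM" behavior for this process,
# so load_dataset() memory-maps from disk and can reuse its on-disk cache.
datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = None

ds = datasets.load_dataset("stas/openwebtext-10k")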

@stas00 (Contributor, Author) commented on May 24, 2021:

Yes, but this still requires one to edit the standard example scripts, and if I'm doing that already I might just as well add keep_in_memory=False.

Maybe the low-hanging fruit is to add a MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES env var to match the config, and if the user sets it to 0, it'll be the same as keep_in_memory=False or datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0?

@albertvillanova (Member) commented:

@stas00, however, for the moment setting the value to 0 is equivalent to the opposite, i.e. keep_in_memory=True: 0 is currently the value that means "always load in memory, regardless of dataset size".

Tell me if this is logical/convenient, or whether I should change it.

@albertvillanova (Member) commented:

In my PR, to turn off the current default behavior, you should set the env variable to one of: {"", "OFF", "NO", "FALSE"}.

For example:

MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=

@stas00 (Contributor, Author) commented on May 24, 2021:

IMHO, this behaviour is not very intuitive, as 0 is a normal quantity of bytes. So MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0 reads to me as "never load datasets in memory".

Also, a variable named "SIZE_IN_BYTES" that can take one of {"", "OFF", "NO", "FALSE"} is quite odd.

I think supporting a very simple MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES that accepts any numerical value, matching the name of the variable, requires minimal logic and is very straightforward.

So if you could adjust this logic, then MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0 would be all that's needed to turn off in-memory datasets.

Does it make sense?

@albertvillanova (Member) commented on May 24, 2021:

I understand your point, @stas00; I am not very convinced by the current implementation myself.

My concern is: which numerical value should a user then pass who wants keep_in_memory=True by default, independently of the dataset size? Currently it is 0 for that case.

@stas00 (Contributor, Author) commented on May 24, 2021:

That's a good question, and again a plain byte count can be used for that:

MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=1e12 # (~2**40)

Since it's unlikely that anybody will have more than 1TB RAM.

It's also a bit silly that it uses BYTES and not MBYTES; that level of granularity doesn't seem to be of practical use in this context.

Not sure when it was added and if there are back-compat issues here, but perhaps it could be renamed MAX_IN_MEMORY_DATASET_SIZE and support 1M, 1G, 1T, etc.

But scientific notation is quite intuitive too, as each additional three zeros is the next M, G, T multiplier, minus the discrepancy of 1024 vs 1000, which adds up. And it is easy to write 1e12, as compared to 1099511627776 (2**40). (1.1e12 is more exact.)
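
To illustrate the suggestion (a purely hypothetical sketch, not the library's actual code): an env var where 0 means "never load in memory" and which also accepts metric suffixes and scientific notation takes only a few lines to parse:

import os

_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def max_in_memory_size(default: int = 250 * 2**20) -> int:
    # Hypothetical parser: returns the max dataset size (in bytes) to load in RAM.
    # 0 disables in-memory loading entirely; suffixes K/M/G/T and scientific
    # notation (e.g. "1e12") are also accepted.
    raw = os.environ.get("MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", "").strip().upper()
    if not raw:
        return default
    if raw[-1] in _SUFFIXES:
        return int(float(raw[:-1]) * _SUFFIXES[raw[-1]])
    return int(float(raw))  # handles "0", "250000000", "1e12", ...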

@albertvillanova (Member) commented:

Great! Thanks, @stas00.

I am implementing your suggestion: setting the value to 0 will turn off the default in-memory behavior.

For the other suggestion (allowing different metric prefixes), I will discuss with @lhoestq to agree on its implementation.

@stas00 (Contributor, Author) commented on May 24, 2021:

Awesome! Thank you, @albertvillanova!!!
