
datasets 1.6 ignores cache #2387

Closed
stas00 opened this issue May 21, 2021 · 13 comments · Fixed by #2399
Labels: bug (Something isn't working)

@stas00 (Contributor) commented on May 21, 2021:

Moving from huggingface/transformers#11801 (comment)

Quoting @VictorSanh:

I downgraded datasets to 1.5.0 and printed tokenized_datasets.cache_files (L335):

{'train': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-c6aefe81ca4e5152.arrow'}], 'validation': [{'filename': '/home/victor/.cache/huggingface/datasets/openwebtext10k/plain_text/1.0.0/3a8df094c671b4cb63ed0b41f40fb3bd855e9ce2e3765e5df50abcdfb5ec144b/cache-97cf4c813e6469c6.arrow'}]}

while the same command with the latest version of datasets (actually starting at 1.6.0) gives:

{'train': [], 'validation': []}

I also confirm that downgrading to datasets==1.5.0 makes things fast again - i.e. cache is used.

to reproduce:

USE_TF=0 python  examples/pytorch/language-modeling/run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name "stas/openwebtext-10k" \
    --output_dir output_dir \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --max_train_samples 1000 \
    --max_eval_samples 200 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --num_train_epochs 1 \
    --warmup_steps 8 \
    --block_size 64 \
    --fp16 \
    --report_to none

The first time, startup is slow and shows about 5 tqdm bars. It shouldn't do that on subsequent runs, but with datasets>1.5.0 it rebuilds everything on every run.

@lhoestq
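
For reference, the symptom can also be checked directly with a few lines of Python outside of run_clm.py. This is a minimal sketch; the dataset and tokenizer are just the ones used in this report:

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("stas/openwebtext-10k")   # small dataset -> loaded in RAM on datasets 1.6.0+
tok = AutoTokenizer.from_pretrained("gpt2")

tokenized = ds.map(lambda ex: tok(ex["text"]), batched=True)

# On datasets 1.5.0 this prints the .arrow cache paths for each split;
# on 1.6.0 it prints empty lists and the map() above is recomputed on every run.
print(tokenized.cache_files)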

@bhavitvyamalik (Contributor) commented:

Looks like there are multiple issues regarding this (#2386, #2322), and a fix is in progress in #2329. Currently these datasets are being loaded in-memory, which is causing this issue. Quoting @mariosasko here for a quick fix:

set keep_in_memory to False when loading a dataset (sst = load_dataset("sst", keep_in_memory=False)) to prevent it from loading in-memory. Currently, in-memory datasets fail to find cached files due to this check (always False for them)
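
A minimal sketch of that workaround, using the sst example from the quote (any dataset name works the same way):

from datasets import load_dataset

# Memory-map the dataset from disk instead of loading it in RAM,
# so that subsequent map() results can be found in the on-disk cache.
sst = load_dataset("sst", keep_in_memory=False)
print(sst.cache_files)  # should now list the .arrow files backing each split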

@lhoestq (Member) commented on May 21, 2021:

Hi! Since datasets 1.6.0, we no longer keep small datasets (<250MB) on disk; we load them in RAM instead by default. This makes data processing and iterating on the data faster. However, datasets in RAM currently have no way to reload previous results from the cache (since nothing is written to disk). We are working on making caching work for datasets in RAM.

Until then, I'd recommend passing keep_in_memory=False to the calls to load_dataset like here:

https://github.com/huggingface/transformers/blob/223943872e8c9c3fc11db3c6e93da07f5177423f/examples/pytorch/language-modeling/run_clm.py#L233

This way you say explicitly that you want your dataset to stay on the disk, and it will be able to recover previously computed results from the cache.
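
Concretely, the change in run_clm.py would look roughly like this (a sketch; the exact variable names come from the script linked above and may differ slightly between transformers versions):

# in run_clm.py, where the raw dataset is loaded:
datasets = load_dataset(
    data_args.dataset_name,
    data_args.dataset_config_name,
    keep_in_memory=False,  # stay on disk so cached map() results are reused
)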

@VictorSanh (Contributor) commented:

gotcha! thanks Quentin

stas00 reopened this on May 21, 2021
@stas00 (Contributor, Author) commented on May 21, 2021:

OK, it doesn't look like we can use the proposed workaround - see huggingface/transformers#11801

Could you please add an env var so that we can turn off this behavior, which is unwanted in our situation? It is really problematic for dev work, when one needs to restart training very often and needs a quick startup time. Manually editing the standard scripts is not a practical option when one uses the examples.

This could also be a problem for tests, which will be slower because of the lack of cache, although we usually use tiny datasets there. I think we want caching for tests.

Thank you.

Mehrad0711 added a commit to stanford-oval/genienlp that referenced this issue May 21, 2021
@albertvillanova (Member) commented on May 24, 2021:

Hi @stas00,

You are right: an env variable is needed to turn off this behavior. I am adding it.

For the moment there is a config parameter to turn off this behavior: datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = None

You can find this info in the docs:

The default in 🤗Datasets is to memory-map the dataset on drive if its size is larger than datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES (default 250 MiB); otherwise, the dataset is copied in-memory. This behavior can be disabled by setting datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = None, and in this case the dataset is not loaded in memory.
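
In code, with the datasets version discussed here, that looks roughly like this (a sketch):

import datasets

# Disable the "small datasets go to RAM" behavior for this process,
# so load_dataset() memory-maps from disk and can reuse its on-disk cache.
datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = None

ds = datasets.load_dataset("stas/openwebtext-10k")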

@stas00 (Contributor, Author) commented on May 24, 2021:

Yes, but this still requires one to edit the standard example scripts, and if I'm doing that already I might just as well add keep_in_memory=False.

Maybe the low-hanging fruit is to add a MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES env var to match the config, and if the user sets it to 0, it'll be the same as keep_in_memory=False or datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0?

@albertvillanova (Member) commented:

@stas00, however, for the moment setting the value to 0 is equivalent to the opposite, i.e. keep_in_memory=True: 0 is currently the value that means "always load in memory, regardless of dataset size".

Tell me if this is logical/convenient, or whether I should change it.

@albertvillanova (Member) commented:

In my PR, to turn off the current default behavior, you should set the env variable to one of: {"", "OFF", "NO", "FALSE"}.

For example:

MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=

@stas00 (Contributor, Author) commented on May 24, 2021:

IMHO, this behaviour is not very intuitive, as 0 is a normal quantity of bytes. So MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0 reads to me as "never load datasets in memory".

Also, a variable named "SIZE_IN_BYTES" that can take one of {"", "OFF", "NO", "FALSE"} is quite odd.

I think supporting a very simple MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES that accepts any numerical value, matching the name of the variable, requires minimal logic and is very straightforward.

So if you could adjust this logic, then MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=0 would be all that's needed to turn off in-memory datasets.

Does it make sense?

@albertvillanova (Member) commented on May 24, 2021:

I understand your point, @stas00; I am not very convinced by the current implementation myself.

My concern is: which numerical value should a user then pass who wants keep_in_memory=True by default, independently of the dataset size? Currently it is 0 for that case.

@stas00 (Contributor, Author) commented on May 24, 2021:

That's a good question, and again a plain byte count can be used for that:

MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES=1e12 # (~2**40)

Since it's unlikely that anybody will have more than 1TB RAM.

It's also a bit silly that it uses BYTES and not MBYTES; that level of granularity doesn't seem to be of practical use in this context.

Not sure when it was added and if there are back-compat issues here, but perhaps it could be renamed MAX_IN_MEMORY_DATASET_SIZE and support 1M, 1G, 1T, etc.

But scientific notation is quite intuitive too, as each additional three zeros is the next M, G, T multiplier, minus the discrepancy of 1024 vs 1000, which adds up. And it is easy to write 1e12, as compared to 1099511627776 (2**40). (1.1e12 is more exact.)
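
To illustrate the suggestion (a purely hypothetical sketch, not the library's actual code): an env var where 0 means "never load in memory" and which also accepts metric suffixes and scientific notation takes only a few lines to parse:

import os

_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def max_in_memory_size(default: int = 250 * 2**20) -> int:
    # Hypothetical parser: returns the max dataset size (in bytes) to load in RAM.
    # 0 disables in-memory loading entirely; suffixes K/M/G/T and scientific
    # notation (e.g. "1e12") are also accepted.
    raw = os.environ.get("MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", "").strip().upper()
    if not raw:
        return default
    if raw[-1] in _SUFFIXES:
        return int(float(raw[:-1]) * _SUFFIXES[raw[-1]])
    return int(float(raw))  # handles "0", "250000000", "1e12", ...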

@albertvillanova (Member) commented:

Great! Thanks, @stas00.

I am implementing your suggestion: setting the value to 0 will turn off the default in-memory behavior.

For the other suggestion (allowing different metric prefixes), I will discuss with @lhoestq to agree on its implementation.

@stas00 (Contributor, Author) commented on May 24, 2021:

Awesome! Thank you, @albertvillanova!!!
