datasets 1.6 ignores cache #2387
Looks like there are multiple issues regarding this (#2386, #2322), and a fix is WIP in #2329. Currently these datasets are being loaded in-memory, which is what causes this issue. Quoting @mariosasko here for a quick fix:
Hi! Since `datasets` 1.6, small datasets are loaded in memory by default, and in-memory datasets don't currently use the cache (the proper fix is the WIP in #2329). Until then, I'd recommend passing `keep_in_memory=False` to `load_dataset`. This way you say explicitly that you want your dataset to stay on the disk, and it will be able to recover previously computed results from the cache.
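A minimal sketch of that workaround (the dataset name and transform are just placeholders, not the original reporter's script):

```python
from datasets import load_dataset

# keep_in_memory=False keeps the dataset as a memory-mapped Arrow file on
# disk, so deterministic transforms can be recovered from the on-disk cache.
dataset = load_dataset("imdb", keep_in_memory=False)

# On a second run, this map() result is loaded from the cache instead of recomputed.
dataset = dataset.map(lambda example: {"n_chars": len(example["text"])})
```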
gotcha! thanks Quentin
OK, it doesn't look like we can use the proposed workaround - see huggingface/transformers#11801. Could you please add an env var that lets us turn off this behavior, which is unwanted in our situation? It is really problematic for dev work, when one needs to restart the training very often and needs a quick startup time. Manually editing the standard scripts is not a practical option when one uses the stock examples. This could also be a problem for tests, which would be slower without the cache, albeit we usually use tiny datasets there. I think we do want caching for tests. Thank you.
Hi @stas00, You are right: an env variable is needed to turn off this behavior. I am adding it. For the moment, there is a config parameter to turn off this behavior: `datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES`. You can find this info in the docs.
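As a sketch (the attribute name follows the `SIZE_IN_BYTES` name discussed below; exactly which value disables the in-memory path is what the rest of this thread debates):

```python
import datasets
from datasets import load_dataset

# Assumed attribute name, matching the "SIZE_IN_BYTES" name discussed in
# this thread. It is read when load_dataset() runs, so it can be set at
# runtime; in-memory datasets are the ones that bypass the on-disk cache.
datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = 0

ds = load_dataset("imdb")  # "imdb" is just a placeholder dataset
```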
Yes, but this still requires one to edit the standard example scripts, so if I'm doing that already I may just as well add `keep_in_memory=False` to the `load_dataset` call. Maybe the low-hanging fruit is to add env var support for that config parameter, so that nothing needs to be edited at all?
@stas00, however, for the moment, setting the value to 0 means there is no size limit, i.e. datasets are loaded in memory regardless of their size. Tell me if this is logical/convenient, or I should change it.
In my PR, to turn off the current default behavior, you should set the env variable to one of several accepted non-numeric "off" values.
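For example, a sketch of such an invocation (assuming the env var mirrors the config name with an `HF_` prefix, and that `OFF` is one of the accepted values - the authoritative set is in the PR):

```python
import os

# Must be set before `datasets` is imported: the config module reads
# environment variables at import time, not at call time.
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "OFF"  # assumed value

import datasets  # noqa: E402  (deliberately imported after setting the env var)
```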
IMHO, this behaviour is not very intuitive, as 0 is a normal quantity of bytes. So setting the variable to 0 reads to me as "the max dataset size that may be held in memory is 0 bytes", i.e. nothing in memory, feature off. Also, a variable named "SIZE_IN_BYTES" that can take one of several non-numeric string values is quite confusing. I think supporting a very simple numbers-only semantics, where 0 means off, is much more intuitive. So if you could adjust this logic - then 0 turns the feature off, and any positive number is the actual threshold. Does it make sense?
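Under that proposal, the values would read naturally (same assumed variable name as above; whether scientific notation is accepted is part of the ask):

```python
import os

# Proposed numbers-only semantics:
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "0"    # feature off, nothing in memory
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "5e8"  # keep datasets under ~500MB in memory
```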
I understand your point, @stas00, as I am not very convinced by the current implementation either. My concern is: which numerical value should a user then pass who wants no limit at all, i.e. who wants datasets loaded in memory regardless of their size?
That's a good question, and again normal bytes can be used for that: e.g. `1e12`, i.e. about 1TB. Since it's unlikely that anybody will have more than 1TB of RAM, that is effectively "no limit". It's also silly that it uses BYTES and not MBYTES - that level of refinement doesn't seem to be of practical use in this context. Not sure when it was added and whether there are back-compat issues here, but perhaps it could be renamed to take the value in MBs instead. But scientific notation is quite intuitive too, as each group of three zeros is the next K, M, G, T multiplier. Minus the discrepancy of 1024 vs 1000, which adds up. And `1e12` is much easier to write down than the exact binary value.
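The arithmetic behind that remark, as a worked check:

```python
# 1e12 bytes is "about 1TB": decimal and binary units diverge by ~10% at the
# tera scale, since (1024/1000)**4 ≈ 1.0995.
assert 2**40 == 1_099_511_627_776        # 1 TiB in bytes
assert 2**40 - 10**12 == 99_511_627_776  # ~99.5 GB gap between 1 TiB and 1e12
print(f"1e12 bytes = {1e12 / 2**40:.3f} TiB")  # ≈ 0.909 TiB
```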
Awesome! Thank you, @albertvillanova!!!
Moving from huggingface/transformers#11801 (comment)
Quoting @VictorSanh:
I also confirm that downgrading to `datasets==1.5.0` makes things fast again, i.e. the cache is used.

To reproduce: run the same script twice. The first time the startup is slow and shows some 5 tqdm bars; it shouldn't do that on subsequent runs, but with `datasets>1.5.0` it rebuilds on every run.

@lhoestq
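The original reproduction command isn't shown above; a hypothetical minimal stand-in that exhibits the same symptom might be:

```python
import time
from datasets import load_dataset

# Run this script twice. With a working cache the second run starts up
# quickly and skips the map() progress bars; with datasets 1.6 loading
# small datasets in memory, the map() is recomputed on every run.
start = time.time()
ds = load_dataset("wikitext", "wikitext-2-raw-v1")
ds = ds.map(lambda ex: {"n_words": len(ex["text"].split())})
print(f"startup took {time.time() - start:.1f}s")
```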