dataset loading logger level #1948

Closed
stas00 opened this issue Feb 25, 2021 · 3 comments · Fixed by #6019

@stas00
Contributor

stas00 commented Feb 25, 2021

On master, I get this with --dataset_name wmt16 --dataset_config ro-en:

WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/stas/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/9dc00622c30446e99c4c63d12a484ea4fb653f2f37c867d6edcec839d7eae50f/cache-2e01bead8cf42e26.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/stas/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/9dc00622c30446e99c4c63d12a484ea4fb653f2f37c867d6edcec839d7eae50f/cache-ac3bebaf4f91f776.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /home/stas/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/9dc00622c30446e99c4c63d12a484ea4fb653f2f37c867d6edcec839d7eae50f/cache-810c3e61259d73a9.arrow

Why are those WARNINGs? Should be INFO, no?

Warnings should only be used when a user needs to pay attention to something. This is just informative. I'd even say it should be DEBUG, but definitely not WARNING.

Thank you.

@lhoestq
Member

lhoestq commented Feb 25, 2021

These warnings are shown when there's a call to .map, to tell the user that a dataset is reloaded from the cache instead of being recomputed.
They are warnings because we want to make sure users know that it's not recomputed.
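
For readers who already expect the cache to be used, one possible stopgap is to raise the threshold of the logger that emits these lines. This is a minimal sketch, assuming the library uses Python's standard logging module and that the logger is named datasets.arrow_dataset, as the "WARNING:datasets.arrow_dataset:" prefix in the log above suggests. It also silences every other WARNING from that logger, which may not be what you want.

```python
import logging

# Logger name assumed from the "WARNING:datasets.arrow_dataset:" prefix in the log output.
cache_logger = logging.getLogger("datasets.arrow_dataset")

# Raise the threshold: WARNING (and below) from this logger is dropped,
# while ERROR and CRITICAL records still get through.
cache_logger.setLevel(logging.ERROR)
```

This has to run before the .map call that triggers the message.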

@stas00
Contributor Author

stas00 commented Feb 25, 2021

Thank you for explaining the intention, @lhoestq

  1. Could it be then made more human-friendly? Currently the hex gibberish tells me nothing of what's really going on. e.g. the following is instructive, IMHO:
WARNING: wmt16/ro-en/train dataset was loaded from cache instead of being recomputed
WARNING: wmt16/ro-en/validation dataset was loaded from cache instead of being recomputed
WARNING: wmt16/ro-en/test dataset was loaded from cache instead of being recomputed

Note that it removes the not-so-useful hex info and instead tells the user which split it's referring to, though there's probably no harm in keeping the path if it helps with debugging. The key is that now the warning is telling me what it's warning me about.

Warning:Loading cache path

on the other hand isn't telling what it is warning about.

And I still suggest this is INFO level, otherwise you need to turn all 'using cache' statements to WARNING to be consistent. The user is most likely well aware the cache is used for models, etc. So this feels very similar.

  2. Should there be a way for a user to void the warranty with a flag, something like: "I know I'm expecting the cached version to load if it's available - please do not warn me about it"=True

To explain the need: Warnings are a problem, they constantly take attention away because they could be the harbinger of a problem. Therefore I prefer not to have any warnings in the log, and if I get any I usually try to deal with those so that my log is clean.

It's less of an issue for somebody doing long runs. It's a huge issue for someone who does a new run every few minutes and is on the lookout for any potential problems, which is what I have been doing a lot of while integrating DeepSpeed and other things. And since there are already problems to deal with during the integration, it's nice to have a clean log to start with.
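
In the absence of such a flag, something close to it can probably be done on the user side with a standard logging filter that drops only this one message and leaves other warnings alone. A rough sketch, not an existing option of the library: the class name DropCacheNotice is hypothetical, and the logger name and message prefix are assumed from the log lines quoted at the top of this issue.

```python
import logging


class DropCacheNotice(logging.Filter):
    """Hypothetical filter: discard only the 'Loading cached processed dataset' records."""

    # Message prefix assumed from the log lines quoted in this issue.
    PREFIX = "Loading cached processed dataset"

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; every other record passes through.
        return not record.getMessage().startswith(self.PREFIX)


# Logger name assumed from the "WARNING:datasets.arrow_dataset:" prefix.
logging.getLogger("datasets.arrow_dataset").addFilter(DropCacheNotice())
```

Unlike changing the logger level, this keeps any other warning from the same module visible, so the log stays clean without hiding real problems.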

I hope my need is not unreasonable and I was able to explain it adequately.

Thank you.

@yuvalkirstain

Hey, any news about the issue? So many warnings when I'm really ok with the dataset not being recomputed :)
