Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calls to map are not cached. #2322

Closed
villmow opened this issue May 5, 2021 · 6 comments
Closed

Calls to map are not cached. #2322

villmow opened this issue May 5, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@villmow
Copy link

villmow commented May 5, 2021

Describe the bug

Somehow caching does not work for me anymore. Am I doing something wrong, or is there anything that I missed?

Steps to reproduce the bug

import datasets
datasets.set_caching_enabled(True)
sst = datasets.load_dataset("sst")

def foo(samples, i):
    print("executed", i[:10])
    return samples

# first call
x = sst.map(foo, batched=True, with_indices=True,  num_proc=2)

print('\n'*3, "#" * 30, '\n'*3)

# second call
y = sst.map(foo, batched=True, with_indices=True, num_proc=2)

# print version
import sys
import platform
print(f"""
- Datasets: {datasets.__version__}
- Python: {sys.version}
- Platform: {platform.platform()}
""")

Actual results

This code prints the following output for me:

No config specified, defaulting to: sst/default
Reusing dataset sst (/home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
#0:   0%|          | 0/5 [00:00<?, ?ba/s]
#1:   0%|          | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|██████████| 5/5 [00:00<00:00, 59.85ba/s]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|██████████| 5/5 [00:00<00:00, 60.85ba/s]
#0:   0%|          | 0/1 [00:00<?, ?ba/s]
#1:   0%|          | 0/1 [00:00<?, ?ba/s]executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#0: 100%|██████████| 1/1 [00:00<00:00, 69.32ba/s]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#1: 100%|██████████| 1/1 [00:00<00:00, 70.93ba/s]
#0:   0%|          | 0/2 [00:00<?, ?ba/s]
#1:   0%|          | 0/2 [00:00<?, ?ba/s]executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
#0: 100%|██████████| 2/2 [00:00<00:00, 63.25ba/s]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|██████████| 2/2 [00:00<00:00, 57.69ba/s]



 ############################## 



#0:   0%|          | 0/5 [00:00<?, ?ba/s]
#1:   0%|          | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|██████████| 5/5 [00:00<00:00, 58.10ba/s]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|██████████| 5/5 [00:00<00:00, 57.19ba/s]
#0:   0%|          | 0/1 [00:00<?, ?ba/s]
#1:   0%|          | 0/1 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#0: 100%|██████████| 1/1 [00:00<00:00, 60.10ba/s]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#1: 100%|██████████| 1/1 [00:00<00:00, 53.82ba/s]
#0:   0%|          | 0/2 [00:00<?, ?ba/s]
#1:   0%|          | 0/2 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
#0: 100%|██████████| 2/2 [00:00<00:00, 72.76ba/s]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|██████████| 2/2 [00:00<00:00, 71.55ba/s]

- Datasets: 1.6.1
- Python: 3.8.3 (default, May 19 2020, 18:47:26) 
[GCC 7.3.0]
- Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.10

Expected results

Caching should work.

@villmow villmow added the bug Something isn't working label May 5, 2021
@villmow
Copy link
Author

villmow commented May 5, 2021

I tried upgrading to datasets==1.6.2 and downgrading to 1.6.0. Both versions produce the same output.

Downgrading to 1.5.0 works and produces the following output for me:

Downloading: 9.20kB [00:00, 3.94MB/s]                   
Downloading: 5.99kB [00:00, 3.29MB/s]                   
No config specified, defaulting to: sst/default
Downloading and preparing dataset sst/default (download: 6.83 MiB, generated: 3.73 MiB, post-processed: Unknown size, total: 10.56 MiB) to /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b...
                                    Dataset sst downloaded and prepared to /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b. Subsequent calls will reuse this data.
executed [0, 1]
#0:   0%|          | 0/5 [00:00<?, ?ba/s]
#1:   0%|          | 0/5 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [4272, 4273, 4274, 4275, 4276, 4277, 4278, 4279, 4280, 4281]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
executed [5272, 5273, 5274, 5275, 5276, 5277, 5278, 5279, 5280, 5281]
executed [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009]
executed [6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279, 6280, 6281]
executed [3000, 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009]
executed [7272, 7273, 7274, 7275, 7276, 7277, 7278, 7279, 7280, 7281]
executed [4000, 4001, 4002, 4003, 4004, 4005, 4006, 4007, 4008, 4009]
#0: 100%|██████████| 5/5 [00:00<00:00, 94.83ba/s]
executed [8272, 8273, 8274, 8275, 8276, 8277, 8278, 8279, 8280, 8281]
#1: 100%|██████████| 5/5 [00:00<00:00, 92.75ba/s]
executed [0, 1]
#0:   0%|          | 0/1 [00:00<?, ?ba/s]
#1:   0%|          | 0/1 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [551, 552, 553, 554, 555, 556, 557, 558, 559, 560]
#0: 100%|██████████| 1/1 [00:00<00:00, 118.81ba/s]
#1: 100%|██████████| 1/1 [00:00<00:00, 123.06ba/s]
executed [0, 1]
#0:   0%|          | 0/2 [00:00<?, ?ba/s]
#1:   0%|          | 0/2 [00:00<?, ?ba/s]
executed [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
executed [1105, 1106, 1107, 1108, 1109, 1110, 1111, 1112, 1113, 1114]
executed [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009]
#0: 100%|██████████| 2/2 [00:00<00:00, 119.42ba/s]
executed [2105, 2106, 2107, 2108, 2109, 2110, 2111, 2112, 2113, 2114]
#1: 100%|██████████| 2/2 [00:00<00:00, 123.33ba/s]



 ############################## 



executed [0, 1]
Loading cached processed dataset at /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b/cache-6079777aa097c8f8.arrow
Loading cached processed dataset at /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b/cache-2dc05c46f68eda6e.arrow
executed [0, 1]
Loading cached processed dataset at /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b/cache-1ca347e7430b98f1.arrow
Loading cached processed dataset at /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b/cache-c0f1a73ce3ba40cd.arrow
executed [0, 1]
Loading cached processed dataset at /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b/cache-832a1407bf1ac5b7.arrow
Loading cached processed dataset at /home/johannes/.cache/huggingface/datasets/sst/default/1.0.0/a16a45566b63b2c3179e6c1d0f8edadde56e45570ee8cf99394fbb738491d34b/cache-036316a259b773c4.arrow
- Datasets: 1.5.0
- Python: 3.8.3 (default, May 19 2020, 18:47:26) 
[GCC 7.3.0]
- Platform: Linux-5.4.0-72-generic-x86_64-with-glibc2.10

@mariosasko
Copy link
Collaborator

Hi,

set keep_in_memory to False when loading a dataset (sst = load_dataset("sst", keep_in_memory=False)) to prevent it from loading in-memory. Currently, in-memory datasets fail to find cached files due to this check (always False for them):

if self.cache_files:

@albertvillanova It seems like this behavior was overlooked in #2182.

@albertvillanova
Copy link
Member

Hi @villmow, thanks for reporting.

As @mariosasko has pointed out, we did not consider this case when introducing the feature of automatic in-memory for small datasets. This needs to be fixed.

@lhoestq
Copy link
Member

lhoestq commented May 5, 2021

Hi ! Currently a dataset that is in memory doesn't know doesn't know in which directory it has to read/write cache files.
On the other hand, a dataset that loaded from the disk (via memory mapping) uses the directory from which the dataset is located to read/write cache files.

Because of that, currently in-memory datasets simply don't use caching.

Maybe a Dataset object could have a cache_dir that is set to the directory where the arrow files are created during load_dataset ?

@albertvillanova
Copy link
Member

Fixed once reverted the default in-memory feature:
Closed by #2460 (to close issue #2458).

@albertvillanova
Copy link
Member

albertvillanova commented Jun 8, 2021

Please @villmow, feel free to update to Datasets latest version (1.8).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants