Saving large in-memory datasets with save_to_disk crashes because of pickling #2134
Comments
Hi !

```python
import pyarrow as pa
import pickle

# roughly 4 GiB of int64 zeros
arr = pa.array([0] * ((4 * 8 << 30) // 64))
table = pa.Table.from_arrays([arr], names=["foo"])

pickle.dumps(table)     # fails with an OverflowError
pickle.dumps(table, 4)  # works !
```

We'll do the change to use pickle protocol 4. Moreover, I've also seen other users complain about this error. It looks like something related to the 4 GiB limit as well, but I'm not able to reproduce on my side.
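For context, the limit comes from the pickle protocol itself rather than from pyarrow; a minimal sketch that reproduces it with plain `pickle` (note it needs well over 8 GiB of RAM to run):

```python
import pickle

blob = b"\x00" * ((4 << 30) + 1)  # a bytes object just over 4 GiB

# protocols below 4 cannot serialize objects larger than 4 GiB:
# pickle.dumps(blob, protocol=3)  # raises OverflowError

# protocol 4 (available since Python 3.4) supports very large objects
data = pickle.dumps(blob, protocol=4)
```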
Hi! Here is a small script that reproduces it:

```python
from datasets import Dataset

if __name__ == '__main__':
    # roughly 12 GiB of int64 zeros
    ton_of_zeroes = [0] * ((12 * 8 << 30) // 64)
    large_dataset = Dataset.from_dict({'col': ton_of_zeroes})
    print("Start")
    large_dataset.map(function=None, num_proc=2)
    print("Done - should not print")
```

The amount of zeros could probably be reduced; I haven't tried to minimize it to find the breaking point, I just increased it from your code (which by a quick glance I assumed tried to allocate over 4 GiB). Running this results in the following traceback:
My datasets usually have hundreds of thousands to low millions of rows, with each row containing a list of 10 strings and a list of vectors of different lengths (the strings tokenized), which in the worst case take 10 * 512 * 8 = 40960 bytes (but it is usually much smaller, as the vectors tend to be shorter). I need these groups of text lines to create training data for the Inverse Cloze Task. Anyway, I don't think my particular dataset is relevant, as the tiny script I created also manages to crash.
I just merged a fix #2150 that allows pickling tables bigger than 4 GiB.
Awesome! I started getting this error as well when I tried to tokenize with a longer sequence length.
@prokopCerny does this fix work for you? I found that with the latest master, my container with 500GB RAM starts crashing when I try to map a large dataset using …

@lhoestq would it be possible to implement some logic to keep the individual cache files small (say below 100 MB)? I find this helps with loading large datasets, but the "hack" I was using (increasing …)
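One possible way to keep the individual files small with the public API is to split the dataset manually with `Dataset.shard` and save each piece separately; a minimal sketch, where the output directory and the 100 MB target are placeholders and `contiguous=True` assumes a `datasets` version that accepts that argument:

```python
from datasets import Dataset, concatenate_datasets, load_from_disk

def save_in_small_shards(dataset: Dataset, out_dir: str,
                         max_shard_bytes: int = 100 * 1024 * 1024) -> int:
    """Split the dataset into shards of roughly max_shard_bytes and save each one."""
    num_shards = dataset.data.nbytes // max_shard_bytes + 1
    for index in range(num_shards):
        shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
        shard.save_to_disk(f"{out_dir}/shard_{index:05d}")
    return num_shards

# the shards can later be reloaded and stitched back together:
# shards = [load_from_disk(f"{out_dir}/shard_{i:05d}") for i in range(num_shards)]
# full = concatenate_datasets(shards)
```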
Closing since the original issue was fixed in #2150
Using Datasets 1.5.0 on Python 3.7.
Recently I've been working on medium to large datasets (pretokenized raw text sizes from a few gigabytes to low tens of gigabytes), and have found out that several preprocessing steps are massively faster when done in memory. Since I have the ability to requisition a lot of RAM, I decided to do these steps completely outside the datasets library.

So my workflow is to do several .map() calls on a Dataset object, then, for the operation which is faster in memory, extract the necessary columns from the dataset and drop it whole, do the transformation in memory, and then create a fresh Dataset object using .from_dict() or another method.

When I then try to call save_to_disk(path) on the dataset, it crashes because of pickling, which appears to be caused by the use of an old pickle protocol which doesn't support large objects (over 4 GiB).

From what I've seen this issue may possibly be fixed already, as the line

```python
self = pickle.loads(pickle.dumps(self))
```

does not appear to be present in the current state of the repository.

To save these datasets to disk, I've resorted to calling .map() over them with function=None and specifying the .arrow cache file, and then creating a new dataset using the .from_file() method, which I can then safely save to disk.

An additional issue when working with these large in-memory datasets, again to do with pickling, comes up when using multiprocessing. I've tried to speed up the mapping with function=None by setting num_proc to the available CPU count, and I again get issues with transferring the dataset, with the following traceback. I am not sure if I should open a separate issue for that.