Saving large in-memory datasets with save_to_disk crashes because of pickling #2134

Closed
prokopCerny opened this issue Mar 29, 2021 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@prokopCerny

prokopCerny commented Mar 29, 2021

Using Datasets 1.5.0 on Python 3.7.
Recently I've been working on medium to large datasets (pretokenized raw text sizes from a few gigabytes to low tens of gigabytes), and have found that several preprocessing steps are massively faster when done in memory. Since I can requisition a lot of RAM, I decided to do these steps entirely outside of the datasets library.

So my workflow is to run several .map() calls on the dataset object. Then, for the operation that is faster in memory, I extract the necessary columns from the dataset, drop the dataset entirely, do the transformation in memory, and create a fresh Dataset object using .from_dict() or another method.

When I then call save_to_disk(path) on the dataset, it crashes during pickling, apparently because it uses an old pickle protocol that doesn't support large objects (over 4 GiB).

Traceback (most recent call last):
  File "./tokenize_and_chunkify_in_memory.py", line 80, in <module>
    main()
  File "./tokenize_and_chunkify_in_memory.py", line 75, in main
    tokenize_and_chunkify(config)
  File "./tokenize_and_chunkify_in_memory.py", line 60, in tokenize_and_chunkify
    contexts_dataset.save_to_disk(chunked_path)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 457, in save_to_disk
    self = pickle.loads(pickle.dumps(self))
OverflowError: cannot serialize a bytes object larger than 4 GiB

From what I've seen, this issue may already be fixed, as the line self = pickle.loads(pickle.dumps(self)) does not appear to be present in the current state of the repository.

To save these datasets to disk, I've resorted to calling .map() over them with function=None and specifying the .arrow cache file, and then creating a new dataset using the .from_file() method, which I can then safely save to disk.

An additional issue when working with these large in-memory datasets shows up with multiprocessing, and it again has to do with pickling. I tried to speed up the mapping with function=None by setting num_proc to the available CPU count, and I again get issues with transferring the dataset, with the following traceback. I am not sure if I should open a separate issue for that.

Traceback (most recent call last):
  File "./tokenize_and_chunkify_in_memory.py", line 94, in <module>
    main()
  File "./tokenize_and_chunkify_in_memory.py", line 89, in main
    tokenize_and_chunkify(config)
  File "./tokenize_and_chunkify_in_memory.py", line 67, in tokenize_and_chunkify
    contexts_dataset.map(function=None, cache_file_name=str(output_dir_path / "tmp.arrow"), writer_batch_size=50000, num_proc=config.threads)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1485, in map
    transformed_shards = [r.get() for r in results]
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1485, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/pool.py", line 657, in get
    raise self._value
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/pool.py", line 431, in _handle_tasks
    put(task)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/dill/_dill.py", line 454, in dump
    StockPickler.dump(self, obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 662, in save_reduce
    save(state)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 843, in _batch_appends
    save(x)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 846, in _batch_appends
    save(tmp[0])
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 846, in _batch_appends
    save(tmp[0])
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 846, in _batch_appends
    save(tmp[0])
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 843, in _batch_appends
    save(x)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 732, in save_bytes
    self._write_large_bytes(BINBYTES + pack("<I", n), obj)
struct.error: 'I' format requires 0 <= number <= 4294967295
@prokopCerny prokopCerny changed the title Working with large in-memory datasets Working with large in-memory datasets crashes because of pickling Mar 29, 2021
@prokopCerny prokopCerny changed the title Working with large in-memory datasets crashes because of pickling Saving large in-memory datasets with save_to_disk crashes because of pickling Mar 29, 2021
@lhoestq
Member

lhoestq commented Mar 29, 2021

Hi !
Indeed, save_to_disk doesn't call pickle anymore. However, the OverflowError can still appear for in-memory datasets bigger than 4GB. This happens, for example, when doing this:

import pickle

import pyarrow as pa

# 2**29 int64 values, i.e. a 4 GiB buffer
arr = pa.array([0] * ((4 * 8 << 30) // 64))
table = pa.Table.from_arrays([arr], names=["foo"])
pickle.dumps(table)  # fails with an OverflowError
pickle.dumps(table, 4)  # works !

We'll do the change to use protocol=4.
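
As a rough illustration of what passing protocol=4 changes, the sketch below pickles a small stand-in object with an explicit protocol and reads the protocol number back from the pickle header. The payload is a hypothetical placeholder rather than a real Arrow table, and only standard-library calls are used:

```python
import pickle

# Small stand-in payload (hypothetical); a real >4 GiB table is not needed
# to see which protocol a pickle was written with.
payload = {"foo": bytes(16)}

blob_v3 = pickle.dumps(payload, protocol=3)
blob_v4 = pickle.dumps(payload, protocol=4)

# For protocol >= 2, a pickle starts with the PROTO opcode (0x80)
# followed by one byte holding the protocol number.
print(blob_v3[:2], blob_v4[:2])

# Small objects round-trip under either protocol; protocol 4 is the one
# that added support for very large objects (PEP 3154).
restored = pickle.loads(blob_v4)
```

On Python 3.7 the default protocol is 3, which is why the call without an explicit protocol fails in the example above.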

Moreover I've also seen other users complain about this error

struct.error: 'I' format requires 0 <= number <= 4294967295

It looks like something related to the 4GB limit as well but I'm not able to reproduce on my side.
Do you think you can provide a script that reproduces the issue ?
How big is your dataset ? (number of bytes, number of rows)

@prokopCerny
Author

prokopCerny commented Mar 30, 2021

Hi!
So I've managed to create a minimal working (well, technically crashing) example for the multiprocessing case. I create a huge list of zeros, like in your example, and then try to .map(None, num_proc=2) over it, which crashes. Here's the code:

from datasets import Dataset

if __name__ == '__main__':
    ton_of_zeroes = [0] * ((12 * 8 << 30) // 64)
    large_dataset = Dataset.from_dict({'col': ton_of_zeroes})
    print("Start")
    large_dataset.map(function=None, num_proc=2)
    print("Done - should not print")

The number of zeros could probably be reduced; I haven't tried to minimize it to find the breaking point. I just increased it from your code (which, at a quick glance, I assumed tried to allocate over 4 GiB).
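
For reference, here is a small sketch of the size arithmetic behind both snippets. It assumes the integers end up stored as 64-bit values (8 bytes each) and that num_proc=2 splits the rows evenly; the helper name is made up for illustration:

```python
BYTES_PER_INT64 = 8  # assumption: the column is stored as 64-bit integers
GIB = 1 << 30


def n_rows(gib):
    # Same expression as in the snippets: (gib * 8 << 30) bits // 64 bits per value
    return (gib * 8 << 30) // 64


rows_small = n_rows(4)             # the pyarrow example
rows_large = n_rows(12)            # the Dataset.from_dict example
rows_per_shard = rows_large // 2   # num_proc=2, assuming an even split

print(rows_small, rows_small * BYTES_PER_INT64 / GIB)          # 536870912 4.0
print(rows_large, rows_large * BYTES_PER_INT64 / GIB)          # 1610612736 12.0
print(rows_per_shard, rows_per_shard * BYTES_PER_INT64 / GIB)  # 805306368 6.0
```

Under these assumptions each of the two shards holds about 6 GiB of values, so a single shard is already past the 4 GiB mark.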

Running this results in the following traceback:

Parameter 'indices'=[        0         1         2 ... 805306365 805306366 805306367] of the transform datasets.arrow_dataset.Dataset.select couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
Traceback (most recent call last):
  File "./crash_multiproc_pickle.py", line 7, in <module>
    large_dataset.map(function=None, num_proc=2)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1485, in map
    transformed_shards = [r.get() for r in results]
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 1485, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/pool.py", line 657, in get
    raise self._value
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/pool.py", line 431, in _handle_tasks
    put(task)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/connection.py", line 209, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/multiprocess/reduction.py", line 54, in dumps
    cls(buf, protocol, *args, **kwds).dump(obj)
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/dill/_dill.py", line 454, in dump
    StockPickler.dump(self, obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 662, in save_reduce
    save(state)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/cernypro/dev/envs/huggingface_gpu/lib/python3.7/site-packages/dill/_dill.py", line 941, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 859, in save_dict
    self._batch_setitems(obj.items())
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 885, in _batch_setitems
    save(v)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 846, in _batch_appends
    save(tmp[0])
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 846, in _batch_appends
    save(tmp[0])
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 789, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 819, in save_list
    self._batch_appends(obj)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 843, in _batch_appends
    save(x)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 638, in save_reduce
    save(args)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 774, in save_tuple
    save(element)
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/mnt/appl/software/Python/3.7.4-GCCcore-8.3.0/lib/python3.7/pickle.py", line 732, in save_bytes
    self._write_large_bytes(BINBYTES + pack("<I", n), obj)
struct.error: 'I' format requires 0 <= number <= 4294967295

My datasets usually have hundreds of thousands to low millions of rows, with each row containing a list of 10 strings and a list of vectors of different lengths (the tokenized strings), which in the worst case take 10*512*8 = 40960 bytes (but usually it is much smaller, as the vectors tend to be shorter). I need these groups of text lines to create training data for the Inverse Cloze Task.

Anyway I don't think my particular dataset is relevant, as the tiny script I created also manages to crash.
But I think the issue is the same as with save_to_disk: from the traceback it seems that, in multiprocessing, it tries to use dill to return the result of the map workers, which tries to pickle the data and can't, probably because it's again using the older pickle protocol. That's my guess anyway.
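
The last frame of the traceback packs the byte length with the struct format "<I", which is an unsigned 32-bit integer. A minimal standard-library sketch of why a length of 4 GiB or more cannot fit there, while the 64-bit "<Q" format can hold it (variable names are illustrative only):

```python
import struct

FOUR_GIB = 1 << 32  # 4294967296, one more than the largest 32-bit value

# The largest length that still fits in the 32-bit field.
header_ok = struct.pack("<I", FOUR_GIB - 1)

# One byte more and packing fails with the same struct.error as in the traceback.
try:
    struct.pack("<I", FOUR_GIB)
    fits_in_32_bits = True
except struct.error as exc:
    fits_in_32_bits = False
    print("struct.error:", exc)

# A 64-bit length field has no such problem; pickle protocol 4 uses one
# for bytes objects above this size.
header_64 = struct.pack("<Q", FOUR_GIB)
print(len(header_ok), len(header_64))
```

This matches the limit in the error message: 4294967295 is 2**32 - 1.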

@lhoestq lhoestq added the bug Something isn't working label Mar 30, 2021
@lhoestq lhoestq self-assigned this Mar 30, 2021
@lhoestq
Member

lhoestq commented Mar 31, 2021

I just merged a fix (#2150) that allows pickling tables bigger than 4GiB.
Feel free to try it on the master branch !

@samsontmr

awesome! I started getting this error as well when I tried to tokenize with a longer sequence length

@samsontmr

@prokopCerny does this fix work for you? I found that with the latest master, my container with 500GB RAM starts crashing when I try to map a large dataset using num_proc.

@lhoestq would it be possible to implement some logic to keep the individual cache files small (say below 100mb)? I find this helps with loading large datasets, but the "hack" I was using (increasing num_proc to a large number) doesn't work anymore with the latest master; my container crashes even with num_proc=200 now

@lhoestq
Member

lhoestq commented May 3, 2021

Closing since the original issue was fixed in #2150
Feel free to reopen if you are still experiencing it.
For the other problems, please open separate issues
