
Multi-process data loader bug with TensorField (RuntimeError: received 0 items of ancdata) #4847

Closed
epwalsh opened this issue Dec 7, 2020 · 2 comments

epwalsh commented Dec 7, 2020

I discovered this issue while using the new MultiprocessDataLoader with num_workers > 0 and max_instances_in_memory set to a high number (1000 in my case) to load batches built from instances containing TensorFields.

  ...
  File "/home/epwalsh/AllenAI/allennlp/allennlp/data/data_loaders/multi_process_data_loader.py", line 236, in __iter__
    yield from self._iter_batches()
  File "/home/epwalsh/AllenAI/allennlp/allennlp/data/data_loaders/multi_process_data_loader.py", line 421, in _iter_batches
    raise e
RuntimeError: received 0 items of ancdata

The issue stems from the fact that tensors are passed between processes using shared memory, but some systems (like the one I was on) may have strict limits on shared memory by default. So if you pile too many tensors into shared memory by setting max_instances_in_memory too high, you're going to run into this. See pytorch/pytorch#973 (comment).
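(One workaround discussed in that PyTorch thread is to raise the process's open file descriptor limit, since the default file_descriptor sharing strategy consumes a descriptor per shared tensor. A rough sketch on Linux/macOS, run at the top of your training script:

    import resource

    # Check the current soft/hard limits on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Raise the soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

This only papers over the symptom, though.)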

Luckily the solution is simple: either decrease max_instances_in_memory (bringing it down to 100 worked in my case), or increase the shared memory available to your training process.
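For example, with the programmatic API it would look something like this (constructor and argument names are from my reading of the current code and may differ by version; the reader and data path are placeholders):

    from allennlp.data.data_loaders import MultiProcessDataLoader

    # `my_reader` stands in for your own DatasetReader.
    loader = MultiProcessDataLoader(
        reader=my_reader,
        data_path="/path/to/data",
        batch_size=32,
        num_workers=2,
        # Keeping this modest bounds how many tensors sit in shared
        # memory at once; 100 resolved the error in my case.
        max_instances_in_memory=100,
    )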

@github-actions

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇


Vimos commented Apr 11, 2022

I'm seeing a similar issue when using several workers for the loader:

    import numpy as np
    from allennlp.data.fields import TensorField

    # Inside my dataset reader: attach the label as a TensorField.
    label = example.get('label')
    if label is not None:
        fields['label'] = TensorField(np.array(label))

If I comment out the fields['label'] assignment, loading succeeds.

In my case, changing PyTorch's multiprocessing sharing strategy can also resolve the issue: torch.multiprocessing.set_sharing_strategy("file_system").
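(For anyone trying this: the call is a global PyTorch setting, so I put it at the top of the training script, before the data loader spawns any workers:

    import torch.multiprocessing

    # Share tensors through files on disk instead of holding one
    # file descriptor open per shared tensor.
    torch.multiprocessing.set_sharing_strategy("file_system")

)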

However, I suspect the design of TensorField may be the root cause, since it keeps every tensor on the CPU, and each tensor shared between worker processes consumes a file descriptor under the default sharing strategy.
