"socket.error: [Errno 111] Connection refused" while training with multiple workers #11872
Comments
This is related to a recent change in which we switched from shared memory to file descriptors on Linux for inter-process communication. Still investigating solutions for that.
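For readers unfamiliar with the mechanism, the sketch below is a minimal, self-contained illustration of passing a file descriptor between processes with Python's `multiprocessing.reduction` helpers; it is not MXNet's actual code path. In the Python 2.7 traceback further down, the descriptor is instead fetched indirectly via `rebuild_handle`, which connects back to a listener in the sending process, which would explain a refused connection if that process has already gone away.

```python
import multiprocessing as mp
import os
# send_handle/recv_handle exist on Unix in both Python 2.7 and 3.x
from multiprocessing.reduction import send_handle, recv_handle


def producer(conn):
    # Write a small result into a pipe and hand its read end to the parent.
    r, w = os.pipe()
    os.write(w, b'batch-0')
    os.close(w)
    send_handle(conn, r, os.getppid())  # fd travels over a Unix socket (SCM_RIGHTS)
    os.close(r)


if __name__ == '__main__':
    parent_conn, child_conn = mp.Pipe()  # duplex Pipe is socket-based on Linux
    p = mp.Process(target=producer, args=(child_conn,))
    p.start()
    fd = recv_handle(parent_conn)        # parent gets its own copy of the fd
    p.join()
    print(os.read(fd, 1024))             # b'batch-0'
    os.close(fd)
```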
Temporary solutions:
I have figured out that the pre-fetch strategy for the data loader is too aggressive, which might cause the related issue with shared memory.
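To make "too aggressive pre-fetch" concrete: if the loader keeps submitting batches without bound, every not-yet-consumed batch sits in shared memory at the same time. The sketch below is a generic illustration of capping the number of outstanding batches; it is not the dataloader's implementation, and the 2x-workers factor is just an assumed choice.

```python
import multiprocessing as mp


def load_batch(indices):
    # Stand-in for "read samples + batchify"; a real loader would return arrays.
    return [i * 2 for i in indices]


if __name__ == '__main__':
    num_workers = 4
    max_prefetch = 2 * num_workers           # cap on batches alive at once (assumed factor)
    batches = [list(range(i, i + 8)) for i in range(0, 256, 8)]

    pool = mp.Pool(num_workers)
    it = iter(batches)
    pending = []

    # Prime the pipeline with a bounded number of async requests ...
    for _ in range(max_prefetch):
        pending.append(pool.apply_async(load_batch, (next(it),)))

    # ... then only submit a new batch once an old one has been consumed,
    # so at most `max_prefetch` un-consumed results are outstanding at any time.
    for _ in range(len(batches)):
        batch = pending.pop(0).get()
        try:
            pending.append(pool.apply_async(load_batch, (next(it),)))
        except StopIteration:
            pass
        # a training step would consume `batch` here

    pool.close()
    pool.join()
```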
Thanks @zhreshold, I will follow this PR. Yes, even with few (0/1) workers, resource usage was quite high, requiring more shared memory than usual.
With #11908 merged, I am closing this for now. Feel free to ping me if it still exists.
I am using the latest master and the issue still persists in Docker. Even num_workers = 1 causes a hang in the dataloader's while True loop.
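One way to confirm that the multiprocessing path (and hence shared memory / fd passing) is the culprit is to drop back to single-process loading. A minimal sketch with a toy dataset (the dataset choice is illustrative, not taken from this thread):

```python
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import MNIST

# num_workers=0 loads data in the main process and bypasses the worker
# processes entirely; if the hang disappears, the multiprocessing/shared
# memory path is the likely cause.
dataset = MNIST(train=True)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)
data, label = next(iter(loader))
print(data.shape, label.shape)
```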
@ifeherva |
@zhreshold Good point. How much shared memory is recommended for MXNet?
That should scale with the input batch_size, data shape, and number of workers; usually several GB is recommended for multi-GPU training.
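As a rough back-of-the-envelope check (the prefetch depth used here is an assumption, not a documented constant):

```python
# Rough estimate only; the number of batches alive at once is assumed.
batch_size   = 32
data_shape   = (3, 512, 512)        # C x H x W, float32
bytes_per_el = 4
num_workers  = 8
prefetch     = 2 * num_workers      # assumed batches resident in /dev/shm

batch_bytes = batch_size * bytes_per_el
for dim in data_shape:
    batch_bytes *= dim

total_gb = batch_bytes * prefetch / float(1024 ** 3)
print('approx. shared memory needed: %.2f GB' % total_gb)
# ~1.5 GB for this configuration; several GB of /dev/shm leaves headroom.
```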
@zhreshold Adding shared memory to docker solved the problem. Thanks! |
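For context: Docker containers get a 64 MB /dev/shm by default, far below the several-GB figure above; it can be raised with the `--shm-size` option of `docker run` (or by sharing the host IPC namespace via `--ipc=host`). A quick way to see what the container actually has, assuming a Linux container:

```python
import os

# Report total and free space on /dev/shm (Linux-only; figures in GB).
st = os.statvfs('/dev/shm')
total_gb = st.f_blocks * st.f_frsize / float(1024 ** 3)
free_gb  = st.f_bavail * st.f_frsize / float(1024 ** 3)
print('/dev/shm total: %.2f GB, free: %.2f GB' % (total_gb, free_gb))
```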
@djaym7 This worked for me: aws/sagemaker-python-sdk#937 (comment)
Hi,
I am getting the following error after a few data iterations (at 551/22210):
File "train.py", line 201, in
trainer.training(epoch)
File "train.py", line 142, in training
for i, (data, target) in enumerate(tbar):
File "/usr/local/lib/python2.7/dist-packages/tqdm/_tqdm.py", line 930, in iter
for obj in iterable:
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 222, in next
return self.next()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 218, in next
idx, batch = self._data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 117, in get
res = self._recv()
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 88, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/mxnet/gluon/data/dataloader.py", line 53, in rebuild_ndarray
fd = multiprocessing.reduction.rebuild_handle(fd)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 156, in rebuild_handle
conn = Client(address, authkey=current_process().authkey)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 169, in Client
c = SocketClient(address)
File "/usr/lib/python2.7/multiprocessing/connection.py", line 308, in SocketClient
s.connect(address)
File "/usr/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
I am using the latest nightly of MXNet along with the newly added SyncBatchNorm layer; the error occurs both with and without SyncBatchNorm.
I am using the MXNet Docker image.
Any help is much appreciated.
dmlc/gluon-cv#215
@zhreshold would you be able to comment on this?