
pytorch dataloader with datakek can't pickle transforms lambda function on windows #26

Open
metya opened this issue Jul 24, 2019 · 3 comments

Comments

@metya
Contributor

metya commented Jul 24, 2019

On Windows there is a bug, described here https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857 and here pytorch/vision#689 and here pytorch/ignite#377

It only appears when num_workers for the torch DataLoader is greater than 0; with num_workers=0 everything runs normally.

So if num_workers > 0 and there is a lambda function in the transforms code, for example:

def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, lambda x: cv2.resize(x, (size, size))) # <-- here
    AUGS = Transformer(dataset_key, lambda x: augs()(image=x)["image"]) # <-- here
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, to_torch()), # <-- and here: to_torch() contains a lambda internally
        Transformer(dataset_key, normalize())
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms

I get exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-35-87bd5485ec48> in <module>
      4 # !rm -r lrlogs/*
      5 
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
      7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
    407             self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
    408             self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409                      opt=opt, opt_params=opt_params)
    410         finally:
    411             self.callbacks = callbacks

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
    276             for epoch in range(epochs):
    277                 self.set_mode("train")
--> 278                 self._run_epoch(epoch, epochs)
    279 
    280                 if not skip_val:

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in _run_epoch(self, epoch, epochs)
    425 
    426         with torch.set_grad_enabled(self.is_train):
--> 427             for i, batch in enumerate(self.state.core.loader):
    428                 self.callbacks.on_batch_begin(i, self.state)
    429 

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    191 
    192     def __iter__(self):
--> 193         return _DataLoaderIter(self)
    194 
    195     def __len__(self):

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    467                 #     before it starts, and __del__ tries to join but will get:
    468                 #     AssertionError: can only join a started process.
--> 469                 w.start()
    470                 self.index_queues.append(index_queue)
    471                 self.workers.append(w)

D:\metya\Anaconda3\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

D:\metya\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)

D:\metya\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'get_transforms.<locals>.<lambda>'

So I changed all lambda functions to regular named functions and replaced to_torch() with torchvision.transforms.ToTensor() (I even monkey-patched the kekas transformation.py source), roughly like the sketch below.
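A rough sketch of that lambda-free variant (it assumes the kekas Transformer, augs, and normalize helpers exactly as used in the snippet above; the exact signatures are an assumption, the point is just that only module-level functions and functools.partial objects get pickled to the workers):

from functools import partial

import cv2
from torchvision import transforms


def resize_image(x, size):
    # Module-level function: picklable, unlike a lambda defined inside get_transforms
    return cv2.resize(x, (size, size))


def augment_image(x):
    # Same idea for the augmentation call that used to be a lambda
    return augs()(image=x)["image"]


def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, partial(resize_image, size=size))
    AUGS = Transformer(dataset_key, augment_image)
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, transforms.ToTensor()),  # instead of to_torch()
        Transformer(dataset_key, normalize()),
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms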

With that change it works for me with num_workers=0, but with num_workers > 0 it fails with:

---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
    510         try:
--> 511             data = self.data_queue.get(timeout=timeout)
    512             return (True, data)

D:\metya\Anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
    104                     if not self._poll(timeout):
--> 105                         raise Empty
    106                 elif not self._poll():

Empty: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-106-87bd5485ec48> in <module>
      4 # !rm -r lrlogs/*
      5 
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
      7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
    407             self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
    408             self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409                      opt=opt, opt_params=opt_params)
    410         finally:
    411             self.callbacks = callbacks

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
    276             for epoch in range(epochs):
    277                 self.set_mode("train")
--> 278                 self._run_epoch(epoch, epochs)
    279 
    280                 if not skip_val:

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in _run_epoch(self, epoch, epochs)
    425 
    426         with torch.set_grad_enabled(self.is_train):
--> 427             for i, batch in enumerate(self.state.core.loader):
    428                 self.callbacks.on_batch_begin(i, self.state)
    429 

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    574         while True:
    575             assert (not self.shutdown and self.batches_outstanding > 0)
--> 576             idx, batch = self._get_batch()
    577             self.batches_outstanding -= 1
    578             if idx != self.rcvd_idx:

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_batch(self)
    551         else:
    552             while True:
--> 553                 success, data = self._try_get_batch()
    554                 if success:
    555                     return data

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
    517             if not all(w.is_alive() for w in self.workers):
    518                 pids_str = ', '.join(str(w.pid) for w in self.workers if not w.is_alive())
--> 519                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    520             if isinstance(e, queue.Empty):
    521                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 11236, 6592) exited unexpectedly

I think it is a common bug with workers on Windows; I found related issues such as pytorch/pytorch#8976 and pytorch/pytorch#5301.

Funnily enough, if num_workers is set back to 0 and the lambdas are restored, everything works fine again.

So maybe the problem is not the lambdas in the kekas code but rather Windows, DataLoaders, and multiprocessing interacting badly; I honestly don't know.
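For reference, the recurring advice in those linked PyTorch threads for Windows is to define transform callables in an importable module (as in the sketch above) and to keep DataLoader creation and iteration under an if __name__ == "__main__": guard, because Windows starts worker processes with the spawn method and re-imports the main module in each worker. A minimal, self-contained sketch with placeholder names (not the kekas API):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Windows spawns DataLoader workers by re-importing this module; the guard
    # keeps the loading loop from running again inside every worker process.
    ds = TensorDataset(torch.arange(8.0).unsqueeze(1))
    for batch in DataLoader(ds, batch_size=2, num_workers=2):
        print(batch)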

@fschlatt

fschlatt commented Aug 19, 2019

When setting num_workers=0 the DataLoader doesn't use multiprocessing; multiprocessing is only used for num_workers > 0. That's why it works with num_workers=0.
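A toy illustration of that point (an assumption-laden sketch, not the kekas/DataKek code): with num_workers=0 the dataset stays in the main process and is never pickled, so even a lambda transform works; with num_workers > 0 on Windows the same dataset must be pickled for each spawned worker and fails.

import torch
from torch.utils.data import DataLoader, Dataset


class LambdaDataset(Dataset):
    def __init__(self, transform):
        self.transform = transform

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return self.transform(torch.tensor([float(idx)]))


ds = LambdaDataset(transform=lambda x: x * 2)  # lambda stored on the dataset

# Works: no worker processes, nothing gets pickled.
for batch in DataLoader(ds, batch_size=2, num_workers=0):
    print(batch)

# On Windows this raises "Can't pickle local object ... <lambda>" instead,
# because each spawned worker needs a pickled copy of the dataset:
# for batch in DataLoader(ds, batch_size=2, num_workers=2):
#     print(batch)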

@belskikh
Owner

Yeah, it's a Windows bug; I cannot reproduce it.

@rfan-debug

It seems that macOS has the same issue. Here is my runtime config:

Apple M2 Max
Python 3.10.9
torch==1.13.1
