
pytorch dataloader with datakek can't pickle transforms lambda function on windows #26

Open
metya opened this issue Jul 24, 2019 · 3 comments

Comments

@metya
Contributor

metya commented Jul 24, 2019

On Windows there is a bug, described here https://discuss.pytorch.org/t/cant-pickle-local-object-dataloader-init-locals-lambda/31857 and here pytorch/vision#689 and here pytorch/ignite#377

It only appears when num_workers for the torch DataLoader is greater than 0; with num_workers=0 everything runs normally.

So if num_workers > 0 and there is a lambda function in the transforms code, for example:

def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, lambda x: cv2.resize(x, (size, size))) # <-- here
    AUGS = Transformer(dataset_key, lambda x: augs()(image=x)["image"]) # <-- here
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, to_torch()), # <-- and here: to_torch() contains a lambda internally
        Transformer(dataset_key, normalize())
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms

I get exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-35-87bd5485ec48> in <module>
      4 # !rm -r lrlogs/*
      5 
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
      7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
    407             self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
    408             self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409                      opt=opt, opt_params=opt_params)
    410         finally:
    411             self.callbacks = callbacks

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
    276             for epoch in range(epochs):
    277                 self.set_mode("train")
--> 278                 self._run_epoch(epoch, epochs)
    279 
    280                 if not skip_val:

D:\metya\Anaconda3\lib\site-packages\kekas-0.1.17-py3.7.egg\kekas\keker.py in _run_epoch(self, epoch, epochs)
    425 
    426         with torch.set_grad_enabled(self.is_train):
--> 427             for i, batch in enumerate(self.state.core.loader):
    428                 self.callbacks.on_batch_begin(i, self.state)
    429 

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    191 
    192     def __iter__(self):
--> 193         return _DataLoaderIter(self)
    194 
    195     def __len__(self):

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    467                 #     before it starts, and __del__ tries to join but will get:
    468                 #     AssertionError: can only join a started process.
--> 469                 w.start()
    470                 self.index_queues.append(index_queue)
    471                 self.workers.append(w)

D:\metya\Anaconda3\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

D:\metya\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

D:\metya\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)

D:\metya\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'get_transforms.<locals>.<lambda>'

So I changed all lambda functions to regular named functions and replaced to_torch() with torchvision.transforms.ToTensor() (I even monkey-patched the kekas transformation.py source), roughly like the sketch below.
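A rough sketch of that lambda-free variant (it assumes the kekas Transformer, augs, and normalize helpers exactly as used in the snippet above; the exact signatures are an assumption, the point is just that only module-level functions and functools.partial objects get pickled to the workers):

from functools import partial

import cv2
from torchvision import transforms


def resize_image(x, size):
    # Module-level function: picklable, unlike a lambda defined inside get_transforms
    return cv2.resize(x, (size, size))


def augment_image(x):
    # Same idea for the augmentation call that used to be a lambda
    return augs()(image=x)["image"]


def get_transforms(dataset_key, size, p):
    PRE_TFMS = Transformer(dataset_key, partial(resize_image, size=size))
    AUGS = Transformer(dataset_key, augment_image)
    NRM_TFMS = transforms.Compose([
        Transformer(dataset_key, transforms.ToTensor()),  # instead of to_torch()
        Transformer(dataset_key, normalize()),
    ])
    train_tfms = transforms.Compose([PRE_TFMS, AUGS, NRM_TFMS])
    val_tfms = transforms.Compose([PRE_TFMS, NRM_TFMS])
    return train_tfms, val_tfms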

With that change it works for me with num_workers=0, but with num_workers > 0 it fails with:

---------------------------------------------------------------------------
Empty                                     Traceback (most recent call last)
D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
    510         try:
--> 511             data = self.data_queue.get(timeout=timeout)
    512             return (True, data)

D:\metya\Anaconda3\lib\multiprocessing\queues.py in get(self, block, timeout)
    104                     if not self._poll(timeout):
--> 105                         raise Empty
    106                 elif not self._poll():

Empty: 

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-106-87bd5485ec48> in <module>
      4 # !rm -r lrlogs/*
      5 
----> 6 BCE_keker.kek_lr(final_lr=0.1, logdir=lrlogdir)
      7 # BCE_keker.plot_kek_lr(logdir=lrlogdir)

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek_lr(self, final_lr, logdir, init_lr, n_steps, opt, opt_params)
    407             self.callbacks = Callbacks(self.core_callbacks + [lrfinder_cb])
    408             self.kek(lr=init_lr, epochs=n_epochs, skip_val=True, logdir=logdir,
--> 409                      opt=opt, opt_params=opt_params)
    410         finally:
    411             self.callbacks = callbacks

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in kek(self, lr, epochs, skip_val, opt, opt_params, sched, sched_params, stop_iter, logdir, cp_saver_params, early_stop_params)
    276             for epoch in range(epochs):
    277                 self.set_mode("train")
--> 278                 self._run_epoch(epoch, epochs)
    279 
    280                 if not skip_val:

D:\metya\Anaconda3\lib\site-packages\kekas\keker.py in _run_epoch(self, epoch, epochs)
    425 
    426         with torch.set_grad_enabled(self.is_train):
--> 427             for i, batch in enumerate(self.state.core.loader):
    428                 self.callbacks.on_batch_begin(i, self.state)
    429 

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __next__(self)
    574         while True:
    575             assert (not self.shutdown and self.batches_outstanding > 0)
--> 576             idx, batch = self._get_batch()
    577             self.batches_outstanding -= 1
    578             if idx != self.rcvd_idx:

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _get_batch(self)
    551         else:
    552             while True:
--> 553                 success, data = self._try_get_batch()
    554                 if success:
    555                     return data

D:\metya\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in _try_get_batch(self, timeout)
    517             if not all(w.is_alive() for w in self.workers):
    518                 pids_str = ', '.join(str(w.pid) for w in self.workers if not w.is_alive())
--> 519                 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
    520             if isinstance(e, queue.Empty):
    521                 return (False, None)

RuntimeError: DataLoader worker (pid(s) 11236, 6592) exited unexpectedly

I think it is a common bug with workers on Windows; I found related issues such as pytorch/pytorch#8976 and pytorch/pytorch#5301.

Funnily enough, if num_workers is set back to 0 and the lambdas are restored, everything works fine again.

So maybe the problem is not the lambdas in the kekas code but rather Windows, DataLoaders, and multiprocessing interacting badly; I honestly don't know.
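For reference, the recurring advice in those linked PyTorch threads for Windows is to define transform callables in an importable module (as in the sketch above) and to keep DataLoader creation and iteration under an if __name__ == "__main__": guard, because Windows starts worker processes with the spawn method and re-imports the main module in each worker. A minimal, self-contained sketch with placeholder names (not the kekas API):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Windows spawns DataLoader workers by re-importing this module; the guard
    # keeps the loading loop from running again inside every worker process.
    ds = TensorDataset(torch.arange(8.0).unsqueeze(1))
    for batch in DataLoader(ds, batch_size=2, num_workers=2):
        print(batch)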

@fschlatt

fschlatt commented Aug 19, 2019

When setting num_workers=0 the DataLoader doesn't use multiprocessing; multiprocessing is only used for num_workers > 0. That's why it works with num_workers=0.
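A toy illustration of that point (an assumption-laden sketch, not the kekas/DataKek code): with num_workers=0 the dataset stays in the main process and is never pickled, so even a lambda transform works; with num_workers > 0 on Windows the same dataset must be pickled for each spawned worker and fails.

import torch
from torch.utils.data import DataLoader, Dataset


class LambdaDataset(Dataset):
    def __init__(self, transform):
        self.transform = transform

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return self.transform(torch.tensor([float(idx)]))


ds = LambdaDataset(transform=lambda x: x * 2)  # lambda stored on the dataset

# Works: no worker processes, nothing gets pickled.
for batch in DataLoader(ds, batch_size=2, num_workers=0):
    print(batch)

# On Windows this raises "Can't pickle local object ... <lambda>" instead,
# because each spawned worker needs a pickled copy of the dataset:
# for batch in DataLoader(ds, batch_size=2, num_workers=2):
#     print(batch)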

@belskikh
Owner

Yeah, it's a Windows bug; I cannot reproduce it.

@rfan-debug

It seems that macOS has the same issue. Here is my runtime config:

Apple M2 Max
Python 3.10.9
torch==1.13.1
