Question about torchvision.io.decode_image #4325

lxy443626128 · 2021-08-27T09:14:43Z

When we use torchvision.io.decode_image(img, device=local_rank) to train with DDP, we find that num_workers>0 doesn't work:

RuntimeError: DataLoader worker (pid 58353) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

fmassa · 2021-08-27T09:31:24Z

Hi,

Thanks for the report. Does the code work as expected with num_workers=0? My suspicion is that one of your images is corrupted and is causing trouble with decode_image.
If we can get some more details on the issue, we can try to see what the problem is.

Also, note that we have fixed some bugs in image reading recently, see #3948 #4101 and #4268

lxy443626128 · 2021-08-27T10:06:56Z

Yes, it only works when num_workers=0.
Here is my dataset code:

import torchvision
from torch.utils import data
from torchvision import transforms


class DataSetGPU(data.Dataset):
    def __init__(self, filePathLable, device_id):
        # Each line of the label file is "imgpath label"
        self.list_file = self.read_file(filePathLable)
        self.cuda = device_id
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop((224, 224), scale=(0.5, 1.0), ratio=(3 / 4.0, 4 / 3.0), interpolation=2),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ])

    def __getitem__(self, idx):
        photopath_label = self.list_file[idx]
        path_label_list = photopath_label.split(' ')
        photopath = path_label_list[0]
        photolabel = path_label_list[1]
        # Raw file bytes as a 1-D uint8 CPU tensor
        img_tensor = torchvision.io.read_file(photopath)
        try:
            # Decode JPEGs directly on the GPU
            img = torchvision.io.decode_jpeg(img_tensor, device=self.cuda).float()
        except Exception:
            # Otherwise decode on the CPU and move the result to the GPU
            img = torchvision.io.decode_image(img_tensor).float().cuda()
        img = self.transform(img)
        label = int(photolabel)
        return img, label

    def __len__(self):
        return len(self.list_file)

    def read_file(self, filename):
        photo_label_list = []
        with open(filename, 'r') as f:
            for line in f.readlines():
                photoPath_label = line.strip()
                photo_label_list.append(photoPath_label)
        return photo_label_list

fmassa · 2021-08-27T12:04:08Z

Oh, the issue happens when doing GPU decoding only?

I believe this might be expected, since GPU computation inside multiprocessing workers doesn't work well, regardless of whether it's image decoding or something else.

For decoding on the GPU, we might need different tooling at the dataset level to get this working. We are starting to explore this, but it won't be available soon.

fmassa added the "awaiting response", "module: io" and "needs reproduction" labels · 2021-08-27
