Question about torchvision.io.decode_image #4325

Open
lxy443626128 opened this issue Aug 27, 2021 · 3 comments

Comments

@lxy443626128

When we use torchvision.io.decode_image(img, device=local_rank) to train with DDP, we find that num_workers>0 doesn't work:

RuntimeError: DataLoader worker (pid 58353) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

@fmassa
Member

fmassa commented Aug 27, 2021

Hi,

Thanks for the report. Does the code work as expected with num_workers=0? My suspicion is that one of your images is corrupted and is causing trouble with decode_image.
If you can share some more details on the issue, we can try to figure out what the problem is.

Also, note that we have recently fixed some bugs in image reading; see #3948, #4101, and #4268.

@lxy443626128
Author

lxy443626128 commented Aug 27, 2021

Yes, it only works when num_workers=0.
Here is my dataset code.

import torchvision
from torch.utils import data
from torchvision import transforms


class DataSetGPU(data.Dataset):
    def __init__(self, filePathLable, device_id):
        # Each line of the list file has the form "<image path> <label>".
        self.list_file = self.read_file(filePathLable)
        self.cuda = device_id
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(
                (224, 224), scale=(0.5, 1.0), ratio=(3 / 4.0, 4 / 3.0),
                interpolation=transforms.InterpolationMode.BILINEAR),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])])

    def __getitem__(self, idx):
        photopath_label = self.list_file[idx]
        path_label_list = photopath_label.split(' ')
        photopath = path_label_list[0]
        photolabel = path_label_list[1]
        # Read the encoded bytes from disk (CPU tensor).
        img_tensor = torchvision.io.read_file(photopath)
        try:
            # Decode JPEGs directly on the GPU.
            img = torchvision.io.decode_jpeg(img_tensor, device=self.cuda).float()
        except RuntimeError:
            # Fall back to CPU decoding (e.g. non-JPEG files), then move to the GPU.
            img = torchvision.io.decode_image(img_tensor).float().cuda()
        img = self.transform(img)
        label = int(photolabel)
        return img, label

    def __len__(self):
        return len(self.list_file)

    def read_file(self, filename):
        photo_label_list = []
        with open(filename, 'r') as f:
            for line in f:
                photo_label_list.append(line.strip())
        return photo_label_list

@fmassa
Member

fmassa commented Aug 27, 2021

Oh, so the issue only happens when doing GPU decoding?

I believe this might be expected: GPU computation inside multiprocessing workers doesn't work well, regardless of whether it is image decoding or something else.

For decoding on the GPU, we might need a different set of tooling at the dataset level to get this working. We are starting to explore this, but it won't be available soon.
