
3d_lowres Inference RuntimeError: Some background workers are no longer alive #2182

Open
chenney0830 opened this issue May 14, 2024 · 2 comments

chenney0830 commented May 14, 2024

Hello,

I am currently running into the error below, which occurs while the validation results are being generated after training 3d_lowres. No files are created in either the 'val' folder or the 'predict_from_next_stage' folder. I also hit the same error when running inference with 3d_lowres. According to my resource monitor the system memory runs out, yet training and inference with 3d_fullres run smoothly without any issues.

Could you suggest how we might resolve this problem?

Thank you!

2024-05-14 23:55:29.948744: predicting 0001
2024-05-14 23:55:33.011472: predicting 0002
2024-05-14 23:55:35.686156: predicting 0003
2024-05-14 23:55:37.476846: predicting 0004
Traceback (most recent call last):
File "/home/chenney/anaconda3/envs/nnUNet/bin/nnUNetv2_train", line 8, in
sys.exit(run_training_entry())
File "/home/chenney/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/home/chenney/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/home/chenney/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1168, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/home/chenney/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
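
For context, the function named in the traceback, check_workers_alive_and_busy, polls the segmentation-export pool and raises as soon as one of its worker processes has died, which typically happens when the OS OOM-killer terminates a worker once RAM is exhausted. Below is a simplified, illustrative sketch of that kind of check (not the actual nnUNet source):

# Simplified sketch of a worker liveness/busy check, illustrative only.
import multiprocessing

def workers_alive_and_busy(worker_list, results, allowed_num_queued=2):
    """Raise if any background worker died; return True while the pool is saturated."""
    if not all(w.is_alive() for w in worker_list):
        # A worker process has exited, e.g. it was killed by the kernel's OOM killer.
        raise RuntimeError('Some background workers are no longer alive')
    not_ready = [r for r in results if not r.ready()]
    # More pending jobs than workers plus the allowed queue length: caller should wait.
    return len(not_ready) >= len(worker_list) + allowed_num_queued

if __name__ == '__main__':
    pool = multiprocessing.get_context('spawn').Pool(2)
    workers = list(pool._pool)  # the pool's worker Process objects (private attribute)
    results = [pool.apply_async(pow, (2, 10)) for _ in range(4)]
    print(workers_alive_and_busy(workers, results))
    pool.close()
    pool.join()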

@YUjh0729

Hello,
I also encountered the same problem. After training completed, the error occurred during the automatic self-validation step, and the same issue arises when I use the nnUNetv2_predict command to evaluate the model. Did you manage to solve it? My error message is as follows:

2024-06-11 01:55:16.900306: predicting FLARE22_046
resizing data, order is 1
data shape (14, 113, 628, 628)
2024-06-11 01:55:26.419984: predicting FLARE22_047
resizing data, order is 1
data shape (14, 93, 465, 465)
------------------ If the number of not-yet-ready results is greater than the number of available worker processes plus the number allowed to queue, return True; otherwise raise RuntimeError -------------------
Traceback (most recent call last):
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
with self._listener.accept() as conn:
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 466, in accept
answer_challenge(c, self._authkey)
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/yjh/.conda/envs/umamba/bin/nnUNetv2_train", line 33, in
sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1154, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/utilities/file_path_utilities.py", line 104, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
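
If the workers are being killed because system RAM fills up, one workaround worth trying (this is my assumption, not an official fix) is to re-run inference with fewer background workers so that peak memory stays lower. As far as I recall, nnUNetv2_predict accepts -npp and -nps to control the number of preprocessing and segmentation-export processes; please double-check the flag names against nnUNetv2_predict -h for your version. A sketch with placeholder paths:

# Hedged sketch: re-run prediction with fewer worker processes to lower peak RAM.
# The input/output paths and dataset id are placeholders; verify -npp/-nps against
# `nnUNetv2_predict -h` for your nnUNet version.
import subprocess

subprocess.run(
    [
        "nnUNetv2_predict",
        "-i", "/path/to/imagesTs",   # placeholder: folder with test images
        "-o", "/path/to/output",     # placeholder: output folder
        "-d", "DATASET_ID",          # placeholder: dataset name or id
        "-c", "3d_lowres",
        "-npp", "1",                 # fewer preprocessing workers
        "-nps", "1",                 # fewer segmentation-export workers
    ],
    check=True,
)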

@YUjh0729

Moreover, when I train the model on another machine with a 4080 GPU in the same environment (the current machine uses a 3090), I sometimes run into what appears to be a deadlock: CPU and GPU memory are both occupied, but utilization sits at around 1%, and training gets stuck at a certain epoch and cannot progress. On the 3090 machine training itself works fine, but during validation I hit the error "Some background workers are no longer alive." I checked the nnUNet issues and concluded that the CPU's RAM is full, but there is no useful solution available. It's frustrating!
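
To confirm whether RAM exhaustion is really what kills the workers, a small monitor can be run alongside training or validation. This is a hypothetical helper, not part of nnUNet, and assumes psutil is installed:

# Hypothetical diagnostic helper (not part of nnUNet): print available system RAM
# every few seconds while validation/inference runs. Requires `pip install psutil`.
import time
import psutil

def log_memory(interval_s: float = 5.0) -> None:
    while True:
        vm = psutil.virtual_memory()
        print(f"available RAM: {vm.available / 1024 ** 3:.1f} GiB "
              f"({vm.percent:.0f}% used)", flush=True)
        time.sleep(interval_s)

if __name__ == '__main__':
    log_memory()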
