
3d_lowres Inference RuntimeError: Some background workers are no longer alive #2182

Open
chenney0830 opened this issue May 14, 2024 · 2 comments

chenney0830 commented May 14, 2024

Hello,

I am currently running into the error below, which occurs while the validation results are being generated after training 3d_lowres. No files are created in either the 'val' folder or the 'predict_from_next_stage' folder. I also hit the same error when running inference with 3d_lowres. According to my resource monitor the system memory runs out, yet training and inference with 3d_fullres run smoothly without any issues.

Could you suggest how we might resolve this problem?

Thank you!

2024-05-14 23:55:29.948744: predicting 0001
2024-05-14 23:55:33.011472: predicting 0002
2024-05-14 23:55:35.686156: predicting 0003
2024-05-14 23:55:37.476846: predicting 0004
Traceback (most recent call last):
File "/home/chenney/anaconda3/envs/nnUNet/bin/nnUNetv2_train", line 8, in
sys.exit(run_training_entry())
File "/home/chenney/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/home/chenney/nnUNet/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/home/chenney/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1168, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/home/chenney/nnUNet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
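
For context, the function named in the traceback, check_workers_alive_and_busy, polls the segmentation-export pool and raises as soon as one of its worker processes has died, which typically happens when the OS OOM-killer terminates a worker once RAM is exhausted. Below is a simplified, illustrative sketch of that kind of check (not the actual nnUNet source):

# Simplified sketch of a worker liveness/busy check, illustrative only.
import multiprocessing

def workers_alive_and_busy(worker_list, results, allowed_num_queued=2):
    """Raise if any background worker died; return True while the pool is saturated."""
    if not all(w.is_alive() for w in worker_list):
        # A worker process has exited, e.g. it was killed by the kernel's OOM killer.
        raise RuntimeError('Some background workers are no longer alive')
    not_ready = [r for r in results if not r.ready()]
    # More pending jobs than workers plus the allowed queue length: caller should wait.
    return len(not_ready) >= len(worker_list) + allowed_num_queued

if __name__ == '__main__':
    pool = multiprocessing.get_context('spawn').Pool(2)
    workers = list(pool._pool)  # the pool's worker Process objects (private attribute)
    results = [pool.apply_async(pow, (2, 10)) for _ in range(4)]
    print(workers_alive_and_busy(workers, results))
    pool.close()
    pool.join()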

@YUjh0729

Hello,
I also encountered the same problem. After training completed, the error occurred during the automatic self-validation step, and the same issue arises when I use the nnUNetv2_predict command to evaluate the model. Did you manage to solve it? My error message is as follows:

2024-06-11 01:55:16.900306: predicting FLARE22_046
resizing data, order is 1
data shape (14, 113, 628, 628)
2024-06-11 01:55:26.419984: predicting FLARE22_047
resizing data, order is 1
data shape (14, 93, 465, 465)
------------------ If the number of not-yet-ready results is greater than the number of available worker processes plus the number allowed to queue, return True; otherwise raise RuntimeError -------------------
Traceback (most recent call last):
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/resource_sharer.py", line 138, in _serve
with self._listener.accept() as conn:
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 466, in accept
answer_challenge(c, self._authkey)
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 757, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/home/yjh/.conda/envs/umamba/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/yjh/.conda/envs/umamba/bin/nnUNetv2_train", line 33, in
sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/run/run_training.py", line 208, in run_training
nnunet_trainer.perform_actual_validation(export_validation_probabilities)
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1154, in perform_actual_validation
proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
File "/mnt/e/yjh/project/U-Mamba-main/umamba/nnunetv2/utilities/file_path_utilities.py", line 104, in check_workers_alive_and_busy
raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive
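
If the workers are being killed because system RAM fills up, one workaround worth trying (this is my assumption, not an official fix) is to re-run inference with fewer background workers so that peak memory stays lower. As far as I recall, nnUNetv2_predict accepts -npp and -nps to control the number of preprocessing and segmentation-export processes; please double-check the flag names against nnUNetv2_predict -h for your version. A sketch with placeholder paths:

# Hedged sketch: re-run prediction with fewer worker processes to lower peak RAM.
# The input/output paths and dataset id are placeholders; verify -npp/-nps against
# `nnUNetv2_predict -h` for your nnUNet version.
import subprocess

subprocess.run(
    [
        "nnUNetv2_predict",
        "-i", "/path/to/imagesTs",   # placeholder: folder with test images
        "-o", "/path/to/output",     # placeholder: output folder
        "-d", "DATASET_ID",          # placeholder: dataset name or id
        "-c", "3d_lowres",
        "-npp", "1",                 # fewer preprocessing workers
        "-nps", "1",                 # fewer segmentation-export workers
    ],
    check=True,
)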

@YUjh0729

Moreover, when I train the model on another machine with a 4080 GPU in the same environment (the current machine uses a 3090), I sometimes run into what appears to be a deadlock: CPU and GPU memory are both occupied, but utilization sits at around 1%, and training gets stuck at a certain epoch and cannot progress. On the 3090 machine training itself works fine, but during validation I hit the error "Some background workers are no longer alive." I checked the nnUNet issues and concluded that the CPU's RAM is full, but there is no useful solution available. It's frustrating!
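
To confirm whether RAM exhaustion is really what kills the workers, a small monitor can be run alongside training or validation. This is a hypothetical helper, not part of nnUNet, and assumes psutil is installed:

# Hypothetical diagnostic helper (not part of nnUNet): print available system RAM
# every few seconds while validation/inference runs. Requires `pip install psutil`.
import time
import psutil

def log_memory(interval_s: float = 5.0) -> None:
    while True:
        vm = psutil.virtual_memory()
        print(f"available RAM: {vm.available / 1024 ** 3:.1f} GiB "
              f"({vm.percent:.0f}% used)", flush=True)
        time.sleep(interval_s)

if __name__ == '__main__':
    log_memory()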
