Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional Torch Multiprocessing in nnUNet for Improved Security and Compatibility #2556

Open
LennyN95 opened this issue Oct 16, 2024 · 8 comments · May be fixed by #2614
Open

Optional Torch Multiprocessing in nnUNet for Improved Security and Compatibility #2556

LennyN95 opened this issue Oct 16, 2024 · 8 comments · May be fixed by #2614
Assignees

Comments

@LennyN95
Copy link

Dear nnUNet team,

We are currently facing challenges when running nnUNet in Docker containers.
The requirement to use flags like --ipc=host or --shm-size for torch.multiprocessing (as reported and discussed in previous issues) makes it difficult to deploy models in our mhub.ai platform. We hesitate to suggest the use of --ipc=host, which is a simple solution but removes security restrictions and should therefore be used with caution. On the other hand, manually specifying the shm-size means an additional burden and makes the inference even more complicated from MHub's point of view.

We currently have some contributions to our portfolio on hold to discuss this topic. In our particular case, the inference is executed sequentially and therefore does not require multiprocessing per se. We propose to make the use of torch.multiprocessing optional during inference.

We welcome any comments and an open discussion on this topic.

Thank you very much!
Leo.

@surajpaib
Copy link

Hi Team,

I would also like to address this issue but from the perspective of user debugging. I think having an option to set num_workers=0 in the inference dataloader would allow users to better debug error traces that occur.

I see a few issues that would probably benefit from this as well:
#2509
#2514
#2182

Looking forward to hear what you think

@Lars-Kraemer

@TaWald
Copy link
Member

TaWald commented Nov 5, 2024

Sounds like a reasonable proposal.
Currently, time is short on our side so if one of you, @surajpaib or @LennyN95, would be willing to create a PR with the feature in question we can certainly integrate this.

@FabianIsensee
Copy link
Member

Setting num_workers=0 will probably be more work than just offering a new member function of nnUNetPredictor that covers this. This function would then not be supported via command line but if you are packing nnU-Net in a Docker you might not need that anyways. Would this be a sensible solution to your problem?
Best,
Fabian

@PereGinebra
Copy link

I've come along similar issues when trying to deploy nnUNet models in docker containers, and although my multiprocessing issues might come from somewhere else, I also agree that having a sequential predictor might simplify a lot of things, especially when you don't need to do inference in big batches.

In my use-case, a quick fix that didn't require messing with your code is to simply use the predict_single_npy_array function while looping through my input arrays. This isn't a big issue for me as I already do all the dataloading manually, but might be a bit more cumbersome for other applications.

@FabianIsensee
Copy link
Member

There is now a predict_from_files_sequential function in nnUNetPredictor that should do what you need :-)

@LennyN95
Copy link
Author

LennyN95 commented Dec 9, 2024

@FabianIsensee Thanks Fabian! Did you have a look into the PR we opened and my comment on reproducibility? Is your predict_from_files_sequential used automatically when workers are set to 0?

@LennyN95
Copy link
Author

@FabianIsensee Any updates?

@LennyN95
Copy link
Author

@FabianIsensee I have noticed some (significant) differences between nnunetv2==2.0 and nnunetv2==2.5.1. This is beyond the scope of this topic / PR, but since some of our contributors have used this version, we'd like to see if and to what extent we can offer them a solution as well. In general, I am curious if you have any idea what changes might have led to these differences (we are talking about a Dice score of 0.9451 between these versions for a lung nodule segmentation task).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants