Too much time and ram while saving the inference results #2014

Closed
iskenderkahramanoglu opened this issue Mar 14, 2024 · 21 comments

@iskenderkahramanoglu

Hello!

I have trained two models at full resolution.
The first model has 32 classes and the second has 5 classes.
When I run inference on test data, it sometimes takes far too long.

Cases:
512x512x246 nifti,
32 class model: 4 seconds for inference, 3 minutes to save results (112 steps in tqdm)
5 class model: 1:29 minutes for inference, 20 seconds to save results (245 steps in tqdm)

801x801x458 nifti,
32 class model: 10:17 minutes for inference, 2:30 minutes to save results (1100 steps in tqdm)
5 class model: 11:12 minutes for inference, 53 minutes to save results (2080 steps in tqdm)

What causes the difference?
The model with fewer classes takes more time.
How can I calculate the number of tqdm steps for each model and NIfTI?
Is there any way to get the results in a format other than NIfTI, for example JSON?
Saving on the CPU is too slow and uses too much RAM.
Sometimes it uses all the RAM and the system crashes.
I have 220 GB of RAM, a Tesla V100, and a 46-thread processor,
but saving the results uses only one thread.

What can I do to reduce the inference time?

bash_screen

@iskenderkahramanoglu
Author

Today I tried to run inference on a NIfTI of size 1000x1000x1000 with (0.2, 0.2, 0.2) pixel spacing.
The 32-class model finishes inference in 16 seconds, but saving the result as NIfTI uses too much RAM and crashes my system.
When I resample the NIfTI, the new size is 500x500x500, the new spacing is (0.4, 0.4, 0.4), and the inference time is the same.
Previously, when I resampled a NIfTI and increased the pixel spacing, the time went down.
The results are accurate, but with these inference times they are impossible to use in practice.
What should I do? Would a better graphics card, more RAM, etc. help me reduce these times?

@iskenderkahramanoglu
Author

Hi @FabianIsensee , do you have any idea about this?

@ancestor-mithril
Contributor

You can try splitting the bigger volume into multiple smaller patches, for example 9, 16 or 25 of them. Then you run inference on each patch separately and finally aggregate the results into a segmentation of the full volume.
In this case, make sure the patches overlap. Otherwise, the segmentation near the patch margins might not be very accurate.

@iskenderkahramanoglu
Author

There is no set standard for the volumes to be tested. There are also multiple models predicting different parts of the volume. Therefore, overlapping the patches will be very difficult, and distortion at the margins will be inevitable. Despite this, do you recommend splitting the volume into small patches, or would you have a different suggestion?

@ancestor-mithril
Contributor

Overlap between patches is not difficult, nnUNet already does this. You just need to patchify with overlap once more in order to reduce RAM usage and to speed up the inference.

@iskenderkahramanoglu
Author

Thanks for the reply!

Do you mean physically splitting the test NIfTI file into 9 (or 16 or 25) pieces, or is there a simple way to do this in nnUNet? Can I split it into smaller patches by changing the "patch_size" parameter in the json?

What determines the testing and saving time of a NIfTI? Why does a NIfTI of size 801x801x458 with (0.4, 0.4, 0.4) pixel spacing take longer to test and save than a NIfTI of size 1000x1000x1000 with (0.2, 0.2, 0.2) pixel spacing?

@ancestor-mithril
Contributor

I suggest physically splitting the test NIfTI file (you can use the patchly library).
You can't change the "patch_size" parameter in the json unless you want to retrain the model. Each model is trained with a specific "patch_size" and "spacing".

What determines the testing and recording time of a nifti?

nnUNet does 3 things for inference:

  • preprocessing (cropping + normalization + resampling)
  • sliding window model inference
  • postprocessing (resampling and exporting)

Why does a nifti of size 801x801x458 with (0.4, 0.4, 0.4) pixel spacing take longer to test and record than a nifti of size 1000x1000x1000 with (0.2, 0.2, 0.2) pixel spacing?

It depends on the target spacing the nnUNet model was trained with, because each case is resampled to that spacing. Alternatively, the 1000x1000x1000 case may have been cropped to a smaller size because it is zero at the margins.
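
To make the dependence on target spacing concrete, here is a rough sketch of how the number of sliding-window positions (which, as far as I understand, is what the tqdm bar counts) could be estimated. The formula follows my understanding of nnUNet's sliding-window logic and is not taken from this thread; the patch size and target spacing below are hypothetical, so read the real ones from your model's plans file.

```python
import math


def resampled_shape(shape, spacing, target_spacing):
    """Shape after resampling from `spacing` to `target_spacing` (same axis order)."""
    return tuple(
        int(round(n * s / t)) for n, s, t in zip(shape, spacing, target_spacing)
    )


def tiles_per_axis(size, patch, tile_step_size=0.5):
    """Number of sliding-window positions along one axis."""
    size = max(size, patch)  # assumption: smaller volumes are padded up to the patch
    return math.ceil((size - patch) / (patch * tile_step_size)) + 1


def total_tiles(shape, patch_size, tile_step_size=0.5):
    """Total number of patches predicted for one (already resampled) volume."""
    return math.prod(
        tiles_per_axis(n, p, tile_step_size) for n, p in zip(shape, patch_size)
    )


# Hypothetical model settings, NOT the ones used in this issue.
patch_size = (128, 128, 128)
target_spacing = (0.4, 0.4, 0.4)

shape = resampled_shape((1000, 1000, 1000), (0.2, 0.2, 0.2), target_spacing)
print(shape, total_tiles(shape, patch_size))
```

Two models with different patch sizes or target spacings can therefore show very different step counts on the same file, independent of how many classes they predict.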

@iskenderkahramanoglu
Author

OK, I will try to split a file.
Thank you very much.

@iskenderkahramanoglu
Author

I looked at the patchly library, but I didn't understand it.
In nnUNet, the prediction is saved as a NIfTI file.
If I understand correctly, patchly splits the image into virtual patches, runs the prediction, then merges the results.
How can I use the patchly library with nnUNet? Is there an example?

@ancestor-mithril
Contributor

You split the images into patches and then save them as nifti files. After predicting on all the patches, you aggregate the segmentation results into full-sized images.
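
A minimal sketch of the index bookkeeping for such a split, in plain Python rather than the patchly API. The function names, the overlap of 32 voxels and the 3x3x3 grid are illustrative assumptions, not something nnUNet or patchly prescribes.

```python
from itertools import product


def split_axis(length, n_chunks, overlap):
    """Return (read_start, read_stop, keep_start, keep_stop) per chunk on one axis.

    read_* is the range cut out of the full volume (core plus overlap margin);
    keep_* is the core range, in full-volume coordinates, that is copied back
    into the final segmentation, so the margins are discarded.
    """
    bounds = [i * length // n_chunks for i in range(n_chunks + 1)]
    chunks = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        chunks.append((max(0, lo - overlap), min(length, hi + overlap), lo, hi))
    return chunks


def split_volume(shape, n_chunks, overlap):
    """All chunks of a volume as tuples of per-axis index 4-tuples."""
    per_axis = [split_axis(n, c, overlap) for n, c in zip(shape, n_chunks)]
    return list(product(*per_axis))


# Hypothetical example: a 1000^3 volume cut into 3 x 3 x 3 overlapping chunks.
chunks = split_volume((1000, 1000, 1000), (3, 3, 3), overlap=32)
print(len(chunks))   # number of patch files to write
print(chunks[0][0])  # index ranges of the first chunk along the first axis
```

Each chunk would be saved as its own NIfTI from the `read_*` ranges and predicted. Along each axis, the slice `[keep_start - read_start : keep_stop - read_start]` of the predicted patch is then written into `[keep_start : keep_stop]` of the full-size segmentation. Keeping the affine of each patch file consistent with its offset is left out of this sketch.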

@iskenderkahramanoglu
Author

iskenderkahramanoglu commented Mar 22, 2024

I tried splitting the NIfTI file into 27 patches and running inference on them.
Inference takes under 1 second per patch, but this also uses all 200 GB of RAM and the system crashes.
I have another question: if I convert the label files (and maybe the image files too) from float64 to uint8, will inference time and RAM usage decrease?
Would this reduce training quality?

@ancestor-mithril
Contributor

To reduce RAM usage you can decrease the number of processes used for preprocessing and segmentation export (see nnUNetv2_predict -h). You should use only 1 process for preprocessing and 1 for segmentation.
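
As a sketch of what that could look like, the helper below assembles an `nnUNetv2_predict` call with one preprocessing process (`-npp 1`) and one export process (`-nps 1`). The helper itself, the folder names and the dataset name are hypothetical placeholders; check `nnUNetv2_predict -h` for the flags available in your installed version.

```python
import shlex


def build_predict_command(input_dir, output_dir, dataset, configuration,
                          n_preprocess=1, n_export=1, disable_tta=False):
    """Assemble an nnUNetv2_predict call as an argument list (hypothetical helper)."""
    cmd = [
        "nnUNetv2_predict",
        "-i", input_dir,
        "-o", output_dir,
        "-d", dataset,
        "-c", configuration,
        "-npp", str(n_preprocess),  # processes used for preprocessing
        "-nps", str(n_export),      # processes used for segmentation export
    ]
    if disable_tta:
        cmd.append("--disable_tta")  # skip mirroring to speed up GPU inference
    return cmd


# Placeholder paths and dataset name, not taken from this issue.
cmd = build_predict_command("imagesTs", "predictions",
                            "Dataset001_Example", "3d_fullres")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`. Fewer export processes means fewer large arrays held in RAM at once, at the price of a longer export.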

@iskenderkahramanoglu
Author

Setting the number of processes to 1 reduced RAM usage, and the system no longer crashes.
But there are 27 patch files, and prediction pauses for a while after each file,
so the total time is 11 minutes.
I also tried running inference on the full NIfTI file with the number of processes set to 1.
RAM usage dropped again, but it took 12 minutes.
This prediction time is not acceptable for me.

@x1y9

x1y9 commented Apr 1, 2024

I have the same issue: the export time is about 35 s while the GPU inference time is only about 8 s. After disabling TTA, the GPU inference time drops to 1 s, but the export time is still 35 s.

So the performance bottleneck is the exporting.
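
One plausible contributor, offered as an assumption rather than a confirmed diagnosis: as far as I understand the export step, it resamples an array with one channel per class back to the original image size on the CPU before reducing it to a label map, so its memory footprint grows with classes times voxels. A back-of-the-envelope estimate, using the class counts and the largest volume mentioned in this thread and assumed data types:

```python
def array_gib(num_classes, shape, bytes_per_value):
    """Size in GiB of a dense array with one channel per class."""
    voxels = 1
    for n in shape:
        voxels *= n
    return num_classes * voxels * bytes_per_value / 1024**3


shape = (1000, 1000, 1000)  # largest volume mentioned in this thread

for classes in (5, 32):
    # Assumption: values are held as float16 or float32 at some point.
    for name, nbytes in (("float16", 2), ("float32", 4)):
        size = array_gib(classes, shape, nbytes)
        print(f"{classes:>2} classes, {name}: {size:6.1f} GiB")

# For comparison: the final label map with one byte per voxel.
print(f"uint8 label map: {array_gib(1, shape, 1):.2f} GiB")
```

Every additional export process may hold its own copy of such an array, which would be consistent with RAM running out when several workers are used. It does not explain every timing reported above.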

@mrokuss
Contributor

mrokuss commented Apr 23, 2024

Hey @iskenderkahramanoglu

It indeed seems like your problem is the segmentation export. Increasing the number of workers makes the export faster, but you then risk running out of RAM. Large 3D volumes are always tricky to work with. Regarding your issue with the overlapping patches, the nnUNetPredictor takes the following default arguments:

nnUNetPredictor(tile_step_size: float = 0.5,
                use_gaussian: bool = True,
                use_mirroring: bool = True,
                perform_everything_on_device: bool = True,
                device: torch.device = torch.device('cuda'),
                verbose: bool = False,
                verbose_preprocessing: bool = False,
                allow_tqdm: bool = True)

Here you can set tile_step_size to a value between 0.5 and 1 to control how much the patches overlap: a larger step means less overlap and fewer patches to predict, so inference is faster. More overlap usually gives better segmentation quality, though. If you set use_mirroring = False (disabling test-time augmentation), your inference will be much faster, again at the cost of segmentation quality.
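
To get a feeling for the trade-off, here is a small estimate of how many network forward passes one volume needs under different settings. It assumes the usual sliding-window tiling and mirroring along all three spatial axes (2^3 = 8 passes per patch); the volume and patch sizes are hypothetical.

```python
import math


def forward_passes(shape, patch_size, tile_step_size=0.5, use_mirroring=True):
    """Estimated forward passes for one volume (illustrative helper)."""
    tiles = 1
    for n, p in zip(shape, patch_size):
        n = max(n, p)  # assumption: small volumes are padded to the patch size
        tiles *= math.ceil((n - p) / (p * tile_step_size)) + 1
    # Assumption: mirroring covers every combination of flips over all axes.
    mirror_factor = 2 ** len(shape) if use_mirroring else 1
    return tiles * mirror_factor


shape = (500, 500, 500)       # hypothetical resampled volume
patch_size = (128, 128, 128)  # hypothetical patch size

for step, mirror in ((0.5, True), (0.5, False), (1.0, False)):
    n = forward_passes(shape, patch_size, step, mirror)
    print(f"tile_step_size={step}, use_mirroring={mirror}: {n} forward passes")
```

Both settings only affect the GPU part of the prediction; the export time reported earlier in this thread is not changed by them.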

@mrokuss
Contributor

mrokuss commented May 28, 2024

Closing. Feel free to reopen if you still have questions!

@mrokuss mrokuss closed this as completed May 28, 2024
@YUjh0729

YUjh0729 commented Aug 4, 2024

I encountered a very strange issue when using the nnUNetv2_predict command. The program hangs and never writes the prediction results.
These are the results I got on the cloud server:
`Predicting FLARE22_010:
perform_everything_on_device: True
0%| | 0/360 [00:00<?, ?it/s]resizing data, order is 3
data shape (1, 227, 512, 512)
11%|█▍ | 38/360 [00:05<00:48, 6.65it/s]resizing segmentation, order is 1 order z is 0
data shape (1, 227, 512, 512)
100%|██████████| 360/360 [00:54<00:00, 6.60it/s]
sending off prediction to background worker for resampling and export
done with FLARE22_010

Predicting FLARE22_011:
perform_everything_on_device: True
38%|███▊ | 23/60 [00:03<00:05, 6.61it/s]resizing data, order is 1
data shape (14, 250, 628, 628)
100%|██████████| 60/60 [00:08<00:00, 6.74it/s]
sending off prediction to background worker for resampling and export
done with FLARE22_011
resizing data, order is 1
data shape (14, 109, 430, 430)`
Screenshot 2024-08-04 123235


And these are the results from my local test. The output is similar, but there are these two additional lines of output.
Both environments are identical: torch 2.0.1, CUDA 11.8, Python 3.10.
perform_everything_on_device: True Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
WechatIMG844

@mrokuss
Contributor

mrokuss commented Aug 4, 2024

Hey @YUjh0729

This is hard to judge from afar, but my first guess would be that your local GPU does not have sufficient VRAM and fails for that particular FLARE case. nnUNet always tries to perform as many operations as possible on the GPU (storing the whole image there instead of just the patches) in order to increase speed. If this fails, it falls back to using the GPU only for the individual patches and keeps the image on the CPU, which takes longer. This is also when you get that message. You can set perform_everything_on_device=False in the predictor to go with the second option right away.

Best, Max

@YUjh0729

YUjh0729 commented Aug 4, 2024

Hi @mrokuss
Thank you very much for your response. I tried setting it to False, but it is essentially the same issue. The data is stuck on the CPU and, after processing, it is not exported to the output folder. The program hangs and cannot proceed further.
Snipaste_2024-08-04_20-26-46

@pooya-mohammadi

@YUjh0729 check my implementation https://github.com/pooya-mohammadi/nnUNet
I also created a pull request #2545

@YUjh0729

@pooya-mohammadi
Hi, cool! It works. Thank you.
