Introduce segstats.py script for fast partial volume computation #239
Conversation
add code for gpu/torch implementation (currently seems to require too much gpu memory)
Benchmarking on landau:

**FreeSurfer 7.3.2 `mri_segstats`**
1 loop, best of 5: 251 sec per loop (251 sec = 4 min 11 s)
Basically only fully loads a single core/thread -- theoretically, it should be parallelized through OpenMP, but I cannot find the associated flags. Assuming fully parallel execution, the best case would be around 21 sec.

**FastSurfer 2.0.1 `segstats.py` on CPU**
1 loop, best of 5: 4.06 sec per loop
Loads about 11 of 12 threads (plus overhead from other processes). Theoretically, there might be a very small advantage for the Python code here, because I am not starting a new process.

On GPU, I am still facing memory constraints, so it remains unclear whether that is feasible. Also, the workload seems to be much smaller because of the sparsity of PV-affected voxels. The solution would probably be to "patchify" the code similarly to the CPU code, but then even more of the GPU's potential speed advantage is lost to synchronization and overhead.

Note: numpy might grab additional cores to accelerate the workload regardless of the threads flag.
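Timings in the "1 loop, best of 5" form above can be reproduced with Python's built-in `timeit` module; a minimal sketch (the `workload` function here is a placeholder, not the actual `segstats.py` or `mri_segstats` invocation):

```python
import timeit

def workload():
    # placeholder for the call being benchmarked, e.g. running
    # segstats.py or mri_segstats on a test segmentation/image pair
    sum(i * i for i in range(100_000))

# mirrors the `python -m timeit` output style: 5 repeats of 1 loop each,
# reporting the best (minimum) time
times = timeit.repeat(workload, number=1, repeat=5)
print(f"1 loop, best of 5: {min(times):.3g} sec per loop")
```

Taking the minimum over repeats is the usual convention, since it is the least affected by interference from other processes.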
e710281 to bf7cb59
I removed all non-core code so we do not have to maintain multiple versions. Specifically, even the numba-accelerated legacy algorithm was too slow, and torch requires very high memory allocation; moreover, even the overhead of torch alone was already slower than the CPU implementation. Therefore, for now there will not be a dedicated torch implementation.

I also added a `--threads` flag to specify the number of threads. This seems to fairly reliably load one CPU thread. However, numpy has inherent parallelism through MKL and OpenMP: https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy

Performance: even on one thread, the script is still pretty fast:

This is effectively on 12 threads.
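As the linked Stack Overflow thread explains, numpy's BLAS/OpenMP thread pools are controlled by environment variables that must be set before numpy is imported; a sketch using the standard MKL/OpenBLAS/OpenMP variable names (the helper function itself is hypothetical, not part of `segstats.py`):

```python
import os

def limit_numpy_threads(n: int) -> None:
    """Pin the BLAS/OpenMP thread pools used by numpy to `n` threads.

    Must run before `import numpy`: once the thread pools are
    initialized, changing these variables has no effect.
    """
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
                "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
        os.environ[var] = str(n)

limit_numpy_threads(1)
# only now: import numpy as np
```

This is one way a `--threads` flag could be honored even for numpy's internal parallelism, independent of how many worker threads the script itself spawns.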
The FreeSurfer `mri_segstats` program computes volume statistics from a combination of a segmentation and a bias-field-corrected image. Fundamentally, it assumes local regional intensity consistency.
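The intensity-consistency assumption can be illustrated with a simple linear two-class mixing model for border voxels (a hypothetical sketch for intuition only, not FreeSurfer's or segstats.py's actual implementation):

```python
def pv_fraction(intensity: float, mean_in: float, mean_out: float) -> float:
    """Fraction of a border voxel attributed to the 'inside' class.

    A voxel whose intensity lies between the two class means is modeled
    as a proportional mixture of both classes; the result is clamped
    to [0, 1].
    """
    if mean_in == mean_out:
        return 1.0  # classes indistinguishable by intensity
    frac = (intensity - mean_out) / (mean_in - mean_out)
    return max(0.0, min(1.0, frac))

# a voxel halfway between the class means contributes half a voxel
print(pv_fraction(75.0, mean_in=100.0, mean_out=50.0))  # -> 0.5
```

Only voxels near label boundaries need this treatment, which is why the sparsity of PV-affected voxels matters for the GPU discussion above.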
In this PR, we introduce three alternatives to this script:
a. a pure-Python implementation using python and numpy code throughout, with multiple processes for parallel execution (multiprocessing is extremely inefficient here and should be avoided, but it sidesteps the Python global interpreter lock)
b. a python+numba implementation (the Python code can be properly compiled by LLVM), with multiple threads for parallel execution and numba-based parallelization (this code is about as efficient as the FreeSurfer mri_segstats code, and it also avoids the global interpreter lock)
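The process-based parallelism of option (a) can be sketched with the standard library alone; a toy per-label voxel count over a flattened segmentation (illustrative only, not the actual `segstats.py` code; the "fork" start method assumes a POSIX system):

```python
import multiprocessing as mp
from collections import Counter

def count_labels(chunk):
    # each worker process counts label occurrences in its chunk;
    # separate processes sidestep the global interpreter lock
    return Counter(chunk)

def parallel_label_counts(seg, n_workers=2):
    # split the flattened segmentation into one chunk per worker
    step = max(1, len(seg) // n_workers)
    chunks = [seg[i:i + step] for i in range(0, len(seg), step)]
    ctx = mp.get_context("fork")  # POSIX-only; avoids re-importing the module
    with ctx.Pool(processes=n_workers) as pool:
        partials = pool.map(count_labels, chunks)
    total = Counter()
    for part in partials:
        total += part  # merge the per-process partial counts
    return total

# toy label volume, flattened: counts are label 0 -> 3, 1 -> 2, 2 -> 3
print(parallel_label_counts([0, 0, 1, 2, 2, 2, 1, 0]))
```

The merge step hints at why this is inefficient: every chunk's result must be serialized, sent back, and combined, which for real image volumes adds substantial inter-process communication on top of the process start-up cost.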
I am currently running some benchmarks to document the runtime of the respective versions.
@m-reuter there is also the question of whether we even want to include and distribute the "legacy algorithms". Including them in one commit and removing them again in a follow-up might be an option, but I do not expect we want to add numba as a dependency (which would really be required for acceptable run times).