Introduce segstats.py script for fast partial volume computation #239
Conversation
add code for gpu/torch implementation (currently seems to require too much gpu memory)
Benchmarking on landau:

**FreeSurfer 7.3.2 `mri_segstats`**
1 loop, best of 5: 251 sec per loop (251 sec = 4 min 11 s)
Basically only fully loads a single core/thread -- theoretically, it should be parallelized through OpenMP, but I cannot find the associated flags. Assuming fully parallel execution, the best case would be around 21 sec.

**FastSurfer 2.0.1 `segstats.py` on CPU**
1 loop, best of 5: 4.06 sec per loop
Loads about 11 of 12 threads (plus overhead from other processes). Theoretically, there might be a very small advantage for the Python code here, because I am not starting a new process.

On GPU, I am still facing memory constraints, so it remains unclear whether that is feasible. Also, the workload seems to be much smaller because of the sparsity of PV-affected voxels. The solution would probably be to "patchify" the code similarly to the CPU code, but then even more of the GPU's potential speed advantage is lost to synchronization and overhead.

Note: numpy might grab additional cores to accelerate the workload regardless of the threads flag.
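Timings in the "1 loop, best of 5" form above can be reproduced with Python's built-in `timeit` module; a minimal sketch (the `workload` function here is a placeholder, not the actual `segstats.py` or `mri_segstats` invocation):

```python
import timeit

def workload():
    # placeholder for the call being benchmarked, e.g. running
    # segstats.py or mri_segstats on a test segmentation/image pair
    sum(i * i for i in range(100_000))

# mirrors the `python -m timeit` output style: 5 repeats of 1 loop each,
# reporting the best (minimum) time
times = timeit.repeat(workload, number=1, repeat=5)
print(f"1 loop, best of 5: {min(times):.3g} sec per loop")
```

Taking the minimum over repeats is the usual convention, since it is the least affected by interference from other processes.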
e710281 to bf7cb59
I removed all non-core code so we do not have to maintain multiple versions. Specifically, even the numba-accelerated legacy algorithm was too slow, and torch requires very high memory allocation; moreover, even the overhead of torch alone was already slower than the CPU implementation. Therefore, for now there will not be a dedicated torch implementation.

I also added a `--threads` flag to specify the number of threads. This seems to fairly reliably load one CPU thread. However, numpy has inherent parallelism through MKL and OpenMP: https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy

Performance: even on one thread, the script is still pretty fast:

This is effectively on 12 threads.
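As the linked Stack Overflow thread explains, numpy's BLAS/OpenMP thread pools are controlled by environment variables that must be set before numpy is imported; a sketch using the standard MKL/OpenBLAS/OpenMP variable names (the helper function itself is hypothetical, not part of `segstats.py`):

```python
import os

def limit_numpy_threads(n: int) -> None:
    """Pin the BLAS/OpenMP thread pools used by numpy to `n` threads.

    Must run before `import numpy`: once the thread pools are
    initialized, changing these variables has no effect.
    """
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
                "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
        os.environ[var] = str(n)

limit_numpy_threads(1)
# only now: import numpy as np
```

This is one way a `--threads` flag could be honored even for numpy's internal parallelism, independent of how many worker threads the script itself spawns.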
The FreeSurfer `mri_segstats` program computes volume statistics from a combination of a segmentation and a bias-field-corrected image. Fundamentally, it assumes local regional intensity consistency.
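The intensity-consistency assumption can be illustrated with a simple linear two-class mixing model for border voxels (a hypothetical sketch for intuition only, not FreeSurfer's or segstats.py's actual implementation):

```python
def pv_fraction(intensity: float, mean_in: float, mean_out: float) -> float:
    """Fraction of a border voxel attributed to the 'inside' class.

    A voxel whose intensity lies between the two class means is modeled
    as a proportional mixture of both classes; the result is clamped
    to [0, 1].
    """
    if mean_in == mean_out:
        return 1.0  # classes indistinguishable by intensity
    frac = (intensity - mean_out) / (mean_in - mean_out)
    return max(0.0, min(1.0, frac))

# a voxel halfway between the class means contributes half a voxel
print(pv_fraction(75.0, mean_in=100.0, mean_out=50.0))  # -> 0.5
```

Only voxels near label boundaries need this treatment, which is why the sparsity of PV-affected voxels matters for the GPU discussion above.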
In this PR, we introduce three alternatives to this script:
a. a pure-Python implementation using python and numpy code throughout, with multiple processes for parallel execution (multiprocessing is extremely inefficient here and should be avoided, but it sidesteps the Python global interpreter lock)
b. a python+numba implementation (the Python code can be properly compiled by LLVM), with multiple threads for parallel execution and numba-based parallelization (this code is about as efficient as the FreeSurfer mri_segstats code, and it also avoids the global interpreter lock)
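The process-based parallelism of option (a) can be sketched with the standard library alone; a toy per-label voxel count over a flattened segmentation (illustrative only, not the actual `segstats.py` code; the "fork" start method assumes a POSIX system):

```python
import multiprocessing as mp
from collections import Counter

def count_labels(chunk):
    # each worker process counts label occurrences in its chunk;
    # separate processes sidestep the global interpreter lock
    return Counter(chunk)

def parallel_label_counts(seg, n_workers=2):
    # split the flattened segmentation into one chunk per worker
    step = max(1, len(seg) // n_workers)
    chunks = [seg[i:i + step] for i in range(0, len(seg), step)]
    ctx = mp.get_context("fork")  # POSIX-only; avoids re-importing the module
    with ctx.Pool(processes=n_workers) as pool:
        partials = pool.map(count_labels, chunks)
    total = Counter()
    for part in partials:
        total += part  # merge the per-process partial counts
    return total

# toy label volume, flattened: counts are label 0 -> 3, 1 -> 2, 2 -> 3
print(parallel_label_counts([0, 0, 1, 2, 2, 2, 1, 0]))
```

The merge step hints at why this is inefficient: every chunk's result must be serialized, sent back, and combined, which for real image volumes adds substantial inter-process communication on top of the process start-up cost.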
I am currently running some benchmarks to document the runtime of the respective versions.
@m-reuter there is also the question of whether we even want to include and distribute the "legacy algorithms". Including them in one commit and removing them again in a follow-up might be an option, but I do not expect we want to add numba as a dependency (which would really be required for acceptable run times).