Avoid loops in s2fft.sampling.reindex
functions to reduce compile and run times
#245
+22
−34
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The index conversion functions in
s2fft.sampling.reindex
currently use Python loops (overL
iterations) to perform the reindexing operations slice by slice. As these loops are unrolled when JIT compiling this leads to long compile times asL
gets bigger.This PR refactors the reindexing functions to instead construct arrays of indices and use these to gather or scatter the relevant entries in single indexing operations, thus avoiding the loops. For the functions converting between HEALPix and 2D harmonic coefficient indexing, we can use the$\mathcal{O}(L^2)$ in size - however this is no larger than the memory requirements of the harmonic coefficient arrays themselves, and the computational graphs of the resulting compiled functions are much smaller (no longer scaling with
numpy.triu_indices
function to construct the relevant index arrays. For the maps between 1D and 2D layouts we can usenumpy.where
on a boolean array constructed so that the relevant array elements areTrue
. In both cases the explicitly constructed index arrays add some memory overhead - as they depend onlyL
which is static they will be constants at compile time, and in all cases are I believeL
) so its unclear if overall the memory requirements are much larger.I've done some benchmarking of the new implementations here compared to the current implementations and in all cases tested the compile times are significantly lower and the run times comparable or significantly better:
https://gist.github.com/matt-graham/7a2b5b77f51b5301b8910ced176d8a8a
For example for the
flm_2d_to_hp_fast
function, the following plot shows the compile times of the current implementation ("Current"), the proposed implementation ("triu_indices based") and two other alternatives I tried out ("Slice based" and "Nested fori_loop") for variousL
valuesand the corresponding plot for run times of the compiled functions
On both compile and run times the proposed "triu_indices based" implementation is significantly quicker than both the current implementation and the other alternatives considered.
EDIT: There was a slight bug in the
flm_hp_to_2d_fast
reimplementation now hopefully fixed. The run times for the new function are now a little slower but the compile time is still much better than current implementation and importantly not quickly growing withL
.