This is a direct implementation of the radix sort algorithm described in:
Marco Zagha and Guy E. Blelloch. “Radix Sort for Vector Multiprocessors.”
Conference on High Performance Networking and Computing, pp. 712-721, 1991.
with ideas (the transpose function, some kernel tricks) taken from:
Philippe Helluy. “A portable implementation of the radix sort algorithm in OpenCL.” HAL, 2011.