Embedding gradient performance optimization on GPU #16355
Conversation
@sxjscience FYI
LGTM
Nice! LGTM. So the BinarySearch version of FindBounds has complexity O(|V| log |N|), where |V| is the vocabulary size and |N| is the number of indices. I guess our initial version (https://github.com/dmlc/mshadow/blob/bc49327a44650c3f2b427e953ff95d2c27566c04/mshadow/cuda/tensor_gpu-inl.cuh#L619-L672) has complexity O(|N|) for finding the boundaries. Thus, in some workloads (those in which |N| is small), the O(|N|) version might be faster.
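The two boundary-finding strategies compared in the comment above can be sketched on the host as follows. This is only an illustration in plain sequential C++, not the PR's CUDA kernel or the mshadow code: the names `FindBoundsBinarySearch`, `FindBoundsLinearScan`, and `Bounds` are hypothetical, and the sketch assumes the indices are already sorted and lie in `[0, vocab_size)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open range [first, second) of positions in the sorted index array.
// Hypothetical helper type, used only in this sketch.
using Bounds = std::pair<std::size_t, std::size_t>;

// Binary-search variant: two searches per vocabulary id, O(|V| log |N|) total.
std::vector<Bounds> FindBoundsBinarySearch(const std::vector<int>& sorted_idx,
                                           int vocab_size) {
  assert(std::is_sorted(sorted_idx.begin(), sorted_idx.end()));
  std::vector<Bounds> bounds(vocab_size);
  for (int v = 0; v < vocab_size; ++v) {
    auto lo = std::lower_bound(sorted_idx.begin(), sorted_idx.end(), v);
    auto hi = std::upper_bound(lo, sorted_idx.end(), v);
    bounds[v] = {static_cast<std::size_t>(lo - sorted_idx.begin()),
                 static_cast<std::size_t>(hi - sorted_idx.begin())};
  }
  return bounds;
}

// Linear-scan variant: one pass over the indices, O(|N|) for the scan itself.
// Ids that never occur keep the empty range {0, 0}. Note that zero-filling
// the output vector in this host sketch still costs O(|V|).
std::vector<Bounds> FindBoundsLinearScan(const std::vector<int>& sorted_idx,
                                         int vocab_size) {
  assert(std::is_sorted(sorted_idx.begin(), sorted_idx.end()));
  std::vector<Bounds> bounds(vocab_size);
  const std::size_t n = sorted_idx.size();
  for (std::size_t i = 0; i < n; ++i) {
    const int v = sorted_idx[i];
    assert(v >= 0 && v < vocab_size);
    // A run of equal ids starts where the previous element differs...
    if (i == 0 || sorted_idx[i - 1] != v) bounds[v].first = i;
    // ...and ends where the next element differs.
    if (i + 1 == n || sorted_idx[i + 1] != v) bounds[v].second = i + 1;
  }
  return bounds;
}
```

Both functions return the same range for every id that occurs in the input. They differ only in how an absent id is represented: the binary-search variant returns an empty range at the insertion point, the linear scan leaves `{0, 0}`.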
Moises did a performance comparison between the new version and both the old one and the old buggy one. The new kernel is faster than the old working version in all cases and about as fast as the buggy one. The biggest performance difference actually appears when varying the number of distinct elements in the input data, since a small number of distinct elements limits parallelism in the backward pass.
* Add Embedding backward Op for GPU
* Add some code documentation
* Use unnamed namespace for integer log2 function
* Fix lint issues
* Fix one more lint problem
* Remove unnecessary conditions ops
* Fix one more lint problem
Description
This PR includes a specific Embedding-backward operator for GPU.
Two new CUDA kernels have been implemented for improving the performance of the operator when using GPU.
According to our measurements on Volta GPUs, the previous version took 2.2 ms,
whereas the new implementation takes 0.3 ms, i.e. more than a 7x speedup.
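For reference, the computation that an Embedding backward pass performs can be written as a sequential scatter-add: each row of the output gradient is accumulated into the weight-gradient row selected by its index. The sketch below is a plain C++ illustration of that definition only. It is not either of the CUDA kernels added in this PR, and the function name `EmbeddingBackward` and the row-major layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// idx:      N vocabulary ids, each in [0, vocab_size).
// grad_out: N rows of width `dim`, stored row-major (assumed layout).
// Returns the weight gradient: vocab_size rows of width `dim`, row-major.
std::vector<float> EmbeddingBackward(const std::vector<int>& idx,
                                     const std::vector<float>& grad_out,
                                     int vocab_size, int dim) {
  assert(grad_out.size() == idx.size() * static_cast<std::size_t>(dim));
  std::vector<float> grad_weight(
      static_cast<std::size_t>(vocab_size) * dim, 0.0f);
  for (std::size_t i = 0; i < idx.size(); ++i) {
    assert(idx[i] >= 0 && idx[i] < vocab_size);
    const std::size_t row = static_cast<std::size_t>(idx[i]);
    // Repeated ids accumulate into the same row of the weight gradient.
    for (int d = 0; d < dim; ++d) {
      grad_weight[row * dim + d] += grad_out[i * dim + d];
    }
  }
  return grad_weight;
}
```

Because repeated ids write to the same output row, a parallel version has to coordinate those updates in some way, which is why the index distribution matters for performance.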
Checklist
Essentials
Changes