Add int8 gemm recipe #1614
Conversation
Thanks for sharing the experiences.
topi/recipe/gemm/gemm_int8.py (review comment on an outdated diff):

```python
BL = s.cache_read(BB, 'local', [C])
CC = s.cache_write(C, 'local')
...
dot = intrin_dot()
```
Move this declaration out of the template. This can accelerate feature extraction during tuning.
@merrymercy Thanks for your comments.
@merrymercy can you explicitly approve or suggest further comments?
Thanks @vinx13 @merrymercy, this is now merged.
This PR adds an int8 GEMM recipe tuned with AutoTVM to topi.

Some interesting facts:
AutoTVM: Using AutoTVM to tune all tile sizes is unlikely to produce the best config because the config space is too large. Narrowing the search space by adding a few constraints (e.g. removing tile sizes that are too big or too small) speeds up tuning. The performance after 1000 trials is very close to the best performance I tuned manually.
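A minimal sketch of how such constraints can be expressed, using the TVM/AutoTVM API from around the time of this PR; the simplified compute definition, knob names, number of split outputs, and the bounds in the filter are placeholders, not the exact constraints used in the recipe.

```python
import tvm
from tvm import autotvm

@autotvm.template
def gemm_int8_demo(n, m, l):
    # Simplified int8 GEMM compute (placeholder, not the recipe's definition).
    A = tvm.placeholder((n, l), name='A', dtype='int8')
    B = tvm.placeholder((m, l), name='B', dtype='int8')
    k = tvm.reduce_axis((0, l), name='k')
    C = tvm.compute((n, m),
                    lambda i, j: tvm.sum(A[i, k].astype('int32') *
                                         B[j, k].astype('int32'), axis=k),
                    name='C')
    s = tvm.create_schedule(C.op)

    cfg = autotvm.get_config()
    y, x = s[C].op.axis
    # Drop candidate tile sizes whose innermost factor is too small or too
    # large; this prunes most of the useless points from the config space.
    # The bounds 4 and 64 are placeholders.
    cfg.define_split('tile_y', y, num_outputs=4,
                     filter=lambda e: 4 <= e.size[-1] <= 64)
    cfg.define_split('tile_x', x, num_outputs=4,
                     filter=lambda e: 4 <= e.size[-1] <= 64)
    cfg['tile_y'].apply(s, C, y)
    cfg['tile_x'].apply(s, C, x)
    return s, [A, B, C]
```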
Most of the performance gain comes from optimizing memory accesses, for example with virtual threads (specifically 8x8 or 16x4 vthreads in this case). A few bank conflicts remain unresolved: conflicts when transferring data from shared memory to local memory cannot be removed through storage alignment, because int8x16 elements are loaded from global memory into shared memory. That pattern requires the data to be 16-byte aligned, so I use 48 in `storage_align`, which may be less helpful than a prime. Loading four int8 values at a time satisfies the alignment constraint but is much slower.

Double buffering: the effect of double buffering is related to the block size. Sometimes it is slower because of the increased shared memory size or the number of registers used.
Shuffling: I tried using the shuffle instruction instead of shared memory (via `cache_read` with warp scope, see the short sketch below) but did not get better performance. This also imposes a constraint on thread numbers (TVM currently requires the extent of threadIdx.x to be 32), which makes it less flexible.

cubin vs. ptx: there is no preference for either one, since they show comparable performance.
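For reference, a sketch of what the warp-scope variant looks like relative to the shared-memory sketch above; it reuses the placeholder names (s, A, B, CC) from that sketch and is an assumption about how the experiment was set up, not code from the recipe.

```python
# Warp-scope caches instead of shared memory: data exchanged between threads of
# a warp is lowered to shuffle instructions. This variant requires the extent
# of threadIdx.x to be exactly 32 (one warp).
AA = s.cache_read(A, 'warp', [CC])
BB = s.cache_read(B, 'warp', [CC])
```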
nvprof shows that some of the best configs from AutoTVM use too many registers. Building with the `-maxrregcount` option can help, although the performance improvement is very small. This requires a custom cuda_compile callback; since AutoTVM already registers one, we need to forcibly register another callback (see the sketch below).

It may be helpful to reorder the reduction across different threads: manually changing the generated CUDA code in this way shows a ~2 TOPS performance gain, but this is not currently supported.
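A minimal sketch of overriding the CUDA compile callback to pass `-maxrregcount`; the register cap of 96 is a placeholder, and `override=True` is what forcibly replaces the callback that AutoTVM has already registered.

```python
import tvm
from tvm.contrib import nvcc

# Forcibly replace the cuda compile callback registered by AutoTVM so that
# nvcc is invoked with a register cap. The limit of 96 is a placeholder.
@tvm.register_func('tvm_callback_cuda_compile', override=True)
def tvm_callback_cuda_compile(code):
    return nvcc.compile_cuda(code, target='ptx', options=['-maxrregcount=96'])
```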
The best performance tested on a GTX 1080 is ~21 TOPS, while cuBLAS reaches ~29 TOPS.
cc @tqchen @merrymercy