Add int8 gemm recipe #1614

Merged
merged 2 commits into apache:master from the recipe/gemm_int8 branch on Aug 21, 2018
Conversation

@vinx13 vinx13 (Member) commented Aug 17, 2018

This PR adds an int8 GEMM recipe tuned with AutoTVM to topi.

Some interesting findings:

  • AutoTVM: Tuning all tile sizes with AutoTVM is unlikely to produce the best config because the config space is huge. Narrowing the search space with a few constraints (e.g. rejecting tile sizes that are too big or too small) speeds up tuning; the performance after 1000 trials is very close to the best config I tuned manually. (See the first sketch right after this list.)

  • Most of the performance gain comes from optimizing memory accesses, for example by using virtual threads (8x8 or 16x4 vthreads in this case). A few bank conflicts remain unresolved: conflicts when transferring data from shared memory to local memory cannot be removed with storage alignment, because int8x16 elements are loaded into shared memory from global memory. This pattern requires the data to be 16-byte aligned, so I use a factor of 48 in storage_align, which may be less effective than a prime. Loading four int8 values at a time removes the alignment constraint but is much slower. (The second sketch further below illustrates these schedule primitives.)

  • Double buffering: The effect of double buffering depends on the block size. It can sometimes make things slower because of the increased shared memory usage or register count.

  • Shuffling: I tried using the shuffle instruction instead of shared memory (via cache_read with warp scope) but did not get better performance. It also imposes a constraint on the number of threads (TVM currently requires the extent of threadIdx.x to be 32), which makes the schedule less flexible.

  • cubin vs. ptx: There is no clear preference for either one, since they show comparable performance.

  • nvprof shows that some of the best configs found by AutoTVM use too many registers. Building with the -maxrregcount option can help, though the performance improvement is very small. This requires a custom cuda_compile callback; since AutoTVM already registers one, we need to forcibly register another callback that overrides it.

  • Reordering the reduction across different threads may also help: manually changing the generated CUDA code shows a ~2 TOPS gain, but this is not currently supported.
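Below is a minimal sketch of the search-space narrowing mentioned in the first bullet. It is not the recipe's actual template: the knob names, the number of split outputs, and the size limits are illustrative assumptions; the point is that the filter argument of cfg.define_split rejects bad tile sizes up front instead of letting the tuner explore them.

from tvm import autotvm

def define_gemm_space(cfg, y, x, k):
    # cfg is an AutoTVM config object; y, x are the spatial axes and k the
    # reduction axis of the GEMM stage being scheduled.
    cfg.define_split('tile_y', y, num_outputs=4,
                     filter=lambda e: 2 <= e.size[-1] <= 16)     # drop too-small / too-big inner tiles
    cfg.define_split('tile_x', x, num_outputs=4,
                     filter=lambda e: 2 <= e.size[-1] <= 16)
    cfg.define_split('tile_k', k, num_outputs=2,
                     filter=lambda e: e.size[-1] in (4, 8, 16))  # keep factors that suit vectorized int8 loads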

The best performance tested on a GTX 1080 is ~21 TOPS, while cuBLAS reaches ~29 TOPS.
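To make the memory-related points above concrete, here is a minimal schedule sketch using the 2018-era TVM API. It is not the recipe's tuned schedule: the tile sizes, the plain int32 accumulation (no dp4a tensorization), the omission of the local register caches, and the placement of the storage_align factor of 48 are all illustrative assumptions; it only shows where virtual threads, storage_align and double_buffer enter a GEMM schedule.

import tvm

n = 2048
A = tvm.placeholder((n, n), name='A', dtype='int8')
B = tvm.placeholder((n, n), name='B', dtype='int8')
k = tvm.reduce_axis((0, n), name='k')
C = tvm.compute((n, n),
                lambda i, j: tvm.sum(A[i, k].astype('int32') * B[j, k].astype('int32'),
                                     axis=k),
                name='C')

s = tvm.create_schedule(C.op)
AA = s.cache_read(A, 'shared', [C])
BB = s.cache_read(B, 'shared', [C])
CC = s.cache_write(C, 'local')

# illustrative tiling only: 128x128 output block, 8x8 virtual threads, 16x16 threads
y, x = s[C].op.axis
by, y = s[C].split(y, factor=128)
bx, x = s[C].split(x, factor=128)
vy, y = s[C].split(y, nparts=8)
vx, x = s[C].split(x, nparts=8)
ty, y = s[C].split(y, nparts=16)
tx, x = s[C].split(x, nparts=16)
s[C].reorder(by, bx, vy, vx, ty, tx, y, x)
s[C].bind(by, tvm.thread_axis('blockIdx.y'))
s[C].bind(bx, tvm.thread_axis('blockIdx.x'))
s[C].bind(vy, tvm.thread_axis('vthread', name='vy'))  # virtual threads
s[C].bind(vx, tvm.thread_axis('vthread', name='vx'))
s[C].bind(ty, tvm.thread_axis('threadIdx.y'))
s[C].bind(tx, tvm.thread_axis('threadIdx.x'))

s[CC].compute_at(s[C], tx)
ko, ki = s[CC].split(s[CC].op.reduce_axis[0], factor=16)

for load in [AA, BB]:
    s[load].compute_at(s[CC], ko)
    # cooperative fetching: all 16x16 threads of the block fill the shared tile
    fused = s[load].fuse(*s[load].op.axis)
    fused, fx = s[load].split(fused, factor=16)
    fused, fy = s[load].split(fused, factor=16)
    s[load].bind(fx, tvm.thread_axis('threadIdx.x'))
    s[load].bind(fy, tvm.thread_axis('threadIdx.y'))
    # pad the shared tile's row stride to a multiple of 48 elements: 48 is a
    # multiple of 16, so 16-byte-aligned int8x16 loads stay legal while the
    # shared-memory bank mapping shifts (the factor from the description above)
    s[load].storage_align(s[load].op.axis[0], 48, 0)
    # overlap the global->shared copy of the next k-tile with computation
    s[load].double_buffer()

print(tvm.lower(s, [A, B, C], simple_mode=True))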

cc @tqchen @merrymercy

@vinx13 vinx13 force-pushed the recipe/gemm_int8 branch 2 times, most recently from c9effa0 to 14ba4df on August 17, 2018 05:15
@merrymercy merrymercy (Member) commented Aug 17, 2018

Thanks for sharing your experience.

  1. You can forcibly register your own callback by overriding the existing one (a fuller sketch that adds -maxrregcount follows after this list):
from tvm import register_func

@register_func("tvm_callback_cuda_compile", override=True)
def tvm_callback_cuda_compile(code):
    pass
  2. By default this script should use a pre-tuned configuration, so users can test the performance directly. Then you can provide a do_tuning flag for users who are interested in tuning for other shapes or devices.

  3. Did you try other shapes?
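For reference, a minimal sketch of such a forcibly-registered callback that also passes -maxrregcount through to nvcc, as discussed in the PR description (2018-era API; the register cap of 32 is an arbitrary placeholder, not a recommended value):

import tvm
from tvm.contrib import nvcc

@tvm.register_func('tvm_callback_cuda_compile', override=True)
def tvm_callback_cuda_compile(code):
    # override the callback AutoTVM registered and cap per-thread register usage
    return nvcc.compile_cuda(code, target='ptx', options=['-maxrregcount=32'])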

BL = s.cache_read(BB, 'local', [C])
CC = s.cache_write(C, 'local')

dot = intrin_dot()
Review comment (Member) on the diff above:
Move this declaration out of the template. This can accelerate feature extraction during tuning.

@vinx13 vinx13 (Member, Author) commented Aug 20, 2018

@merrymercy Thanks for your comments.
I added a DO_TUNING flag and PRETUNED_INDEX (sketched below).
I've also tried other shapes; they yield similar results and a similar gap to cuBLAS. Some shapes may be 5%~10% slower than this one.
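For context, a minimal sketch of that flow, using the 2018-era AutoTVM API. It mirrors the recipe's structure but is not the recipe itself: `task`, `measure_option`, the `gemm_int8` template and the shapes `n, m, l` are assumed to be defined as in the recipe, and the index and log-file name are placeholders.

import tvm
from tvm import autotvm

DO_TUNING = False
PRETUNED_INDEX = 0   # placeholder: the recipe ships a known-good index for this shape

if DO_TUNING:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(n_trial=1000, measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file('gemm_int8.log')])
    dispatch_context = autotvm.apply_history_best('gemm_int8.log')
else:
    # no tuning: pick the pre-tuned point directly out of the config space
    config = task.config_space.get(PRETUNED_INDEX)
    dispatch_context = autotvm.task.ApplyConfig(config)

with dispatch_context:
    with tvm.target.create('cuda'):
        s, arg_bufs = gemm_int8(n, m, l)
        func = tvm.build(s, arg_bufs, 'cuda', name='gemm_int8')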

@tqchen tqchen self-assigned this Aug 21, 2018
@tqchen tqchen (Member) commented Aug 21, 2018

@merrymercy can you explicitly approve or suggest further comments?

@tqchen tqchen merged commit 21e1301 into apache:master Aug 21, 2018
@tqchen tqchen (Member) commented Aug 21, 2018

Thanks @vinx13 @merrymercy, this is now merged.

@tqchen tqchen added this to the v0.5 milestone Aug 21, 2018
@tqchen tqchen mentioned this pull request Aug 21, 2018
FrozenGene pushed a commit to FrozenGene/tvm that referenced this pull request Dec 27, 2018