Add int8 gemm recipe #1614
Conversation
Thanks for sharing the experiences.
topi/recipe/gemm/gemm_int8.py (review comment on an outdated diff):

```python
BL = s.cache_read(BB, 'local', [C])
CC = s.cache_write(C, 'local')
...
dot = intrin_dot()
```
Move this declaration out of the template. This can accelerate feature extraction during tuning.
@merrymercy Thanks for your comments.
@merrymercy can you explicitly approve or suggest further comments?
Thanks @vinx13 @merrymercy, this is now merged.
This PR adds an int8 GEMM recipe tuned with AutoTVM to topi.

Some interesting facts:
AutoTVM: Using AutoTVM to tune all tile sizes is unlikely to produce the best config because the config space is too large. Narrowing the search space by adding a few constraints (e.g. removing tile sizes that are too big or too small) speeds up tuning. The performance after 1000 trials is very close to the best performance I tuned manually.
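A minimal sketch of how such constraints can be expressed, using the TVM/AutoTVM API from around the time of this PR; the simplified compute definition, knob names, number of split outputs, and the bounds in the filter are placeholders, not the exact constraints used in the recipe.

```python
import tvm
from tvm import autotvm

@autotvm.template
def gemm_int8_demo(n, m, l):
    # Simplified int8 GEMM compute (placeholder, not the recipe's definition).
    A = tvm.placeholder((n, l), name='A', dtype='int8')
    B = tvm.placeholder((m, l), name='B', dtype='int8')
    k = tvm.reduce_axis((0, l), name='k')
    C = tvm.compute((n, m),
                    lambda i, j: tvm.sum(A[i, k].astype('int32') *
                                         B[j, k].astype('int32'), axis=k),
                    name='C')
    s = tvm.create_schedule(C.op)

    cfg = autotvm.get_config()
    y, x = s[C].op.axis
    # Drop candidate tile sizes whose innermost factor is too small or too
    # large; this prunes most of the useless points from the config space.
    # The bounds 4 and 64 are placeholders.
    cfg.define_split('tile_y', y, num_outputs=4,
                     filter=lambda e: 4 <= e.size[-1] <= 64)
    cfg.define_split('tile_x', x, num_outputs=4,
                     filter=lambda e: 4 <= e.size[-1] <= 64)
    cfg['tile_y'].apply(s, C, y)
    cfg['tile_x'].apply(s, C, x)
    return s, [A, B, C]
```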
Most of the performance gain comes from optimizing memory accesses, for example with virtual threads (specifically 8x8 or 16x4 vthreads in this case). A few bank conflicts remain unresolved: conflicts when transferring data from shared memory to local memory cannot be removed through storage alignment, because int8x16 elements are loaded from global memory into shared memory. That pattern requires the data to be 16-byte aligned, so I use 48 in `storage_align`, which may be less helpful than a prime. Loading four int8 values at a time satisfies the alignment constraint but is much slower.

Double buffering: the effect of double buffering is related to the block size. Sometimes it is slower because of the increased shared memory size or the number of registers used.
Shuffling: I tried using the shuffle instruction instead of shared memory (via `cache_read` with warp scope, see the short sketch below) but did not get better performance. This also imposes a constraint on thread numbers (TVM currently requires the extent of threadIdx.x to be 32), which makes it less flexible.

cubin vs. ptx: there is no preference for either one, since they show comparable performance.
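For reference, a sketch of what the warp-scope variant looks like relative to the shared-memory sketch above; it reuses the placeholder names (s, A, B, CC) from that sketch and is an assumption about how the experiment was set up, not code from the recipe.

```python
# Warp-scope caches instead of shared memory: data exchanged between threads of
# a warp is lowered to shuffle instructions. This variant requires the extent
# of threadIdx.x to be exactly 32 (one warp).
AA = s.cache_read(A, 'warp', [CC])
BB = s.cache_read(B, 'warp', [CC])
```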
nvprof shows that some of the best configs from AutoTVM use too many registers. Building with the `-maxrregcount` option can help, although the performance improvement is very small. This requires a custom cuda_compile callback; since AutoTVM already registers one, we need to forcibly register another callback (see the sketch below).

It may be helpful to reorder the reduction across different threads: manually changing the generated CUDA code in this way shows a ~2 TOPS performance gain, but this is not currently supported.
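A minimal sketch of overriding the CUDA compile callback to pass `-maxrregcount`; the register cap of 96 is a placeholder, and `override=True` is what forcibly replaces the callback that AutoTVM has already registered.

```python
import tvm
from tvm.contrib import nvcc

# Forcibly replace the cuda compile callback registered by AutoTVM so that
# nvcc is invoked with a register cap. The limit of 96 is a placeholder.
@tvm.register_func('tvm_callback_cuda_compile', override=True)
def tvm_callback_cuda_compile(code):
    return nvcc.compile_cuda(code, target='ptx', options=['-maxrregcount=96'])
```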
The best performance tested on a GTX 1080 is ~21 TOPS, while cuBLAS reaches ~29 TOPS.
cc @tqchen @merrymercy