
[MXNET-1446] Quantization: intgemm matrix multiply wrappers #17559

Merged: 81 commits into apache:master on Aug 31, 2020

Conversation

kpuatamazon
Contributor

Description

This pull request adds wrappers for the intgemm matrix multiplication library: https://github.com/kpu/intgemm .

A performance comparison with DNNL (aka MKL-DNN) is at kpu/intgemm#59.

The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the efficiency task at the 2018 Workshop on Neural Generation and Translation: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this pull request is to enable similar functionality in Sockeye: awslabs/sockeye#771 .

Quantized Sockeye is 2.95x as fast as the unquantized baseline. One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything.

intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0.
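As a rough numeric illustration (a sketch with made-up values, not code from this PR; numpy int16 arithmetic wraps, so saturation is emulated explicitly): because quantized parameters cluster near zero, int8 products fit comfortably in 16 bits, and a saturating 16-bit accumulator almost never clips.

import numpy as np

rng = np.random.default_rng(0)
# Quantized NN parameters cluster near zero, so products stay small.
x = rng.normal(0, 10, size=64).round().astype(np.int16)
w = rng.normal(0, 10, size=64).round().astype(np.int16)

# Saturating 16-bit accumulation; numpy would wrap, so clamp by hand.
acc = 0
for p in (x * w):
    acc = max(-32768, min(32767, acc + int(p)))

exact = int(x.astype(np.int64) @ w.astype(np.int64))
assert acc == exact  # no saturation occurred for these near-zero values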

Because x86 only offers an unsigned * signed multiply instruction and most people want signed * signed, there are two strategies one can take (a small numeric sketch of strategy 1 follows below).

1. Add 128 to the data so it becomes unsigned. This biases the output. DNNL computes the bias on the fly by summing the weights, then subtracts it out during the GEMM. intgemm computes the bias in advance and folds it into the bias term, with no overhead at runtime. A drawback of this strategy is that it enlarges the accumulator, requiring more upcasting with the expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This costs extra instructions in the hot loop but keeps the accumulator small, so accumulating into 32-bit integers is less necessary and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
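A minimal numpy sketch (illustrative, not from the PR) of the arithmetic behind strategy 1: adding 128 to the activations adds exactly 128 * column_sum(weights) to each output, a correction that can be computed once from the weights and folded into the bias.

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(5, 64))  # signed activations
w = rng.integers(-128, 128, size=(64, 8))  # signed weights

# Shift activations into unsigned range for the unsigned * signed
# instruction, then subtract the precomputed per-column correction.
unsigned_product = (a + 128) @ w
correction = 128 * w.sum(axis=0)  # one scalar per output column
assert (unsigned_product - correction == a @ w).all()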

Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR).
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • submodule for intgemm
  • intgemm_prepare_data and intgemm_prepare_weight operators to convert operands from fp32
  • intgemm_take_weight for taking weights in intgemm's weight format, which is useful for vocabulary shortlists in Sockeye.
  • intgemm_fully_connected for matrix multiply

Comments

Backward compatible.
intgemm requires the inner dimension to be a multiple of 64 for efficiency and alignment reasons. Currently the output dimension must be a multiple of 8, but there is in-progress work in intgemm to remove that restriction.
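Given these constraints, a caller might validate shapes up front; a minimal sketch (the helper is hypothetical, not part of this PR):

def check_intgemm_shapes(data_shape, weight_shape):
    # Inner dimension shared by data and weight: multiple of 64.
    assert data_shape[-1] % 64 == 0, "inner dimension must be a multiple of 64"
    assert data_shape[-1] == weight_shape[-1], "inner dimensions must match"
    # Output dimension: currently a multiple of 8.
    assert weight_shape[0] % 8 == 0, "output dimension must be a multiple of 8"

check_intgemm_shapes((5, 64), (8, 64))  # the shapes used in this PR's examples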

Kenneth Heafield added 29 commits November 28, 2019 15:41
import mxnet as mx
# fp32 operands; note the inner dimension of 64, as intgemm requires.
a = mx.nd.random_uniform(low=-1.0, high=1.0, shape=[5, 64])
b = mx.nd.random_uniform(low=-1.0, high=1.0, shape=[8, 64])
# Scale so the largest-magnitude weight maps to 127, the int8 maximum.
b_scale = 127.0 / mx.nd.contrib.intgemm_maxabsolute(b).asscalar()
# Quantize into intgemm's weight layout, typically done once at load time
# (this commit's intgemm_prepareb was later renamed intgemm_prepare_weight).
b_prepared = mx.nd.contrib.intgemm_prepareb(b, multiplier=b_scale)
# fp32 reference result.
mx.nd.FullyConnected(a, b, num_hidden=8, no_bias=True, flatten=False)
# Quantized equivalent; out_float_multiplier undoes the weight scaling.
mx.nd.contrib.intgemm_fully_connected(a, b_prepared, out_float_multiplier=1.0/b_scale, num_hidden=8, no_bias=True, flatten=False)
…izedTransposed.

This will make it easier to store a consistent file on disk.
@szha
Member

szha commented Aug 20, 2020

cc @leezu to review the build logic.

@szha
Member

szha commented Aug 20, 2020

Otherwise LGTM. I reviewed the tests and op implementation.

@szha dismissed their stale review August 28, 2020 16:30

addressed concerns. thanks @kpuatamazon!

@kpuatamazon
Contributor Author

@leezu Ready?

@kpuatamazon
Contributor Author

@mxnet-bot run ci [unix-gpu]
Looks like an unrelated test failure.

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@leezu
Contributor

leezu commented Aug 31, 2020

Thank you @kpuatamazon

@leezu leezu merged commit 1393602 into apache:master Aug 31, 2020
samskalicky pushed a commit that referenced this pull request Sep 16, 2020
* cherry-pick intgemm from master, fix build

* Fix test to conform to 1.x

* Makefile supporting intgemm compilation

* Stricter dependencies on git checkout of intgemm

* Operators depend on mkldnn

* Don't compile intgemm with gcc older than 5

* Fix intgemm test for windows on 1.x by not using pytest

* Update intgemm to use template arguments for integer immediates

* Try to fix clang3.6

* Ban gcc < 5 in cmake

* Update intgemm with gcc 5.5 debug workaround