[ONNX] Add MatMulInteger importer #10450
Conversation
@classmethod
def _impl_v10(cls, inputs, attr, params):
    a = inputs[0]
Hmm so pretty sure you can just do relay.nn.matmul(..., out_dtype='int32'). You don't need all of this QLinearMatMul stuff to handle accumulation without overflow imo.
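A minimal sketch of that suggestion, assuming zero points of 0 and scales of 1 so no QNN ops are needed; the variables standing in for the two ONNX inputs are illustrative, not the PR's code:

from tvm import relay

# Hypothetical stand-ins for the two ONNX MatMulInteger inputs.
a = relay.var("a", shape=(2, 3), dtype="uint8")
b = relay.var("b", shape=(3, 4), dtype="uint8")

# Multiply directly and accumulate into int32; this is only equivalent to
# MatMulInteger when both zero points are 0 (scales do not appear in the op).
out = relay.nn.matmul(a, b, out_dtype="int32")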
If scale is always 1 and zero point is always 0, I think that's true.
My impression was that MatMulInteger exactly corresponds to our qnn.dense. We can use the QLinearMatMul converter, but we want to skip requantize.
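For reference, a hedged sketch of what "qnn.dense without the final requantize" could look like; the variable names, shapes, and uint8 choice are illustrative, not this PR's code:

from tvm import relay

a = relay.var("a", shape=(2, 3), dtype="uint8")
# qnn.dense expects the second operand in (N, K) layout, i.e. transposed
# relative to a matmul-style (K, N) operand.
b_t = relay.var("b_t", shape=(4, 3), dtype="uint8")

dense = relay.qnn.op.dense(
    a,
    b_t,
    input_zero_point=relay.const(0, "int32"),
    kernel_zero_point=relay.const(0, "int32"),
    input_scale=relay.const(1.0, "float32"),
    kernel_scale=relay.const(1.0, "float32"),
    units=4,
    out_dtype="int32",
)
# MatMulInteger stops here (int32 output); QLinearMatMul would additionally
# requantize back to int8/uint8 using the output scale and zero point.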
https://github.com/onnx/onnx/blob/main/docs/Operators.md#matmulinteger <-- I was fixated on the line where it said it was identical to numpy's. It appears the zero points are not always fixed, so using QLinearMatMul is probably the best choice.
Modified the QLinearMatMul converter to skip requantize in the MatMulInteger case.
b_scale = _op.const(1.0, dtype="float32")
out_scale = _op.const(1.0, dtype="float32")

a_zero_point = _op.const(0.0, dtype=a_dtype)
The zero point does not seem to be fixed to 0? The scale inputs appear to be absent though, which is weird. You might have to read the codebase and see what the intention really is, since the documentation is confusing (and my guess has a mistake in it): https://github.com/onnx/onnx/blob/main/docs/Operators.md#matmulinteger
Yeah, zp isn't necessarily fixed to 0, but the docs say 0 should be the default. I do this a few lines down:

if len(inputs) == 4:
    a_zero_point = inputs[2]
    b_zero_point = inputs[3]

From the example it looks like the scale factors are 1 by default.
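Putting the two snippets together, a hedged sketch of the default-vs-optional-input handling described above; b_dtype is an assumption mirroring how a_dtype is handled, not necessarily the PR's exact code:

# Default zero points of 0 unless ONNX provides the optional 3rd/4th inputs.
a_zero_point = _op.const(0.0, dtype=a_dtype)
b_zero_point = _op.const(0.0, dtype=b_dtype)
if len(inputs) == 4:
    a_zero_point = inputs[2]
    b_zero_point = inputs[3]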
It looks like the onnx importer test is failing on
I think there should be a dense legalize pass or something that adds padding and slicing to handle such cases.
):
    # no need to pad
    return None

-candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32)]
+candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32), (4, 4, 4)]
PTAL @AndrewZhaoLuo @masahi
In particular this allows tighter padding boxes in order to enable the ONNX CUDA tests where the shapes are on the order of (2, 3) x (3, 4), because padding fails if the padding boxes are not densely populated enough with real data. I imagine one potential downside is that larger tensors might be computed faster in broad sweeps with 16x16x16 padding vs the finer-grained 4x4x4, but I have no concrete evidence either way.
wdyt, are we ok with adding the 4x4x4 padding target?
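A quick worked example of why the tighter box matters for those shapes; the helper name is illustrative, not code from this PR:

def pad_amounts(M, K, N, box):
    # How much each dim must grow to reach the next multiple of the box.
    bm, bk, bn = box
    return ((-M) % bm, (-K) % bk, (-N) % bn)

# (2, 3) x (3, 4), i.e. M, K, N = 2, 3, 4
print(pad_amounts(2, 3, 4, (16, 16, 16)))  # (14, 13, 12): almost all padding
print(pad_amounts(2, 3, 4, (4, 4, 4)))     # (2, 1, 0): mostly real data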
Can you try decoupling the (4, 4, 4) padding from the tensorcore stuff in this file? The shape (4, 4, 4) is invalid for tensorcore.
For int8, rather than hard-coding (4, 4, 4) (which doesn't work for other shapes like 13), I think the right solution is to "pad to the nearest multiple of 4 greater than the given dim".
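A minimal sketch of that rounding rule, not code from this PR:

def round_up_to_multiple_of_4(dim):
    # Pad to the nearest multiple of 4 that is >= dim.
    return ((dim + 3) // 4) * 4

assert round_up_to_multiple_of_4(13) == 16
assert round_up_to_multiple_of_4(4) == 4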
extra_flops_ratio = _extra_flops(M, K, N, dm, dk, dn) / (M * K * N)

if extra_flops_ratio > 2:
    return None
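For context, a hedged sketch of what _extra_flops plausibly computes (the real helper lives elsewhere in tensorcore_alter_op.py): the additional multiply-accumulates introduced by padding M, K, N by dm, dk, dn.

def _extra_flops_sketch(M, K, N, dm, dk, dn):
    # Padded workload minus the original workload.
    return (M + dm) * (K + dk) * (N + dn) - M * K * N

# e.g. padding (2, 3, 4) up to (4, 4, 4) adds 64 - 24 = 40 extra MACs,
# an extra_flops_ratio of about 1.7, which stays under the > 2 cutoff.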
@masahi I moved the 4x4x4 treatment to this block. Is this what you meant, or did you want the 4x4x4 logic outside of tensorcore_alter_op.py entirely?
Yes, I think this is good. I don't like this file being named tensorcore_alter_op.py, but that's a different story.
But I think this code block is only useful for int8, so we probably need an if dtype in ["int8", "uint8"] check? Also I'm not clear what happens if padding is necessary but extra_flops_ratio > 2 is True.
And please do not nest the if extra_flops_ratio > 2: check, since it's a bit weird.
Sounds good.
So I suggest something like this:

do_pad_for_int8 = ...
if extra_flops_ratio > 2 and not do_pad_for_int8:
    logger.info(...)
    return None
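A hedged expansion of that suggestion; the dtype check standing in for do_pad_for_int8 and the log message are assumptions, not the PR's final code:

# Only tolerate a large padding overhead when we must pad for int8 tensorcore.
do_pad_for_int8 = dtype in ["int8", "uint8"]
if extra_flops_ratio > 2 and not do_pad_for_int8:
    logger.info("extra_flops_ratio %s > 2, not padding", extra_flops_ratio)
    return None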
Refactored the return None to come after all the dense logic, ordered slightly differently. PTAL.
-# The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32)
+# The shape of (M, K, N) must be multiple of
+# (16, 16, 16) or (32, 16, 8) or (8, 16, 32)
+# from https://arxiv.org/pdf/1811.09736.pdf
Can you remove this link? This is a well-known HW fact.
* implement matmulinteger
* rm test
* rm outdated comments
* fix lint and review
* wip
* fixes
* fix
* alter tests
* extra 4x4x4 step
* comments
As title. Re-uses QLinearMatMul's implementation. The only difference is that MatMulInteger accumulates into int32, not int8 or uint8.

cc @cconvey @AndrewZhaoLuo
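For anyone trying the importer, a rough end-to-end sketch; the shapes, names, and uint8 choice are illustrative, not from this PR:

import onnx
from onnx import helper, TensorProto
from tvm import relay

# Build a tiny ONNX graph with a single MatMulInteger node (no zero points).
node = helper.make_node("MatMulInteger", ["A", "B"], ["Y"])
graph = helper.make_graph(
    [node],
    "matmulinteger_example",
    inputs=[
        helper.make_tensor_value_info("A", TensorProto.UINT8, [2, 3]),
        helper.make_tensor_value_info("B", TensorProto.UINT8, [3, 4]),
    ],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.INT32, [2, 4])],
)
model = helper.make_model(graph)

# The converter should produce an int32-accumulating qnn matmul/dense in Relay.
mod, params = relay.frontend.from_onnx(model, shape={"A": (2, 3), "B": (3, 4)})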