
[ONNX] Add MatMulInteger importer #10450

Merged 10 commits into apache:main on Apr 13, 2022

Conversation

anwang2009 (Contributor)

As title. Re-uses QLinearMatMul's implementation. The only difference is that MatMulInteger accumulates into int32, not int8 or uint8.

cc @cconvey @AndrewZhaoLuo
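For context, here is a minimal NumPy sketch of the MatMulInteger semantics described in the ONNX operator docs (function and variable names are illustrative, not the importer's code): the zero points are subtracted and the product is accumulated in int32, with no requantize back to int8/uint8.

import numpy as np

def matmul_integer_ref(a, b, a_zero_point=0, b_zero_point=0):
    # Subtract zero points, then matmul with an int32 accumulator;
    # the output stays int32 (no requantize step).
    a32 = a.astype(np.int32) - np.int32(a_zero_point)
    b32 = b.astype(np.int32) - np.int32(b_zero_point)
    return a32 @ b32

a = np.random.randint(0, 255, size=(2, 3)).astype(np.uint8)
b = np.random.randint(0, 255, size=(3, 4)).astype(np.uint8)
out = matmul_integer_ref(a, b, a_zero_point=12)
assert out.dtype == np.int32 and out.shape == (2, 4)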


@classmethod
def _impl_v10(cls, inputs, attr, params):
    a = inputs[0]
Contributor

Hmm so pretty sure you can just do relay.nn.matmul(..., out_dtype='int32'). You don't need all of this QLinearMatMul stuff to handle accumulation without overflow imo.
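A minimal sketch of that suggestion, assuming the zero points really are zero (shapes here are made up):

from tvm import relay

# int8 inputs with an int32 accumulator; out_dtype="int32" avoids overflow,
# but this ignores zero points entirely.
a = relay.var("a", shape=(2, 3), dtype="int8")
b = relay.var("b", shape=(3, 4), dtype="int8")
out = relay.nn.matmul(a, b, out_dtype="int32")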

Contributor

If scale is always 1 and zero point is always 0, I think that's true.

Member

My impression was that MatMulInteger exactly corresponds to our qnn.dense. We can use the QLinearMatMul converter, but we want to skip requantize.
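A rough sketch of that correspondence, with scales fixed at 1.0 and placeholder zero points (shapes and names are illustrative; the weight is the transposed B matrix since qnn.dense expects (units, K)):

from tvm import relay

a = relay.var("a", shape=(2, 3), dtype="uint8")
bt = relay.var("bt", shape=(4, 3), dtype="uint8")  # B transposed to (units, K)
out = relay.qnn.op.dense(
    a,
    bt,
    input_zero_point=relay.const(0, "int32"),
    kernel_zero_point=relay.const(0, "int32"),
    input_scale=relay.const(1.0, "float32"),
    kernel_scale=relay.const(1.0, "float32"),
    units=4,
    out_dtype="int32",
)
# Stopping here, without a qnn.requantize, is the "skip requantize" part.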

Contributor

https://github.com/onnx/onnx/blob/main/docs/Operators.md#matmulinteger <-- I was fixated on the line where it says the behavior is identical to numpy's matmul. It appears the zero points are not always fixed, so using QLinearMatMul is probably the best choice.

Contributor Author

Modified the QLinearMatMul converter to skip requantize in the MatMulInteger case.

b_scale = _op.const(1.0, dtype="float32")
out_scale = _op.const(1.0, dtype="float32")

a_zero_point = _op.const(0.0, dtype=a_dtype)
Contributor

The zero point does not seem to be fixed to 0? The scales appear to be absent, though, which is weird. You might have to read the codebase and see what the intention really is, since the documentation is confusing (and my guess has a mistake in it): https://github.com/onnx/onnx/blob/main/docs/Operators.md#matmulinteger

Contributor Author

Yeah, the zero point isn't necessarily fixed to 0, but the docs say 0 should be the default. I handle this a few lines down:

        if len(inputs) == 4:
            a_zero_point = inputs[2]
            b_zero_point = inputs[3]

From the example it looks like the scale factors are 1 by default.

anwang2009 (Contributor Author)

It looks like the onnx importer test is failing on test_matmulinteger for cuda, since the cuda schedule generation for dense_int8 relies on the inner dimension of the multiply being a multiple of 4. The default onnx test's inner dimension is 3, and there is no generic schedule for non-4-multiples to fall back to. Should I close this PR until this issue can be addressed?

masahi (Member) commented Mar 3, 2022

I think there should be a dense legalize pass or something that adds padding and slicing to handle such cases.
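For illustration, a pure-NumPy sketch of that pad-then-slice idea (not the actual legalize pass):

import numpy as np

def padded_dense(a, b, multiple=4):
    # Pad (M, K, N) up to multiples of `multiple`, multiply, then slice the
    # result back to (M, N); zero padding leaves the valid region unchanged.
    M, K = a.shape
    _, N = b.shape
    pm, pk, pn = (-M) % multiple, (-K) % multiple, (-N) % multiple
    a_p = np.pad(a.astype(np.int32), ((0, pm), (0, pk)))
    b_p = np.pad(b.astype(np.int32), ((0, pk), (0, pn)))
    return (a_p @ b_p)[:M, :N]

a = np.random.randint(-5, 5, size=(2, 3)).astype(np.int8)
b = np.random.randint(-5, 5, size=(3, 4)).astype(np.int8)
assert np.array_equal(padded_dense(a, b), a.astype(np.int32) @ b.astype(np.int32))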

    ):
        # no need to pad
        return None

-   candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32)]
+   candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32), (4, 4, 4)]
Contributor Author

PTAL @AndrewZhaoLuo @masahi

In particular, this allows tighter padding boxes in order to enable the ONNX CUDA tests, where the shapes are on the order of (2, 3) x (3, 4); padding fails if the padding boxes are not densely populated enough with real data. I imagine one potential downside is that larger tensors might be computed faster in broad sweeps with 16x16x16 padding vs. the finer-grained 4x4x4, but I have no concrete evidence either way.

wdyt, are we ok with adding the 4x4x4 padding target?
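To make the trade-off concrete, a tiny illustration (the helper name is made up) of how much padding each candidate implies for the (2, 3) x (3, 4) case:

def pad_amounts(M, K, N, candidate):
    # Extra rows/cols needed to round (M, K, N) up to multiples of the candidate tile.
    cm, ck, cn = candidate
    return ((-M) % cm, (-K) % ck, (-N) % cn)

print(pad_amounts(2, 3, 4, (4, 4, 4)))     # (2, 1, 0)
print(pad_amounts(2, 3, 4, (16, 16, 16)))  # (14, 13, 12)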

Member

Can you try decoupling (4, 4, 4) padding from tensorcore stuff in this file? The shape (4, 4, 4) is invalid for tensorcore.

For int8, rather than hard-coding (4, 4, 4) (which doesn't work for other shapes like 13), I think the right solution is to "pad to the nearest multiple of 4 greater than the given dim".
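A one-line sketch of that "nearest multiple of 4" idea (the name is illustrative):

def round_up_to_multiple(dim, multiple=4):
    # 3 -> 4, 13 -> 16, 8 -> 8
    return ((dim + multiple - 1) // multiple) * multiple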

extra_flops_ratio = _extra_flops(M, K, N, dm, dk, dn) / (M * K * N)

if extra_flops_ratio > 2:
    return None
Contributor Author

@masahi I moved the 4x4x4 treatment to this block. Is this what you meant or did you want the 4x4x4 logic outside of tensorcore_alter_op.py entirely?

Member

Yes, I think this is good. I don't like this file being named tensorcore_alter_op.py, but that's a different story.

But I think this code block is only useful for int8, so we probably need an if dtype in ["int8", "uint8"] check? Also, I'm not clear on what happens when padding is necessary but the extra_flops_ratio > 2 check returns True.

Member

And please do not nest the if extra_flops_ratio > 2: check, since it's a bit weird.

Contributor Author

Sounds good.

Member

So I suggest something like this:

do_pad_for_int8 = ...

if extra_flops_ratio > 2 and not do_pad_for_int8:
    logger.info(...)
    return None
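One hedged way the elided parts might be filled in, written as a self-contained helper; the gating condition, names, and log message are assumptions, not the final code:

import logging

logger = logging.getLogger(__name__)

def should_skip_padding(extra_flops_ratio, dtype):
    # Hypothetical helper: always pad int8/uint8 workloads (the schedule
    # requires multiples of 4), otherwise bail out when padding costs too much.
    do_pad_for_int8 = dtype in ["int8", "uint8"]
    if extra_flops_ratio > 2 and not do_pad_for_int8:
        logger.info("extra_flops_ratio %s too large, not padding", extra_flops_ratio)
        return True
    return False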

Contributor Author

Refactored so that the return None comes after all of the dense logic, ordered slightly differently. PTAL.

# The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32)
# The shape of (M, K, N) must be multiple of
# (16, 16, 16) or (32, 16, 8) or (8, 16, 32)
# from https://arxiv.org/pdf/1811.09736.pdf
Member

Can you remove this link? This is a well-known HW fact.

@masahi masahi merged commit 597000c into apache:main Apr 13, 2022
AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Apr 15, 2022
* main: (527 commits)
  [hexagon] 'add_hvx' test to explore HVX usage. (apache#10604)
  [COMMUNITY] @yzh119 -> Reviewer (apache#10993)
  [Metaschedule] Make custom schedule_rule registration optional (apache#10975)
  [ONNX] Add imports for BERT contrib operators (apache#10949)
  sort axes (apache#10985)
  [Hexagon] Remove HexagonBuffer external constructor and support (apache#10978)
  [CI] Update GPU image (apache#10992)
  [Runtime][Vulkan] Add RGP support to TVM for vulkan device (apache#10953)
  [FIX] resolve int64/32 for AttrStmtNode (apache#10983)
  [TVMC] Allow output module name to be passed as a command line argument (apache#10962)
  [ONNX] Add MatMulInteger importer (apache#10450)
  [COMMUNITY] @guberti -> Reviewer (apache#10976)
  Support `qnn.conv2d` in FoldExplicitPading (apache#10982)
  change Hexagon docker version (apache#10981)
  remove exception handling of autotvm xgboost extract functions (apache#10948)
  [CUDNN] Add partitioning support for conv2d and log_softmax (apache#10961)
  [Hexagon][LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout (apache#10905)
  [Hexagon] Move aot/graph_executor interactions into launcher (apache#10907)
  [HEXAGON] Split huge 1D DMA Transfers into smaller transfers with legal sizes. (apache#10971)
  [CI][DOCKER] Add pytest-lazy-fixture to images (apache#10970)
  ...
Lucien0 pushed a commit to Lucien0/tvm that referenced this pull request Apr 19, 2022
* implement matmulinteger

* rm test

* rm outdated comments

* fix lint and review

* wip

* fixes

* fix

* alter tests

* extra 4x4x4 step

* comments
altanh pushed a commit to altanh/tvm that referenced this pull request Apr 28, 2022