[Topi] Cortex-M DSP support #9233

sergio-grovety · 2021-10-08T19:48:46Z

Implementation of TVM operations using Cortex-M DSP instructions

nn.conv2d

Added a gemm implementation for 16-bit input.
The restriction that shape[-1] must be a multiple of 4 has been removed.
The 8-bit intrinsic requires a data-preparation step that costs too much time for small tensors, so we added a check and a simple loop to handle that case.
Optimization: calculations were moved from inside the intrinsic to outside it.
One of the buffers was unused and has been removed, which reduces memory requirements by roughly half.
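
The handling of small tensors and of lengths that are not a multiple of 4 can be illustrated with a plain-Python sketch of a dot product. This is not the PR's C intrinsic: the names `dot_blocked` and `SMALL_TENSOR_THRESHOLD` and the threshold value are assumptions made for illustration only.

```python
SMALL_TENSOR_THRESHOLD = 16  # hypothetical cut-off, not taken from the PR


def dot_scalar(a, b):
    """Simple loop: one multiply-accumulate per element."""
    return sum(x * y for x, y in zip(a, b))


def dot_blocked(a, b, lanes=4):
    """Dot product in blocks of `lanes` elements with a scalar tail."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    # Small inputs skip the blocked path, where setup cost would dominate.
    if n < SMALL_TENSOR_THRESHOLD:
        return dot_scalar(a, b)
    main = n - n % lanes
    acc = 0
    for i in range(0, main, lanes):
        # Stands in for one packed multiply-accumulate step.
        acc += sum(a[i + k] * b[i + k] for k in range(lanes))
    # The tail covers lengths that are not a multiple of `lanes`.
    return acc + dot_scalar(a[main:], b[main:])
```

The tail loop is what lets the blocked path accept any length, at the cost of a few scalar operations per call.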

nn.max_pool2d

Implemented with the __SSUB8 and __SEL intrinsics, which process four 8-bit input values at a time and give a notable speedup.
Feature: supports input data that is not word-aligned.
Feature: supports data sizes that are not a multiple of 4.
memset is used to initialize the minimum values for maximum speed.
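
As a rough model of how __SSUB8 and __SEL combine into a four-lane maximum, the following plain-Python sketch emulates the per-byte GE flags and the byte select. It only models the instruction semantics and is not the C code from this PR; all function names are hypothetical.

```python
import struct

INT8_MIN = -128


def pack_i8x4(lanes):
    """Pack four signed 8-bit values into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<4b", *lanes))[0]


def unpack_i8x4(word):
    """Split a 32-bit word into four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def ssub8_ge(a, b):
    """GE flags as set by SSUB8: lane i is set when a[i] - b[i] >= 0."""
    return [x - y >= 0 for x, y in zip(unpack_i8x4(a), unpack_i8x4(b))]


def sel(a, b, ge):
    """SEL: take the lane from a where its GE flag is set, otherwise from b."""
    lanes = [x if g else y for x, y, g in zip(unpack_i8x4(a), unpack_i8x4(b), ge)]
    return pack_i8x4(lanes)


def max_i8x4(a, b):
    """Lane-wise maximum of two packed words."""
    return sel(a, b, ssub8_ge(a, b))


def max_rows(rows):
    """Element-wise maximum over equally long rows of int8 values."""
    width = len(rows[0])
    out = [INT8_MIN] * width  # mirrors initializing the output to the minimum
    main = width - width % 4
    for row in rows:
        for i in range(0, main, 4):
            word = max_i8x4(pack_i8x4(out[i:i + 4]), pack_i8x4(row[i:i + 4]))
            out[i:i + 4] = unpack_i8x4(word)
        for i in range(main, width):  # scalar tail for widths not divisible by 4
            out[i] = max(out[i], row[i])
    return out
```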

nn.avg_pool2d

Because there is no intrinsic that sums four 8-bit values, the implementation is only possible for 16-bit data.
The __SMLAD intrinsic is used to process two 16-bit values at a time.
Feature: supports input data that is not word-aligned.
Feature: supports data sizes that are not a multiple of 4.
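
One way to see how __SMLAD can sum two 16-bit values per step is to multiply each pair by a packed pair of ones. The plain-Python model below is a sketch under that assumption and is not the C code from this PR; the helper names are hypothetical, and the model does not reproduce the 32-bit accumulator overflow behaviour of the real instruction.

```python
def to_i16(value):
    """Interpret the low 16 bits of value as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack_i16x2(lo, hi):
    """Pack two signed 16-bit values into one 32-bit word."""
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)


def smlad(x, y, acc):
    """Model of SMLAD: acc + x.lo * y.lo + x.hi * y.hi on signed halves."""
    return acc + to_i16(x) * to_i16(y) + to_i16(x >> 16) * to_i16(y >> 16)


ONES = pack_i16x2(1, 1)


def sum_i16(values):
    """Sum 16-bit values two at a time, with a scalar step for an odd tail."""
    acc = 0
    main = len(values) - len(values) % 2
    for i in range(0, main, 2):
        acc = smlad(pack_i16x2(values[i], values[i + 1]), ONES, acc)
    if main < len(values):
        acc += values[-1]
    return acc
```

An average pool would then divide the accumulated sum by the window size.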

nn.dense

Implemented with the same gemm method described above.

nn.conv1d

A special case of gemm usage, with one of the data dimensions equal to 1.

nn.avg_pool1d

Implemented for the NCW layout with the same intrinsic as the 2d version of the operation.

nn.max_pool1d

Implemented with the same intrinsics as the 2d version of the operation.

Benchmarking:

To enable intrinsic code generation, specify the -mcpu=cortex-m7 flag.
HW platform: STM32F746 Nucleo; compiler: GCC 10; optimization flags: -O3.
To enable the intrinsics, specify the -march parameter of the target:
target_str = f"c -keys=arm_cpu -mcpu=cortex-m7 -march=armv7e-m -model=stm32f746xx -runtime=c -link-params=1 --executor=aot --unpacked-api=1 --interface-api=c"
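
For comparing the two benchmark configurations, a small helper can build the same target string with or without -march. This is only a sketch: `make_target_str` is a hypothetical name, and the option list is copied from the string above.

```python
def make_target_str(march=None):
    """Build the benchmark target string; march=None leaves -march out."""
    options = ["c", "-keys=arm_cpu", "-mcpu=cortex-m7"]
    if march is not None:
        options.append(f"-march={march}")
    options += [
        "-model=stm32f746xx",
        "-runtime=c",
        "-link-params=1",
        "--executor=aot",
        "--unpacked-api=1",
        "--interface-api=c",
    ]
    return " ".join(options)


intrinsic_target = make_target_str("armv7e-m")  # intrinsics enabled
baseline_target = make_target_str()             # no -march, no intrinsic code
```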

Results

Model	No intrinsic (ms)	Intrinsic enabled (ms)
mnist8	8.625	6.574
cifar10	788.36	144.59

No intrinsic: the march parameter is not specified, so no intrinsic code is generated.
Intrinsic enabled: march=armv7e-m.
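
The speedup implied by these timings can be worked out directly. The snippet below only restates the numbers from the results table; the variable and function names are illustrative.

```python
# Timings in milliseconds: (no intrinsic, intrinsic enabled).
results_ms = {
    "mnist8": (8.625, 6.574),
    "cifar10": (788.36, 144.59),
}


def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline time to optimized time."""
    return baseline_ms / optimized_ms


for model, (baseline, optimized) in results_ms.items():
    print(f"{model}: {speedup(baseline, optimized):.2f}x")
# prints:
# mnist8: 1.31x
# cifar10: 5.45x
```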

areusch · 2021-10-18T18:06:50Z

@u99127 would you like to take a look? also, do you have any suggestions as to how to implement the requires_corstone300 primitive? a simple implementation could be to add a pytest flag --run-corstone300-tests=/path/to/opt/arm/ethosu but am curious if there is a better way. the flag method requires us to make a Jenkinsfile change to differentiate between ci-i386 (with no Corstone 300) and ci-cpu (with Corstone 300).

u99127

Firstly thank you for this and this is pretty interesting as it adds support for the DSP extensions in Cortex-M to improve native TVM schedules. Apologies also for the time it has taken to review this from our side.

I went through this today with @Mousius and while I'm out next week I'm happy for you folks to iterate through this.

I'm pretty cool with the implementation and trust that the tests have taken care of all the operators and that things work and it is suitable. We haven't had the time to review the individual operator implementations yet.

Some top level comments.

Could we fix the commit message summary to refer to adding support for the DSP extensions in Cortex-M? The instructions are implemented in the Cortex-M7 but are also available in other CPUs, so by using them we should be able to get the same benefit elsewhere.
Are the numbers published here from running this on Cortex-M7 silicon or on the FVP? The FVP allows functional testing of these schedules, and we should do that to make sure they remain correct. I don't expect the performance numbers that come out of the FVP to be meaningful.
The directory structure in the schedules should reflect that this is about the DSP extensions and not anything Cortex-M7 specific, so I would suggest something like arm_cpu/mprofile/dsp.
This is a massive pull request that mixes many things together, and I would certainly recommend breaking it into more manageable chunks to help review and get it into the code base.

I'll put most of the other comments individually below where I'd like some clarifications and some changes in this.

Ramana Radhakrishnan

u99127 · 2021-10-21T21:22:36Z

python/tvm/relay/op/strategy/arm_cpu.py

+        if (
+            avg_pool
+            and layout in ("NCW", "NCHW")
+            and "SMLAD" in isa


SMLAD, SSUB8 and SEL are part of the DSP instructions and the presence of one implies the presence of the other. I also think that in this case since we are adding all of these together globbing them into a single check for the use of the DSP extensions should be sufficient. Any reason why we are testing individual instructions ?

i agree with you that we should refactor this. this was left over from the initial implementation which did propose to test for presence of instructions in the ISA; however, you're right that we should just need to determine which architecture is in use. since this PR just adds additional schedules which are purported to be compatible with cortex-m7 devices, perhaps we can address the question of lookup-by-architecture in a follow-on.
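
A single grouped check for the DSP instructions, as opposed to testing each instruction separately, might look like the sketch below. `has_dsp` and `select_avg_pool_impl` are hypothetical names, and the returned strings are placeholders rather than TVM strategy names.

```python
# The three instructions the new schedules rely on, taken from ARM_ISA_MAP.
DSP_INSTRUCTIONS = frozenset({"SMLAD", "SSUB8", "SEL"})


def has_dsp(isa):
    """True when every DSP instruction used by the schedules is available."""
    return DSP_INSTRUCTIONS.issubset(isa)


def select_avg_pool_impl(layout, isa):
    """Pick the DSP path only for supported layouts on a DSP-capable ISA."""
    if layout in ("NCW", "NCHW") and has_dsp(isa):
        return "dsp"
    return "generic"
```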

u99127 · 2021-10-21T21:23:59Z

python/tvm/target/arm_isa.py


 ARM_ISA_MAP = {
-    "armv7e-m": ["SMLAD"],
+    "armv7e-m": ["SMLAD", "SSUB8", "SEL"],


armv7e-m : DSP ?

@u99127 as discussed, let's punt the architecture labelling to the next PR.

u99127 · 2021-10-21T21:25:02Z

python/tvm/target/arm_isa.py


 ARM_ISA_MAP = {
-    "armv7e-m": ["SMLAD"],
+    "armv7e-m": ["SMLAD", "SSUB8", "SEL"],
+    "armv8-m": ["SMLAD", "SSUB8", "SEL"],


I think what you want is armv8-m.main.

same thing here

u99127 · 2021-10-21T21:25:40Z

python/tvm/testing/utils.py

+    f : function
+        Function to mark
+    """
+    _requires_corstone300 = [pytest.mark.corstone300]


I think we need a better way of controlling this - possibly something @Mousius could comment on here ?

u99127 · 2021-10-21T21:27:43Z

python/tvm/topi/arm_cpu/cortex_m7/conv1d/direct_simd.py

+
+
+def conv1d_nwc_direct_simd(*args, **kwargs):
+    """Defines the Cortex-M7 SIMD implementation of conv1d on NWC layout."""


I think this could well work in general for Armv7em and Armv8m.main and indeed any Cortex-M CPU that implements the DSP instruction set. The biggest win that one would get is in the use of these instructions rather than anything micro-architectural here.

Thus I would suggest trying to model this properly in terms of the ISA .

@Mousius would you have some time to take a look at this ?

yeah i'm up for moving away from isa_analyzer. that was just an initial stab, but i agree that modeling this at the level of architecture rather than instruction makes more sense. however, i think it would be good to do that in a follow-on PR. this PR could then move forward isolated to what was tested already (on STM32F746 nucleo, I believe), and a follow-on could expand support to the broader architecture. what do you think of this?

u99127 · 2021-10-22T13:42:13Z

python/tvm/topi/arm_cpu/cortex_m7/conv1d/__init__.py

+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Conv1d implementations for cortex-m7."""


If these folders exist already probably should be renamed as armv7em/dsp.

Maybe that move is a separate pull request rather than being merged in here.

agree with this, perhaps we do this with the follow-on to expand to architecture?

u99127 · 2021-10-22T13:43:37Z

python/tvm/topi/arm_cpu/cortex_m7/conv1d/direct_simd.py

+
+
+def conv1d_nwc_direct_simd_schedule(cfg, outs):
+    """Schedule function for Cortex-M7 SIMD implementation of conv1d on NWC layout."""


Fix comments to reflect that these are using the DSP extensions rather than SIMD.

Perhaps say :

"Schedule function for v7em DSP instructions of conv1d on NWC layout"

Please audit the whole file for such usage and fix it everywhere.

u99127 · 2021-10-22T13:46:41Z

python/tvm/topi/arm_cpu/dense.py

+# pylint: disable=invalid-name, unused-variable, no-else-return, unused-argument, import-outside-toplevel
+"""Dense schedule for ARM CPU"""
+
+from .cortex_m7.dense import direct_simd


what would happen with this on AArch64 ? Since these schedules are available on both AArch64 and AArch32 ?

This schedule is only for v7e-m strategy (check on isa). On AArch64 "dense.generic" strategy will be chosen.

i think it makes sense then to not import the cortex_m7 direct_simd into this module. can we reorganize as discussed in the earlier thread?

u99127 · 2021-10-22T13:49:25Z

tests/python/integration/test_m7_simd.py

+        use_unpacked_api=True,
+        target_opts={
+            "-keys": "arm_cpu",
+            "-march": "armv7e-m",


I'm a bit confused with the use of -march=armv7e-m and not -mcpu here

agreed--i think -mcpu was used to key the IsaAnalyzer, correct?

areusch · 2021-10-28T18:29:14Z

discussed a bit with @grant-arm and he and @Mousius @manupa-arm will propose a solution to detecting the Corstone-300 FVP binary and ETHOSU_PATH, which are the two things we need to configure based on corstone300.mk.

Mousius

GitHub has helpfully made it difficult to manage the comments in these threads, so I'll place this reply here. Let me know if I missed a specific thread somewhere.

yeah i'm up for moving away from isa_analyzer. that was just an initial stab, but i agree that modeling this at the level of architecture rather than instruction makes more sense. however, i think it would be good to do that in a follow-on PR. this PR could then move forward isolated to what was tested already (on STM32F746 nucleo, I believe), and a follow-on could expand support to the broader architecture. what do you think of this?

I'm fine with splitting out the isa_analyzer into a follow up, it should be trivial to map it to the extensions. However, I'd still advocate for splitting the remaining work into smaller patches which can be reviewed independently. There's a conflation of testing updates, boilerplate for the operators and the operators themselves which we can reason about per operator perhaps?

Also, please update the commit message on this pull request to reflect the DSP extension 😸

Mousius · 2021-10-29T08:53:24Z

python/tvm/topi/arm_cpu/conv1d.py

+
+
+@autotvm.register_topi_compute("conv1d_nwc_direct_simd.arm_cpu")
+def conv1d_nwc_direct_simd(cfg, data, kernel, strides, padding, dilation, out_dtype):


I'd suggested a file structure such as:

arm_cpu/mprofile/dsp/conv1d.py

This leaves room to add other architecture extensions in future rather than stacking them all in one directory.

Mousius · 2021-10-29T15:25:41Z

python/tvm/relay/qnn/op/legalizations.py

@@ -374,6 +374,8 @@ def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
        attrs["kernel_layout"],
        attrs["groups"],
    )
+
+    # Use int8 for Cortex-M7


This is not limited to this CPU?

@sergey-grovety can you revert this comment or fix the set of CPUs indicated?

areusch

thanks @sergey-grovety! i went through the remaining comments. i think given @mehrdadh merged the i386only PR, we can now proceed to merge this once the remaining comments are addressed.

as @Mousius said, let's punt the IsaAnalyzer changes to another PR. however, let's make the organizational changes now so as to avoid confusion in the placement of schedules in topi.

areusch · 2021-11-03T16:43:44Z

python/tvm/target/arm_isa.py


 ARM_ISA_MAP = {
-    "armv7e-m": ["SMLAD"],
+    "armv7e-m": ["SMLAD", "SSUB8", "SEL"],


@u99127 as discussed, let's punt the architecture labelling to the next PR.

areusch · 2021-11-03T16:43:51Z

python/tvm/target/arm_isa.py


 ARM_ISA_MAP = {
-    "armv7e-m": ["SMLAD"],
+    "armv7e-m": ["SMLAD", "SSUB8", "SEL"],
+    "armv8-m": ["SMLAD", "SSUB8", "SEL"],


same thing here

areusch · 2021-11-03T16:45:35Z

python/tvm/target/arm_isa.py

-        # TODO: actually parse -mcpu
-        arch = "armv7e-m"
-        self._isa_map = ARM_ISA_MAP[arch]
+        parser = argparse.ArgumentParser()


you should use the built-in Target parsing logic here rather than argparse:

Suggested change

-        parser = argparse.ArgumentParser()
+        target = tvm.target.Target(target)
+        march = target.attrs.get("-march", None)
+        self._isa_map = ARM_ISA_MAP[march] if march is not None else []

(also need to delete the following lines 33-36--suggestion didn't quite get the diff)
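
The lookup in this suggestion can be tried out without TVM by passing the target attributes as a plain dict. This is a minimal stand-in, not the real IsaAnalyzer: the class name `IsaAnalyzerSketch` is hypothetical and the "-march" key simply follows the suggestion above.

```python
ARM_ISA_MAP = {
    "armv7e-m": ["SMLAD", "SSUB8", "SEL"],
}


class IsaAnalyzerSketch:
    """Stand-in that resolves the instruction list from a -march attribute."""

    def __init__(self, target_attrs):
        march = target_attrs.get("-march", None)
        # dict.get also tolerates an architecture missing from the map,
        # where ARM_ISA_MAP[march] would raise KeyError.
        self._isa_map = ARM_ISA_MAP.get(march, [])

    def __contains__(self, instruction):
        return instruction in self._isa_map


isa = IsaAnalyzerSketch({"-march": "armv7e-m"})
use_dsp_schedule = "SMLAD" in isa
```

Using `ARM_ISA_MAP.get` rather than indexing is a small deviation from the suggestion: it makes an unlisted architecture behave like a missing -march instead of failing.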

areusch · 2021-11-03T23:16:28Z

tests/python/conftest.py

+        if config.getoption("--enable-corstone300-tests"):
+            if not "corstone300" in item.keywords:
+                item.add_marker(
+                    pytest.mark.skip(reason="Test should be marked 'corstone300' to run")


i think we just need one skip, right? doesn't this skip all other tests aside from corstone300?
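
A conftest.py with a single skip, where only tests marked corstone300 are skipped unless the flag is passed, could be sketched as follows. This is one reading of the comment above, not the code that was merged; `skip_reason` is a hypothetical helper and the reason text is illustrative.

```python
def skip_reason(keywords, corstone300_enabled):
    """Return a skip reason for a test, or None when it should run."""
    if "corstone300" in keywords and not corstone300_enabled:
        return "Need --enable-corstone300-tests option to run"
    return None


def pytest_addoption(parser):
    parser.addoption(
        "--enable-corstone300-tests",
        action="store_true",
        default=False,
        help="Run tests marked corstone300",
    )


def pytest_collection_modifyitems(config, items):
    # Imported here so skip_reason can be exercised without pytest installed.
    import pytest

    enabled = config.getoption("--enable-corstone300-tests")
    for item in items:
        reason = skip_reason(item.keywords, enabled)
        if reason is not None:
            item.add_marker(pytest.mark.skip(reason=reason))
```

With this shape, tests without the marker always run, so passing the flag does not skip anything else.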

areusch · 2021-11-03T23:22:59Z

python/tvm/relay/op/strategy/arm_cpu.py

+        if (
+            avg_pool
+            and layout in ("NCW", "NCHW")
+            and "SMLAD" in isa


i agree with you that we should refactor this. this was left over from the initial implementation which did propose to test for presence of instructions in the ISA; however, you're right that we should just need to determine which architecture is in use. since this PR just adds additional schedules which are purported to be compatible with cortex-m7 devices, perhaps we can address the question of lookup-by-architecture in a follow-on.

areusch · 2021-11-03T23:23:46Z

python/tvm/relay/qnn/op/legalizations.py

@@ -374,6 +374,8 @@ def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
        attrs["kernel_layout"],
        attrs["groups"],
    )
+
+    # Use int8 for Cortex-M7


@sergey-grovety can you revert this comment or fix the set of CPUs indicated?

areusch · 2021-11-03T23:24:43Z

python/tvm/topi/arm_cpu/conv1d.py

+
+
+@autotvm.register_topi_compute("conv1d_nwc_direct_simd.arm_cpu")
+def conv1d_nwc_direct_simd(cfg, data, kernel, strides, padding, dilation, out_dtype):


mprofile seems good to me.

areusch · 2021-11-03T23:27:11Z

python/tvm/topi/arm_cpu/dense.py

+# pylint: disable=invalid-name, unused-variable, no-else-return, unused-argument, import-outside-toplevel
+"""Dense schedule for ARM CPU"""
+
+from .cortex_m7.dense import direct_simd


i think it makes sense then to not import the cortex_m7 direct_simd into this module. can we reorganize as discussed in the earlier thread?

areusch · 2021-11-03T23:27:43Z

tests/python/integration/test_m7_simd.py

+        use_unpacked_api=True,
+        target_opts={
+            "-keys": "arm_cpu",
+            "-march": "armv7e-m",


agreed--i think -mcpu was used to key the IsaAnalyzer, correct?

areusch · 2021-11-09T23:57:22Z

looks like we are busted at head, #9480 is the fix

areusch

thanks @sergey-grovety !

Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev> Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev> Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev> Co-authored-by: German Tretiakov <german.tretiakov@mir.dev> Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev> Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev> Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>

Sergey Smirnov and others added 30 commits August 23, 2021 10:37

Added scripts to run simple model

ee5c4bf

moved common functions to separate file

6800b8d

added convinient funcs for working with models

30169d0

fixed AOT & project template to grovety folder

a34ddcb

added conv2d test: using of tensorflow and merging of results

5560645

Added new interface to the MCU firmware (AOT)

3fe5aa8

added mnist model skeleton

ae4d9e3

Fixed opening model for sine_zephyr

5049d86

Addedd mnist model

0e24218

Added full support for mnist model

407b8ff

model conversion hack added

f853977

Arm7m C code optimization integrated. Removed %4 requirement for chan…

ae4c851

…nel.

Merge remote-tracking branch 'mir/PRJ1445-1-conv2d' into grovety

0493721

added support for quantized model

a892bca

Update gemm intrinsic (python\tvm\topi\arm_cpu\cortex_m7\micro_kernel…

d1a33e0

…\gemm.py). Add 16bit variant. Turn off AutoTVM tuning for test (python\tvm\topi\arm_cpu\cortex_m7\conv2d\direct_simd.py).

Fix for qnn.conv2d legalized to int16

ecf7bb2

Use int8 for cortex-m7 until universal gemm implementation for int8/int16

Switch between gemm8 and gemm16.

0ceb612

Merge remote-tracking branch 'mir/Auto_gemm8_gemm16' into PRJ1445-3-a…

5f257ca

…rm-schedules

Format fixes, restores asserts for SMLAD usage

4e8f56e

Added CIFAR10 model

34b1b6f

Great refactoring. All the modeles structured to run in the same way

a81bf16

Try to remove arm_math.h and arm_nnsupportfunctions.h files from C code.

32364a7

Fix: double {{ and }}

820969c

Add max_pool intrinsic (only C code).

e6ac0d7

Merge remote-tracking branch 'mir/Auto_gemm8_gemm16' into grovety

dce7874

Add relu intrinsic (only C code).

4327ce2

Add base of avg_pool intrinsic (fast sum of 16-bit array - only C code).

added { } to follow codestyle

176d2c4

Added intrinsics enabling depending on the target

c19d943

removed useless int8-int16 conversion

1860689

added disable optimization flag

0f01b71

Alex-grovety and others added 2 commits October 18, 2021 11:51

change check for avg_pool

3d577ca

Merge branch 'PR2-preview' into cortex-m7-intrinsic

9c403fe

u99127 suggested changes Oct 22, 2021

ilyag-grovety added 2 commits October 27, 2021 14:37

Renaming of "Cortex-M7 SIMD" in commens to DSP

fb17329

* removed odd "micro_dev" * fixes for instrinsics' functions comments

Merge branch 'main' into cortex-m7-intrinsic

4803e19

mehrdadh mentioned this pull request Oct 28, 2021

[CI] Add TVM_INTEGRATION_I386_ONLY for Integration Test on i386 #9388

Merged

Mousius requested changes Oct 29, 2021

Sergey Smirnov added 3 commits November 1, 2021 10:05

Merge remote-tracking branch 'origin/main' into cortex-m7-intrinsic

975aff4

Disable corstone tests for i386 run

6cec82c

Merge remote-tracking branch 'origin/main' into cortex-m7-intrinsic

a593538

areusch reviewed Nov 3, 2021

methods renamed direct_simd -> dsp

bc2b4ec

files renamed cortex_m7 -> mprofile/dsp

sergio-grovety requested a review from icemelon as a code owner November 9, 2021 09:37

sergio-grovety changed the title from "Cortex m7 intrinsic" to "[Topi] Cortex-M DSP support" Nov 9, 2021

Sergey Smirnov added 2 commits November 9, 2021 12:57

Fixed linter warnings

be7078a

Fixed test name

9d25cf6

Fixed IsaAnalyzer mcpu detection

Merge commit '74accec52e41418d796b6699991c9136993b129e' into cortex-m…

1407614

…7-intrinsic # Conflicts: # tests/python/conftest.py

areusch approved these changes Nov 15, 2021

areusch merged commit 76c78a9 into apache:main Nov 15, 2021

mehrdadh mentioned this pull request Nov 15, 2021

[Bug] conv2d_nhwc_direct_simd.arm_cpu schedule has incorrect output with certain workloads #9226

Closed

driazati mentioned this pull request Jul 14, 2022

TVM v0.9.0.rc0 Release Candidate Notes #12102

Closed

sergio-grovety deleted the cortex-m7-intrinsic branch October 26, 2022 09:20


