Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Topi] Cortex-M DSP support #9233

Merged
merged 177 commits into from
Nov 15, 2021
Merged

Conversation

sergio-grovety
Copy link
Contributor

@sergio-grovety sergio-grovety commented Oct 8, 2021

TVM operations implementation using Cortex-M DSP instructions

nn.conv2d

  • We added the implementation of gemm function with 16-bit input
  • shape[-1] multiple of 4 restriction is resolved
  • There is data preparing before 8-bit intrinsic, such preparation will consume too much time in case of small tensor, so we add a check and simple cycle to handle this specific situation
  • In terms of optimization - calculations moved from inside of the intrinsic to outside
  • One of the buffers was radically cut(wasn't in use), lead to reducing memory requirements mostly in a half

nn.max_pool2d

  • Implemented with __SSUB8 and __SEL intrinsics for four 8-bit input values, which is lead to notable acceleration
  • Feature: implementation ready for not 1word-aligned input data
  • Feature: ready for data sizes not a multiple of 4
  • memset is used to initialize the minimum values, to provide max speed

nn.avg_pool2d

  • Due to lack of sum of four 8-bit values intrinsic - implementation could be possible only for 16-bit data
  • __SMLAD intrinsic used to process two 16-bit values
  • Feature: implementation ready for not 1word-aligned input data
  • Feature: ready for data sizes not a multiple of 4

nn.dense

Implemented with same gemm method, described above

nn.conv1d

Specific case of gemm usage - with one of data dimensions equal to 1

nn.avg_pool1d

Implemented for NCW layout with same intrinsic as for 2d version of operation

nn.max_pool1d

Implemented with same intrinsic as for 2d version of operation

Benchmarking:

To enable intrinsic code generation you should specify -mcpu=cortex-m7 flag
HW platform: STM32F746 Nucleo; GCC10, optimization flags: -O3
If you want to enable intrinsic you should specify -march parameter of the target:
target_str = f"c -keys=arm_cpu -mcpu=cortex-m7 -march=armv7e-m -model=stm32f746xx -runtime=c -link-params=1 --executor=aot --unpacked-api=1 --interface-api=c"

Results

ms No Intrinsic Intrinsic enabled
mnist8 8.625 6.574
cifar10 788.36 144.59
  • No intrinsic: march parameter not specified, no code generated for Intrinsic
  • Intrinsic enabled: march=armv7e-m

Sergey Smirnov and others added 30 commits August 23, 2021 10:37
…\gemm.py).

Add 16bit variant.

Turn off AutoTVM tuning for test (python\tvm\topi\arm_cpu\cortex_m7\conv2d\direct_simd.py).
Use int8 for cortex-m7 until universal gemm implementation for int8/int16
Add base of avg_pool intrinsic (fast sum of 16-bit array - only C code).
@areusch
Copy link
Contributor

areusch commented Oct 18, 2021

@u99127 would you like to take a look? also, do you have any suggestions as to how to implement the requires_corstone300 primitive? a simple implementation could be to add a pytest flag --run-corstone300-tests=/path/to/opt/arm/ethosu but am curious if there is a better way. the flag method requires us to make a Jenkinsfile change to differentiate between ci-i386 (with no Corstone 300) and ci-cpu (with Corstone 300).

Copy link
Contributor

@u99127 u99127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly thank you for this and this is pretty interesting as it adds support for the DSP extensions in Cortex-M to improve native TVM schedules. Apologies also for the time it has taken to review this from our side.

I went through this today with @Mousius and while I'm out next week I'm happy for you folks to iterate through this.

I'm pretty cool with the implementation and trust that the tests have taken care of all the operators and that things work and it is suitable. We haven't had the time to review the individual operator implementations yet.

Some top level comments.

  1. Could we fix the commit message summary to refer to adding support for the DSP extensions in Cortex-M, the instructions are implemented in the Cortex-M7 but are also available in other CPUs as well and thus by the use of these we should be able to get them elsewhere.

  2. Are the numbers published here are they from running this on Cortex-M7 silicon or on the FVP ? The usage of the FVP allows functional testing of these schedules and we should do that to make sure that these remain correct. I don't expect the performance numbers that come out of the FVP to be meaningful.

  3. Directory structure in the schedules should reflect that this is about DSP extensions and not anything Cortex-M7 specific. So I would suggest something like arm_cpu/mprofile/dsp as the directory structure.

  4. This is a massive Pull request and mixes many things together and I would certainly recommend that this is broken into more manageable chunks to help review and get into the code base.

I'll put most of the other comments individually below where I'd like some clarifications and some changes in this.


Ramana Radhakrishnan

if (
avg_pool
and layout in ("NCW", "NCHW")
and "SMLAD" in isa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SMLAD, SSUB8 and SEL are part of the DSP instructions and the presence of one implies the presence of the other. I also think that in this case since we are adding all of these together globbing them into a single check for the use of the DSP extensions should be sufficient. Any reason why we are testing individual instructions ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with you that we should refactor this. this was left over from the initial implementation which did propose to test for presence of instructions in the ISA; however, you're right that we should just need to determine which architecture is in use. since this PR just adds additional schedules which are purported to be compatible with cortex-m7 devices, perhaps we can address the question of lookup-by-architecture in a follow-on.


ARM_ISA_MAP = {
"armv7e-m": ["SMLAD"],
"armv7e-m": ["SMLAD", "SSUB8", "SEL"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

armv7e-m : DSP ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@u99127 as discussed, let's punt the architecture labelling to the next PR.


ARM_ISA_MAP = {
"armv7e-m": ["SMLAD"],
"armv7e-m": ["SMLAD", "SSUB8", "SEL"],
"armv8-m": ["SMLAD", "SSUB8", "SEL"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you want is armv8-m.main.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here

f : function
Function to mark
"""
_requires_corstone300 = [pytest.mark.corstone300]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a better way of controlling this - possibly something @Mousius could comment on here ?



def conv1d_nwc_direct_simd(*args, **kwargs):
"""Defines the Cortex-M7 SIMD implementation of conv1d on NWC layout."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could well work in general for Armv7em and Armv8m.main and indeed any Cortex-M CPU that implements the DSP instruction set. The biggest win that one would get is in the use of these instructions rather than anything micro-architectural here.

Thus I would suggest trying to model this properly in terms of the ISA .

@Mousius would you have some time to take a look at this ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i'm up for moving away from isa_analyzer. that was just an initial stab, but i agree that modeling this at the level of architecture rather than instruction makes more sense. however, i think it would be good to do that in a follow-on PR. this PR could then move forward isolated to what was tested already (on STM32F746 nucleo, I believe), and a follow-on could expand support to the broader architecture. what do you think of this?

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""Conv1d implementations for cortex-m7."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these folders exist already probably should be renamed as armv7em/dsp.

Maybe that move is a separate pull request rather than being merged in here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with this, perhaps we do this with the follow-on to expand to architecture?



def conv1d_nwc_direct_simd_schedule(cfg, outs):
"""Schedule function for Cortex-M7 SIMD implementation of conv1d on NWC layout."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix comments to reflect that these are using the DSP extensions rather than SIMD.

Perhaps say :

"Schedule function for v7em DSP instructions of conv1d on NWC layout"

Please audit the whole file for such usage and fix it everywhere.

# pylint: disable=invalid-name, unused-variable, no-else-return, unused-argument, import-outside-toplevel
"""Dense schedule for ARM CPU"""

from .cortex_m7.dense import direct_simd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen with this on AArch64 ? Since these schedules are available on both AArch64 and AArch32 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This schedule is only for v7e-m strategy (check on isa). On AArch64 "dense.generic" strategy will be chosen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it makes sense then to not import the cortex_m7 direct_simd into this module. can we reorganize as discussed in the earlier thread?

tests/python/conftest.py Outdated Show resolved Hide resolved
use_unpacked_api=True,
target_opts={
"-keys": "arm_cpu",
"-march": "armv7e-m",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused with the use of -march=armv7e-m and not -mcpu here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed--i think -mcpu was used to key the IsaAnalyzer, correct?

@areusch
Copy link
Contributor

areusch commented Oct 28, 2021

discussed a bit with @grant-arm and he and @Mousius @manupa-arm will propose a solution to detecting the Corstone-300 FVP binary and ETHOSU_PATH, which are the two things we need to configure based on corstone300.mk.

Copy link
Member

@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub has helpfully made it difficult to manage the comments in these threads, so I'll place this reply here. Let me know if I missed a specific thread somewhere.

yeah i'm up for moving away from isa_analyzer. that was just an initial stab, but i agree that modeling this at the level of architecture rather than instruction makes more sense. however, i think it would be good to do that in a follow-on PR. this PR could then move forward isolated to what was tested already (on STM32F746 nucleo, I believe), and a follow-on could expand support to the broader architecture. what do you think of this?

I'm fine with splitting out the isa_analyzer into a follow up, it should be trivial to map it to the extensions. However, I'd still advocate for splitting the remaining work into smaller patches which can be reviewed independently. There's a conflation of testing updates, boilerplate for the operators and the operators themselves which we can reason about per operator perhaps?

Also, please update the commit message on this pull request to reflect the DSP extension 😸



@autotvm.register_topi_compute("conv1d_nwc_direct_simd.arm_cpu")
def conv1d_nwc_direct_simd(cfg, data, kernel, strides, padding, dilation, out_dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggested a file structure such as:

arm_cpu/mprofile/dsp/conv1d.py

This leaves room to add other architecture extensions in future rather than stacking them all in one directory.

@@ -374,6 +374,8 @@ def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
attrs["kernel_layout"],
attrs["groups"],
)

# Use int8 for Cortex-M7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not limited to this CPU?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sergey-grovety can you revert this comment or fix the set of CPUs indicated?

Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sergey-grovety! i went through the remaining comments. i think given @mehrdadh merged the i386only PR, we can now proceed to merge this once the remaining comments are addressed.

as @Mousius said, let's punt the IsaAnalyzer changes to another PR. however, let's make the organizational changes now so as to avoid confusion in the placement of schedules in topi.


ARM_ISA_MAP = {
"armv7e-m": ["SMLAD"],
"armv7e-m": ["SMLAD", "SSUB8", "SEL"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@u99127 as discussed, let's punt the architecture labelling to the next PR.


ARM_ISA_MAP = {
"armv7e-m": ["SMLAD"],
"armv7e-m": ["SMLAD", "SSUB8", "SEL"],
"armv8-m": ["SMLAD", "SSUB8", "SEL"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here

# TODO: actually parse -mcpu
arch = "armv7e-m"
self._isa_map = ARM_ISA_MAP[arch]
parser = argparse.ArgumentParser()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should use the built-in Target parsing logic here rather than argparse:

Suggested change
parser = argparse.ArgumentParser()
target = tvm.target.Target(target)
march = target.attrs.get("-march", None)
self._isa_map = ARM_ISA_MAP[march] if march is not None else []

(also need to delete the following lines 33-36--suggestion didn't quite get the diff)

if config.getoption("--enable-corstone300-tests"):
if not "corstone300" in item.keywords:
item.add_marker(
pytest.mark.skip(reason="Test should be marked 'corstone300' to run")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we just need one skip, right? doesn't this skip all other tests aside from corstone300?

if (
avg_pool
and layout in ("NCW", "NCHW")
and "SMLAD" in isa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with you that we should refactor this. this was left over from the initial implementation which did propose to test for presence of instructions in the ISA; however, you're right that we should just need to determine which architecture is in use. since this PR just adds additional schedules which are purported to be compatible with cortex-m7 devices, perhaps we can address the question of lookup-by-architecture in a follow-on.

@@ -374,6 +374,8 @@ def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
attrs["kernel_layout"],
attrs["groups"],
)

# Use int8 for Cortex-M7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sergey-grovety can you revert this comment or fix the set of CPUs indicated?



@autotvm.register_topi_compute("conv1d_nwc_direct_simd.arm_cpu")
def conv1d_nwc_direct_simd(cfg, data, kernel, strides, padding, dilation, out_dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mprofile seems good to me.

# pylint: disable=invalid-name, unused-variable, no-else-return, unused-argument, import-outside-toplevel
"""Dense schedule for ARM CPU"""

from .cortex_m7.dense import direct_simd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it makes sense then to not import the cortex_m7 direct_simd into this module. can we reorganize as discussed in the earlier thread?

use_unpacked_api=True,
target_opts={
"-keys": "arm_cpu",
"-march": "armv7e-m",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed--i think -mcpu was used to key the IsaAnalyzer, correct?

files renamed cortex_m7 -> mprofile/dsp
@sergio-grovety sergio-grovety changed the title Cortex m7 intrinsic [Topi] Cortex-M DSP support Nov 9, 2021
Sergey Smirnov added 2 commits November 9, 2021 12:57
Fixed IsaAnalyzer mcpu detection
@areusch
Copy link
Contributor

areusch commented Nov 9, 2021

looks like we are busted at head, #9480 is the fix

…7-intrinsic

# Conflicts:
#	tests/python/conftest.py
Copy link
Contributor

@areusch areusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sergey-grovety !

@areusch areusch merged commit 76c78a9 into apache:main Nov 15, 2021
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev>
Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev>
Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev>
Co-authored-by: German Tretiakov <german.tretiakov@mir.dev>
Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev>
Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev>
Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev>
Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev>
Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev>
Co-authored-by: German Tretiakov <german.tretiakov@mir.dev>
Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev>
Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev>
Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev>
Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev>
Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev>
Co-authored-by: German Tretiakov <german.tretiakov@mir.dev>
Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev>
Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev>
Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev>
Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev>
Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev>
Co-authored-by: German Tretiakov <german.tretiakov@mir.dev>
Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev>
Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev>
Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev>
Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev>
Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev>
Co-authored-by: German Tretiakov <german.tretiakov@mir.dev>
Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev>
Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev>
Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
@sergio-grovety sergio-grovety deleted the cortex-m7-intrinsic branch October 26, 2022 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants