-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Topi] Cortex-M DSP support #9233
Conversation
…\gemm.py). Add 16bit variant. Turn off AutoTVM tuning for test (python\tvm\topi\arm_cpu\cortex_m7\conv2d\direct_simd.py).
Use int8 for cortex-m7 until universal gemm implementation for int8/int16
Add base of avg_pool intrinsic (fast sum of 16-bit array - only C code).
@u99127 would you like to take a look? also, do you have any suggestions as to how to implement the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly thank you for this and this is pretty interesting as it adds support for the DSP extensions in Cortex-M to improve native TVM schedules. Apologies also for the time it has taken to review this from our side.
I went through this today with @Mousius and while I'm out next week I'm happy for you folks to iterate through this.
I'm pretty cool with the implementation and trust that the tests have taken care of all the operators and that things work and it is suitable. We haven't had the time to review the individual operator implementations yet.
Some top level comments.
-
Could we fix the commit message summary to refer to adding support for the DSP extensions in Cortex-M, the instructions are implemented in the Cortex-M7 but are also available in other CPUs as well and thus by the use of these we should be able to get them elsewhere.
-
Are the numbers published here are they from running this on Cortex-M7 silicon or on the FVP ? The usage of the FVP allows functional testing of these schedules and we should do that to make sure that these remain correct. I don't expect the performance numbers that come out of the FVP to be meaningful.
-
Directory structure in the schedules should reflect that this is about DSP extensions and not anything Cortex-M7 specific. So I would suggest something like arm_cpu/mprofile/dsp as the directory structure.
-
This is a massive Pull request and mixes many things together and I would certainly recommend that this is broken into more manageable chunks to help review and get into the code base.
I'll put most of the other comments individually below where I'd like some clarifications and some changes in this.
Ramana Radhakrishnan
if ( | ||
avg_pool | ||
and layout in ("NCW", "NCHW") | ||
and "SMLAD" in isa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SMLAD, SSUB8 and SEL are part of the DSP instructions and the presence of one implies the presence of the other. I also think that in this case since we are adding all of these together globbing them into a single check for the use of the DSP extensions should be sufficient. Any reason why we are testing individual instructions ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with you that we should refactor this. this was left over from the initial implementation which did propose to test for presence of instructions in the ISA; however, you're right that we should just need to determine which architecture is in use. since this PR just adds additional schedules which are purported to be compatible with cortex-m7 devices, perhaps we can address the question of lookup-by-architecture in a follow-on.
python/tvm/target/arm_isa.py
Outdated
|
||
ARM_ISA_MAP = { | ||
"armv7e-m": ["SMLAD"], | ||
"armv7e-m": ["SMLAD", "SSUB8", "SEL"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
armv7e-m : DSP ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@u99127 as discussed, let's punt the architecture labelling to the next PR.
python/tvm/target/arm_isa.py
Outdated
|
||
ARM_ISA_MAP = { | ||
"armv7e-m": ["SMLAD"], | ||
"armv7e-m": ["SMLAD", "SSUB8", "SEL"], | ||
"armv8-m": ["SMLAD", "SSUB8", "SEL"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you want is armv8-m.main.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing here
f : function | ||
Function to mark | ||
""" | ||
_requires_corstone300 = [pytest.mark.corstone300] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a better way of controlling this - possibly something @Mousius could comment on here ?
|
||
|
||
def conv1d_nwc_direct_simd(*args, **kwargs): | ||
"""Defines the Cortex-M7 SIMD implementation of conv1d on NWC layout.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could well work in general for Armv7em and Armv8m.main and indeed any Cortex-M CPU that implements the DSP instruction set. The biggest win that one would get is in the use of these instructions rather than anything micro-architectural here.
Thus I would suggest trying to model this properly in terms of the ISA .
@Mousius would you have some time to take a look at this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i'm up for moving away from isa_analyzer. that was just an initial stab, but i agree that modeling this at the level of architecture rather than instruction makes more sense. however, i think it would be good to do that in a follow-on PR. this PR could then move forward isolated to what was tested already (on STM32F746 nucleo, I believe), and a follow-on could expand support to the broader architecture. what do you think of this?
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
"""Conv1d implementations for cortex-m7.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these folders exist already probably should be renamed as armv7em/dsp.
Maybe that move is a separate pull request rather than being merged in here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with this, perhaps we do this with the follow-on to expand to architecture?
|
||
|
||
def conv1d_nwc_direct_simd_schedule(cfg, outs): | ||
"""Schedule function for Cortex-M7 SIMD implementation of conv1d on NWC layout.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix comments to reflect that these are using the DSP extensions rather than SIMD.
Perhaps say :
"Schedule function for v7em DSP instructions of conv1d on NWC layout"
Please audit the whole file for such usage and fix it everywhere.
python/tvm/topi/arm_cpu/dense.py
Outdated
# pylint: disable=invalid-name, unused-variable, no-else-return, unused-argument, import-outside-toplevel | ||
"""Dense schedule for ARM CPU""" | ||
|
||
from .cortex_m7.dense import direct_simd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would happen with this on AArch64 ? Since these schedules are available on both AArch64 and AArch32 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This schedule is only for v7e-m strategy (check on isa). On AArch64 "dense.generic" strategy will be chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it makes sense then to not import the cortex_m7 direct_simd into this module. can we reorganize as discussed in the earlier thread?
use_unpacked_api=True, | ||
target_opts={ | ||
"-keys": "arm_cpu", | ||
"-march": "armv7e-m", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused with the use of -march=armv7e-m and not -mcpu here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed--i think -mcpu was used to key the IsaAnalyzer, correct?
* removed odd "micro_dev" * fixes for instrinsics' functions comments
discussed a bit with @grant-arm and he and @Mousius @manupa-arm will propose a solution to detecting the Corstone-300 FVP binary and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub has helpfully made it difficult to manage the comments in these threads, so I'll place this reply here. Let me know if I missed a specific thread somewhere.
yeah i'm up for moving away from isa_analyzer. that was just an initial stab, but i agree that modeling this at the level of architecture rather than instruction makes more sense. however, i think it would be good to do that in a follow-on PR. this PR could then move forward isolated to what was tested already (on STM32F746 nucleo, I believe), and a follow-on could expand support to the broader architecture. what do you think of this?
I'm fine with splitting out the isa_analyzer
into a follow up, it should be trivial to map it to the extensions. However, I'd still advocate for splitting the remaining work into smaller patches which can be reviewed independently. There's a conflation of testing updates, boilerplate for the operators and the operators themselves which we can reason about per operator perhaps?
Also, please update the commit message on this pull request to reflect the DSP extension 😸
python/tvm/topi/arm_cpu/conv1d.py
Outdated
|
||
|
||
@autotvm.register_topi_compute("conv1d_nwc_direct_simd.arm_cpu") | ||
def conv1d_nwc_direct_simd(cfg, data, kernel, strides, padding, dilation, out_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggested a file structure such as:
arm_cpu/mprofile/dsp/conv1d.py
This leaves room to add other architecture extensions in future rather than stacking them all in one directory.
@@ -374,6 +374,8 @@ def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types): | |||
attrs["kernel_layout"], | |||
attrs["groups"], | |||
) | |||
|
|||
# Use int8 for Cortex-M7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not limited to this CPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sergey-grovety can you revert this comment or fix the set of CPUs indicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sergey-grovety! i went through the remaining comments. i think given @mehrdadh merged the i386only PR, we can now proceed to merge this once the remaining comments are addressed.
as @Mousius said, let's punt the IsaAnalyzer changes to another PR. however, let's make the organizational changes now so as to avoid confusion in the placement of schedules in topi.
python/tvm/target/arm_isa.py
Outdated
|
||
ARM_ISA_MAP = { | ||
"armv7e-m": ["SMLAD"], | ||
"armv7e-m": ["SMLAD", "SSUB8", "SEL"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@u99127 as discussed, let's punt the architecture labelling to the next PR.
python/tvm/target/arm_isa.py
Outdated
|
||
ARM_ISA_MAP = { | ||
"armv7e-m": ["SMLAD"], | ||
"armv7e-m": ["SMLAD", "SSUB8", "SEL"], | ||
"armv8-m": ["SMLAD", "SSUB8", "SEL"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing here
python/tvm/target/arm_isa.py
Outdated
# TODO: actually parse -mcpu | ||
arch = "armv7e-m" | ||
self._isa_map = ARM_ISA_MAP[arch] | ||
parser = argparse.ArgumentParser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should use the built-in Target parsing logic here rather than argparse:
parser = argparse.ArgumentParser() | |
target = tvm.target.Target(target) | |
march = target.attrs.get("-march", None) | |
self._isa_map = ARM_ISA_MAP[march] if march is not None else [] |
(also need to delete the following lines 33-36--suggestion didn't quite get the diff)
tests/python/conftest.py
Outdated
if config.getoption("--enable-corstone300-tests"): | ||
if not "corstone300" in item.keywords: | ||
item.add_marker( | ||
pytest.mark.skip(reason="Test should be marked 'corstone300' to run") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we just need one skip, right? doesn't this skip all other tests aside from corstone300?
if ( | ||
avg_pool | ||
and layout in ("NCW", "NCHW") | ||
and "SMLAD" in isa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with you that we should refactor this. this was left over from the initial implementation which did propose to test for presence of instructions in the ISA; however, you're right that we should just need to determine which architecture is in use. since this PR just adds additional schedules which are purported to be compatible with cortex-m7 devices, perhaps we can address the question of lookup-by-architecture in a follow-on.
@@ -374,6 +374,8 @@ def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types): | |||
attrs["kernel_layout"], | |||
attrs["groups"], | |||
) | |||
|
|||
# Use int8 for Cortex-M7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sergey-grovety can you revert this comment or fix the set of CPUs indicated?
python/tvm/topi/arm_cpu/conv1d.py
Outdated
|
||
|
||
@autotvm.register_topi_compute("conv1d_nwc_direct_simd.arm_cpu") | ||
def conv1d_nwc_direct_simd(cfg, data, kernel, strides, padding, dilation, out_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mprofile seems good to me.
python/tvm/topi/arm_cpu/dense.py
Outdated
# pylint: disable=invalid-name, unused-variable, no-else-return, unused-argument, import-outside-toplevel | ||
"""Dense schedule for ARM CPU""" | ||
|
||
from .cortex_m7.dense import direct_simd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it makes sense then to not import the cortex_m7 direct_simd into this module. can we reorganize as discussed in the earlier thread?
use_unpacked_api=True, | ||
target_opts={ | ||
"-keys": "arm_cpu", | ||
"-march": "armv7e-m", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed--i think -mcpu was used to key the IsaAnalyzer, correct?
files renamed cortex_m7 -> mprofile/dsp
Fixed IsaAnalyzer mcpu detection
looks like we are busted at head, #9480 is the fix |
…7-intrinsic # Conflicts: # tests/python/conftest.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sergey-grovety !
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev> Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev> Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev> Co-authored-by: German Tretiakov <german.tretiakov@mir.dev> Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev> Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev> Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev> Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev> Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev> Co-authored-by: German Tretiakov <german.tretiakov@mir.dev> Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev> Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev> Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev> Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev> Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev> Co-authored-by: German Tretiakov <german.tretiakov@mir.dev> Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev> Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev> Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev> Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev> Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev> Co-authored-by: German Tretiakov <german.tretiakov@mir.dev> Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev> Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev> Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
Co-authored-by: Sergey Smirnov <Sergey.Smirnov@mir.dev> Co-authored-by: Ekaterina Bern <Ekaterina.Bern@mir.dev> Co-authored-by: Mikhail Trubnikov <Mikhail.Trubnikov@mir.dev> Co-authored-by: German Tretiakov <german.tretiakov@mir.dev> Co-authored-by: Ilya Gozman <Ilya.Gozman@mir.dev> Co-authored-by: Alexey.Yazev <Alexey.Yazev@mir.dev> Co-authored-by: Ilya Gozman <92577591+ilyag-grovety@users.noreply.github.com>
TVM operations implementation using Cortex-M DSP instructions
nn.conv2d
nn.max_pool2d
nn.avg_pool2d
nn.dense
Implemented with same gemm method, described above
nn.conv1d
Specific case of gemm usage - with one of data dimensions equal to 1
nn.avg_pool1d
Implemented for NCW layout with same intrinsic as for 2d version of operation
nn.max_pool1d
Implemented with same intrinsic as for 2d version of operation
Benchmarking:
To enable intrinsic code generation you should specify -mcpu=cortex-m7 flag
HW platform: STM32F746 Nucleo; GCC10, optimization flags: -O3
If you want to enable intrinsic you should specify -march parameter of the target:
target_str = f"c -keys=arm_cpu -mcpu=cortex-m7 -march=armv7e-m -model=stm32f746xx -runtime=c -link-params=1 --executor=aot --unpacked-api=1 --interface-api=c"
Results