[RFC] Arm Ethos-U Integration #11

Merged
merged 8 commits into from
Sep 28, 2021
67 changes: 33 additions & 34 deletions rfcs/0011_Arm_Ethos-U_Integration.md
@@ -5,35 +5,34 @@

# Motivation

Arm® Ethos™-U is a series of NPUs that will enable low-cost and highly efficient AI solutions for a wide range of embedded devices. This RFC introduces the port of Ethos-U into the uTVM compilation flow. The process of compilation relies on the multiple levels of abstraction in TVM and a variety of analysis and optimisation passes to produce c output. In the process of compilation, we rely on the many levels of TVM's IR (and the passes) to perform optimizations to create c-sources that can work with current microTVM deployments.
Arm® Ethos™-U is a series of NPUs that will enable low-cost and highly efficient AI solutions for a wide range of embedded devices. This RFC introduces the port of the NPU into the microTVM compilation flow. Compilation relies on the multiple levels of abstraction in TVM's IR, and on a variety of analysis and optimisation passes, to produce C sources that can work with current microTVM deployments.

## Scope:
Contributor:
could you explain the intent behind this RFC, so readers know whether it's comprehensive or whether to expect follow-on RFCs?

Contributor Author:
I think it's explained: "The scope for this RFC is to add support for offloading to the Arm® Ethos™-U55 NPU. The initial machine learning framework that we use for testing this is TensorFlow Lite. Future RFCs and pull requests will address additional NPUs such as the Ethos™-U65, optimization to the compilation pipeline and other frameworks as the port evolves."


### Ethos™-U55

![](./assets/0011/ethosu_hw.png)

Ethos™-U55 is a NPU that is designed to uplift ML performance by working as an offload target for micro-controllers. It can accelerate quantized ML operators such as Convolution2D, Depthwise Convolution, Pooling and Elementwise Operators. For convolution-type operators, Ethos-U55 supports hardware enabled loseless de-compression of weights to increase inference performance and reduce power.
Ethos™-U55 is an NPU designed to uplift ML performance by working as an offload target for microcontrollers. It can accelerate quantized ML operators such as Convolution2D, Depthwise Convolution, Pooling and Elementwise Operators. For convolution-type operators, the NPU supports hardware-enabled lossless decompression of weights to increase inference performance and reduce power.

The scope for this RFC is to add support for offloading to the Arm Ethos-U55 NPU. The initial machine learning framework that we use for testing this is TensorFlow Lite. Future RFCs and pull requests will address additional NPUs, such as the Ethos-U65, and other frameworks as the port evolves.
The scope for this RFC is to add support for offloading to the Arm® Ethos-U55 NPU. The initial machine learning framework that we use for testing this is TensorFlow Lite. Future RFCs and pull requests will address additional NPUs such as the Ethos-U65, optimization to the compilation pipeline and other frameworks as the port evolves.

Please refer to Technical Reference Manual (TRM) for more details – https://developer.arm.com/documentation/102420/0200.
Please refer to [Technical Reference Manual (TRM)](https://developer.arm.com/documentation/102420/0200) for more details.
* Reference : https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55

# Guide-level explanation

## TVMC User Interface
```
tvmc compile my_model.tflite
--executor=aot
--output-format=mlf
--executor=aot \
--output-format=mlf \
--target="ethos-u --accelerator-config=ethos-u55-xxx,c"   # ---> Model Library Format

# where xxx could be out of possible configuration of the accelerator that can take values : [32, 64, 128, 256]
# where xxx indicates the accelerator variant and can take the values : [32, 64, 128, 256]
```

The users should be able to use the above command to compile to ethos-u55 that would generate Model Library Format(MLF) output.
Please take a look at our provided example in the last PR (once its published).
Users should be able to use the above command to compile for the NPU and generate [Model Library Format (MLF)](https://github.com/apache/tvm/blob/main/docs/dev/model_library_format.rst) output.
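
As a rough illustration of how the command above varies with the accelerator variant, the following sketch builds the argument list in Python. The helper `build_tvmc_command` is hypothetical and not part of TVMC; the target string is copied from this RFC draft (with the stray quote dropped), and released TVM versions may spell the option differently.

```python
import shlex
from typing import List

# Variants listed in the RFC comment above (ethos-u55-32 ... ethos-u55-256).
SUPPORTED_VARIANTS = (32, 64, 128, 256)


def build_tvmc_command(model_path: str, variant: int) -> List[str]:
    """Return the argument list for the `tvmc compile` call shown above.

    Hypothetical helper for illustration only; it does not invoke TVMC.
    """
    if variant not in SUPPORTED_VARIANTS:
        raise ValueError(f"unsupported Ethos-U55 variant: {variant}")
    # Assumption: the target string mirrors the RFC draft text.
    target = f"ethos-u --accelerator-config=ethos-u55-{variant},c"
    return [
        "tvmc",
        "compile",
        model_path,
        "--executor=aot",
        "--output-format=mlf",
        f"--target={target}",
    ]


command = build_tvmc_command("my_model.tflite", 256)
# shlex.join quotes the target so the line can be pasted into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run` avoids shell quoting altogether, which sidesteps the nested-quote problem that a target string containing spaces would otherwise cause.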

## Design Architecture Overview

@@ -53,8 +52,8 @@ Please refer to this discuss post for more information : https://discuss.tvm.apa

#### Unified static memory planning :

Ethos™-U is a NPU that is aimed at running with microTVM. Therefore, as with typical usecases of microTVM, Ethos™-U NPU will require aggressive memory optimizations by sharing buffers with intermediaries used by the CPU.
We envision a flow to expose the TIR generated by Ethos™-U codegen to future unified static memory planner to be optimized.
The NPU is aimed at running with microTVM. Therefore, as with typical use cases of microTVM, the NPU will require aggressive memory optimizations that share buffers with intermediaries used by the CPU.
We envision a flow that exposes the TIR generated by the codegen to a future unified static memory planner for optimization.
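
To make the idea of buffer sharing concrete, here is a minimal sketch of a greedy first-fit planner that places buffers with non-overlapping live ranges at the same offset of one pool. It is a toy under stated assumptions, not the algorithm of the proposed unified static memory planner; the `Buffer` type, `plan_offsets` and the example sizes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int       # bytes
    first_use: int  # index of the first operation that touches the buffer
    last_use: int   # index of the last operation that touches the buffer


def plan_offsets(buffers: List[Buffer]) -> Tuple[Dict[str, int], int]:
    """Assign pool offsets so that simultaneously live buffers never overlap."""
    placed: List[Tuple[int, Buffer]] = []
    offsets: Dict[str, int] = {}
    # Placing large buffers first is a common greedy heuristic.
    for buf in sorted(buffers, key=lambda b: -b.size):
        # Address ranges already taken by buffers that are live at the same time.
        conflicts = sorted(
            (offset, offset + other.size)
            for offset, other in placed
            if not (buf.last_use < other.first_use or other.last_use < buf.first_use)
        )
        candidate = 0
        for start, end in conflicts:
            if candidate + buf.size <= start:
                break  # the buffer fits in the gap before this range
            candidate = max(candidate, end)
        offsets[buf.name] = candidate
        placed.append((candidate, buf))
    pool_size = max((offsets[b.name] + b.size for b in buffers), default=0)
    return offsets, pool_size


buffers = [
    Buffer("cpu_input", 100, first_use=0, last_use=1),
    Buffer("npu_scratch", 50, first_use=1, last_use=2),
    Buffer("cpu_output", 100, first_use=2, last_use=3),
]
offsets, pool_size = plan_offsets(buffers)
print(offsets, pool_size)
```

In this example the CPU input and output buffers are never live together, so they share offset 0 and the pool needs 150 bytes rather than the 250 bytes that separate allocations would take.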

For more information about the proposed unified static memory planner, please refer to this discuss post : https://discuss.tvm.apache.org/t/rfc-unified-static-memory-planning/10099.
Contributor:
perhaps makes more sense to link to the RFC?


@@ -64,7 +63,7 @@ For more information about the proposed unified static memory planner, please re

### C1. TVM Frontend and Partitioning

The Relay graph as lowered from the TVM's frontend will be partitioned into Ethos-U subgraphs via running AnnotateTarget, MergeCompilerRegions and PartitionGraph Relay passes. Therefore, this procedure will result in the creation of "external" Relay functions that are re-directed to Ethos-U Relay and TIR pass pipeline for the creation of c-source as stated above.
The Relay graph lowered from TVM's frontend will be partitioned into subgraphs by running the AnnotateTarget, MergeCompilerRegions and PartitionGraph Relay passes. This procedure creates "external" Relay functions that are redirected to the NPU Relay and TIR pass pipeline to create the C source, as stated above.
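
The effect of the three passes can be pictured with a toy model over a linear chain of operators: consecutive supported operators are merged into one region, and each region becomes an external function. This is only a sketch of the idea, not TVM's implementation; `partition_chain` and the set of supported operators are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Set, Tuple


def partition_chain(
    ops: List[str], supported: Set[str], compiler: str = "ethosu"
) -> List[Tuple[str, List[str]]]:
    """Group consecutive offloadable ops of a linear chain into external regions.

    Ops outside the supported set stay in "main" and run on the CPU.
    """
    regions: List[Tuple[str, List[str]]] = []
    external_count = 0
    for offloadable, group in groupby(ops, key=lambda op: op in supported):
        if offloadable:
            # Named like the external functions in the RFC, e.g. ethosu_0.
            regions.append((f"{compiler}_{external_count}", list(group)))
            external_count += 1
        else:
            regions.append(("main", list(group)))
    return regions


# Assumption: an illustrative supported set, not the real operator coverage.
supported_ops = {"qnn.conv2d", "nn.bias_add", "qnn.requantize"}
chain = ["qnn.conv2d", "nn.bias_add", "qnn.requantize", "nn.softmax", "qnn.conv2d"]
regions = partition_chain(chain, supported_ops)
for name, region_ops in regions:
    print(name, region_ops)
```

Real Relay graphs are DAGs rather than chains, so the actual passes also have to avoid creating cyclic dependencies between regions, which this sketch does not attempt.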

```
# A Partitioned example for Conv2D
Contributor:
it would be great to note explicitly that this is the IRModule output by this stage

@@ -86,11 +85,14 @@ def @ethosu_0(%ethosu_0_i0: Tensor[(1, 300, 300, 3), int8], Compiler="ethosu", .
```


### C2. Relay Legalization to Ethos™-U HW Primitive operations.
### C2. Relay Legalization to Ethos™-U NPU HW Primitive operations.

In the design, we have decided to introduce TEs that closely describes the compute of each primitive operation that the hardware can natively execute – that we define as Ethos™-U HW primitive operations in their own Relay operators. Moreover, there are many Relay operators that could be lowered to the Ethos™-U HW primitives (e.g., dense could be legalized to a conv2d operator). This component will legalize the external Relay function to Ethos™-U HW primitive operations.
In the design, we have decided to introduce TEs that closely describe the compute of each primitive operation that the hardware can natively execute, which we define as Ethos™-U NPU HW primitive operations, each with its own Relay operator. The rationale for adding new Relay operators is that each one represents a pattern of conventional Relay operations that the hardware executes atomically.

Moreover, there are many Relay operators that could be lowered to the HW primitives (e.g., dense could be legalized to a conv2d operator). This component will legalize the external Relay function to HW primitive operations.

The NPU supports per-channel quantization by encoding a scale with each bias value. Thus, the weight scales are converted to that format and packed with the biases. Thereafter, the packed biases and scales are passed as a constant input to the Relay operator. The weights are not compressed at this stage; they are compressed later, in the subsequent TIR lowering phase.

Ethos™-U hardware supports per-channel quantization through via encoding a scale with each bias value. Thus, the weight scales are converted to that format and packed with the biases. Thereafter, the packed bias and scales are made to a constant input to the Relay operator.
For more details, please refer to : https://developer.arm.com/documentation/102420/0200
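
The packing step can be sketched as follows: each floating-point scale is approximated by an integer multiplier and a right shift, and the result is stored next to the bias of the same channel. The 13-byte record layout used here (`<qIB`) is purely hypothetical and chosen for readability; the real encoding is defined by the hardware documentation referenced above, and `quantize_scale` and `pack_bias_and_scales` are illustrative names.

```python
import math
import struct
from typing import List, Tuple


def quantize_scale(scale: float) -> Tuple[int, int]:
    """Approximate `scale` as multiplier * 2**-shift with multiplier < 2**31."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent
    multiplier = round(mantissa * (1 << 31))
    if multiplier == 1 << 31:  # rounding pushed the mantissa up to 1.0
        multiplier >>= 1
        exponent += 1
    shift = 31 - exponent
    if not 0 <= shift <= 255:
        raise ValueError("scale out of range for this illustrative layout")
    return multiplier, shift


def pack_bias_and_scales(biases: List[int], weight_scales: List[float]) -> bytes:
    """Pack one (bias, multiplier, shift) record per output channel."""
    if len(biases) != len(weight_scales):
        raise ValueError("expected one scale per bias")
    packed = bytearray()
    for bias, scale in zip(biases, weight_scales):
        multiplier, shift = quantize_scale(scale)
        # Hypothetical layout: int64 bias, uint32 multiplier, uint8 shift.
        packed += struct.pack("<qIB", bias, multiplier, shift)
    return bytes(packed)


packed = pack_bias_and_scales([10, -3], [0.5, 0.0039])
print(len(packed), struct.unpack("<qIB", packed[:13]))
```

Because the packed bytes depend only on values known at compile time, they can be handed to the Relay operator as a single constant tensor, as the paragraph above describes.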

```
@@ -102,9 +104,9 @@ fn (%ethosu_0_i0: Tensor[(1, 300, 300, 3), int8], ..., global_symbol="ethosu_0",
```


### C3. Ethos™-U TE/TIR Compiler Passes
### C3. NPU TE/TIR Compiler Passes

At this stage, we should have a TE representation of all HW primitive operations that belong to the offloaded function. We will be scheduling the TE representation to TIR Primfunc that describes the intermediary storage and hardware operations that needed to be executed. In future, we are intending to add more TE/TIR passes make the Ethos™-U TE/TIR compiler perform memory and performance optimizations (See https://discuss.tvm.apache.org/t/rfc-cascade-scheduling/8119) . Therefore, its vital to have all the operations represented in TE/TIR. Its important to note that Ethos™-U hardware requires weights to be 'encoded' in a certain way to be readable by the hardware. Therefore, the weight encoding is performed here and represented in the TIR primfunc with post-encoding sizes as buffers.
At this stage, we should have a TE representation of all HW primitive operations that belong to the offloaded function. We will schedule the TE representation into a TIR Primfunc that describes the intermediary storage and the hardware operations that need to be executed. In future, we intend to add more TE/TIR passes so that the NPU TE/TIR compiler can perform memory and performance optimizations (see https://discuss.tvm.apache.org/t/rfc-cascade-scheduling/8119). Therefore, it's vital to have all the operations represented in TE/TIR. It's important to note that the hardware requires weights to be 'encoded' in a specific format before it can read them. Therefore, the weight encoding is performed here and represented in the TIR Primfunc as buffers with post-encoding sizes.

```
primfn(placeholder_1: handle, placeholder_2: handle, placeholder_3: handle, ethosu_write_1: handle) -> ()
@@ -125,12 +127,9 @@ primfn(placeholder_1: handle, placeholder_2: handle, placeholder_3: handle, etho
}
```

### C4. Translating TIR Primfuncs to C-sources that call to the driver APIs to perform the execution.

Given, that the complexity of this component, we'll be putting up a seperate RFC to describe the functionality of Ethos™-U TE/TIR Compiler in detail.

### C4. Translating Ethos™-U TIR Primfuncs to C-sources that call to the Ethos™-U driver APIs to perform the execution.

Ethos™-U hardware is used from the host CPU via invoking a driver API call with a command stream (a Ethos™-U specific binary artefact) that describes the hardware operators that need to execute. This component will use the TIR Primfunc to extract the hardware operators and buffer information. Thereafter, we'll be using Arm® Vela (https://pypi.org/project/ethos-u-vela/) compiler's backend python APIs to convert the TIR Primfunc to a command stream. Finally, the generated command stream will be wrapped in a c-source that invokes it using the Ethos™-U driver APIs.
The hardware is used from the host CPU by invoking a driver API call with a command stream (a hardware-specific binary artefact) that describes the hardware operators that need to execute. This component will use the TIR Primfunc to extract the hardware operators and buffer information. Thereafter, we'll use the backend Python APIs of the Arm® Vela compiler (https://pypi.org/project/ethos-u-vela/) to convert the TIR Primfunc to a command stream. Finally, the generated command stream will be wrapped in a C source that invokes it using the driver APIs.
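
The final wrapping step amounts to turning binary blobs into C array definitions. The sketch below shows one way to emit a definition in the style of the generated source in this RFC; `emit_c_array` is a hypothetical helper, the byte values are placeholders, and the driver call itself is deliberately left out because its API is not described here.

```python
def emit_c_array(name: str, section: str, data: bytes) -> str:
    """Emit a C definition that places `data` in the given linker section.

    Hypothetical helper: it mirrors the style of the generated source in this
    RFC, where the array is initialised from a string of hex escapes.
    """
    escaped = "".join(f"\\x{byte:02x}" for byte in data)
    return (
        f'__attribute__((section("{section}"), aligned(16))) '
        f'static int8_t {name}[{len(data)}] = "{escaped}";'
    )


# Placeholder bytes; a real command stream would come from the Vela backend.
command_stream = bytes([0x43, 0x4F])
definition = emit_c_array("cms_data_data", "cms_data_sec", command_stream)
print(definition)
```

Escaping every byte keeps the emitted literal unambiguous, since a C hex escape otherwise absorbs any hexadecimal digits that follow it.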

```
#include <stdio.h>
@@ -140,7 +139,7 @@ Ethos™-U hardware is used from the host CPU via invoking a driver API call wit

static const size_t weights_size = 1632;
static const size_t scratch_size = 1632;
// Update linker script to place weights_sec and cms_data_sec in memory that can be accseed by Ethos-U
// Update linker script to place weights_sec and cms_data_sec in memory that can be accessed by the hardware
__attribute__((section("weights_sec"), aligned(16))) static int8_t weights[1632] = "\xc1\x1a...";
__attribute__((section("cms_data_sec"), aligned(16))) static int8_t cms_data_data[396] = "\x43\x4f...";
static const size_t cms_data_size = sizeof(cms_data_data);
@@ -192,42 +191,42 @@ TVM_DLL int32_t ethosu_0(TVMValue* args, int* type_code, int num_args, TVMValue*

## Build system

The only dependency of TVM compilation for Ethos™-U is using Arm® Vela compiler (https://pypi.org/project/ethos-u-vela/).
However, to run inferences with the sources generated by TVM, we would need to use the Ethos-U core driver (https://git.mlplatform.org/ml/ethos-u/ethos-u-core-driver.git/about/). The user is expected to include the necessary sources if they were looking to use this bare-metal.
The only dependency of TVM compilation for Ethos™-U NPU is using [Arm® Vela compiler](https://pypi.org/project/ethos-u-vela/).
However, to run inferences with the sources generated by TVM, we would need to use the [Ethos-U NPU core driver](https://git.mlplatform.org/ml/ethos-u/ethos-u-core-driver.git/about/). Users are expected to include the necessary sources if they want to use this on bare metal.

## Testing

Firstly, we will be providing unit tests for the components described above.

Secondly, we are planning to use Arm® Corestone™-300 Fixed Virtual Platform (FVP – https://developer.arm.com/ip-products/subsystem/corstone/corstone-300) in the CI to be able to simulate the codegen'd artifacts of TVM on a SoC that has Arm® Cortex™-M55 and Ethos™-U55.
Secondly, we are planning to use [Arm® Corstone™-300 reference system](https://developer.arm.com/ip-products/subsystem/corstone/corstone-300) in the CI to be able to simulate the codegen'd artifacts of TVM on a SoC that has Arm® Cortex™-M55 and Ethos™-U55.
Contributor:
Will there ever be plans to provide third party CI for testing on a hardware cloud?

Contributor Author:
cc : @u99127, However I feel this question is out of scope for the RFC.

@u99127, Aug 25, 2021:
Thanks @manupa-arm

@hogepodge

We currently do not have any plans for adding hardware into a cloud. Ethos-U is licensed as a part of Arm’s IP portfolio. I expect SoCs and boards containing Ethos-U from the broader Arm ecosystem in due course but I’m not at liberty to speculate on time lines.

As part of the Ethos-U upstreaming work and the first PR that has been merged for docker images, Arm has contributed a reference system, what we call Fixed Virtual Platforms for the Corstone-300 subsystem in the TVM CI. We believe this is sufficient for continually testing the correctness of the Ethos-U port and indeed for testing correctness on Cortex-M.

The value of hardware based correctness testing is debatable beyond testing that comes from the simulator and is beyond the scope of this RFC or Pull Requests.


We will be providing end-to-end tests in two categories :

* Operator tests : single operator tests will be executed.
* Network tests : a few networks of interest will be executed.

We are introducing a test_runner application that uses the AoT executor (See https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206). Moreover, the included test_runner application would be a harness that could serve as a sample application for inferences on the Ethos™-U55.
We are introducing a test_runner application that uses the [AoT executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206). Moreover, the included test_runner application would be a harness that could serve as a sample application for inferences on the hardware.

## Code location

python/tvm/relay/backend/contrib/ethosu/ – Main directory that holds the relay legalization passes, TIR to Command Stream translation and the integration of the codegen.
python/tvm/relay/backend/contrib/ethosu/op – The definition of Ethos-U relay operators
python/tvm/relay/backend/contrib/ethosu/te – The TE compute definitions of Ethos-U relay operators
python/tvm/relay/backend/contrib/ethosu/tir – The TIR compiler for performance and memory optimization of Ethos-U Relay operators
python/tvm/relay/backend/contrib/ethosu/op – The definition of NPU relay operators
python/tvm/relay/backend/contrib/ethosu/te – The TE compute definitions of NPU relay operators
python/tvm/relay/backend/contrib/ethosu/tir – The TIR compiler for performance and memory optimization of NPU Relay operators

src/relay/backend/contrib/ethosu/ – C++ sources for implementation of passes (where compile-time performance is critical) and the generation of C-source module.
tests/python/contrib/test_ethosu/ – The test directory

# Upstreaming Plan

The scope for the initial upstreaming is adding support for Conv2D offloading to Ethos-U.
The scope for the initial upstreaming is adding support for Conv2D offloading to the NPU.

* [P1] The ci_cpu Dockerfile changes and install scripts – Arm® Corestone™-300 FVP and Ethos™-U core driver
* [P1] The ci_cpu Dockerfile changes and install scripts – Arm® Corstone™-300 reference system and Ethos™-U NPU core driver
* [P2] The Relay passes with unit tests for Conv2D (Partitioning, Preprocessing and Legalization)
* [P3] The Ethos™-U Relay operators, TE compute definitions and TIR Passes for Conv2D and tests (unit / partial integration tests)
* [P3] The NPU Relay operators, TE compute definitions and TIR Passes for Conv2D and tests (unit / partial integration tests)
* [P4] TIR to CS translator for Conv2D with unit tests
* [P5] The C source generator
* [P6] The overall codegen integration and tests
* [P7] TVMC changes and tutorial to how to run a network on the Arm® Corestone™-300 FVP
* [P7] TVMC changes and tutorial on how to run a network on the Arm® Corstone™-300 reference system.

Once the initial PRs are landed – we are planning to improve operator coverage.