A few typo fixes in the uTVM design doc. #7291

Merged: 1 commit, Jan 15, 2021
35 changes: 18 additions & 17 deletions docs/dev/microtvm_design.rst
@@ -36,7 +36,7 @@ change for a proof-of-concept implementation on such devices, the runtime cannot
projects implement support for these, but they are by no means standard.
* Support for programming languages other than **C**.

-Such changes require a different appraoch from the TVM C++ runtime typically used on traditional
+Such changes require a different approach from the TVM C++ runtime typically used on traditional
Operating Systems.

Typical Use
@@ -92,7 +92,7 @@ Modeling Target Platforms
-------------------------

TVM's search-based optimization approach allows it to largely avoid system-level modeling of targets
-in favor of experimental results. However, some modelling is necessary in order to ensure TVM is
+in favor of experimental results. However, some modeling is necessary in order to ensure TVM is
comparing apples-to-apples search results, and to avoid wasting time during the search by attempting
to compile invalid code for a target.
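
To make this concrete, the platform model largely lives in the target definition. The short sketch below is an editor's illustration rather than part of the original document: the specific attribute values are placeholders, and the exact set of options accepted by the ``c`` target varies between TVM releases.

.. code-block:: python

    # Illustrative only: option names follow the TVM target system of roughly this
    # era (early 2021); the specific values are placeholders, not a recommendation.
    import tvm

    # For microTVM, "modeling the platform" mostly means describing it to the C
    # code generator through target attributes such as the CPU and the runtime.
    target = tvm.target.Target("c -keys=arm_cpu -mcpu=cortex-m7 -runtime=c")

    print(target.kind.name)      # "c"
    print(target.attrs["mcpu"])  # "cortex-m7"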

@@ -143,10 +143,10 @@ Writing Schedules for microTVM

For operations scheduled on the CPU, microTVM initially plans to make use of specialized
instructions and extern (i.e. hand-optimized) functions to achieve good performance. In TVM, this
-appraoch is generally accomplished through tensorization, in which TVM breaks a computation into
+approach is generally accomplished through tensorization, in which TVM breaks a computation into
small pieces, and a TIR extern function accelerates each small piece.

-TVM currently accomodates both approaches using ``tir.call_extern``. First, a pragma is attached to
+TVM currently accommodates both approaches using ``tir.call_extern``. First, a pragma is attached to
the schedule defining the extern function in portable C.

``sched[output].pragma(n, "import_c", "void call_asm(int32_t* a, int32_t* b) { /* ... */ }")``
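
The pragma above only imports the C definition; the other half of the tensorization approach is to declare a tensor intrinsic whose body calls the imported function with ``tir.call_extern`` and to apply it with ``tensorize``. The following sketch is an editor's illustration, not part of the original document: the trivial 1-D ``int32`` computation and buffer layout are assumptions, and helper signatures may differ across TVM versions.

.. code-block:: python

    # Illustrative sketch: a trivial 1-D computation whose inner loop is replaced
    # by a call to the hand-written ``call_asm`` imported via the pragma above.
    import tvm
    from tvm import te

    n = 16
    a = te.placeholder((n,), dtype="int32", name="a")
    b = te.compute((n,), lambda i: a[i], name="b")

    def _lower_to_call_extern(ins, outs):
        ib = tvm.tir.ir_builder.create()
        ib.emit(tvm.tir.call_extern("int32", "call_asm",
                                    ins[0].access_ptr("r"),
                                    outs[0].access_ptr("w")))
        return ib.get()

    a_buf = tvm.tir.decl_buffer(a.shape, a.dtype, name="a_buf", offset_factor=1)
    b_buf = tvm.tir.decl_buffer(b.shape, b.dtype, name="b_buf", offset_factor=1)
    intrin = te.decl_tensor_intrin(b.op, _lower_to_call_extern,
                                   binds={a: a_buf, b: b_buf})

    sched = te.create_schedule(b.op)
    sched[b].tensorize(b.op.axis[0], intrin)
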
@@ -183,10 +183,11 @@ are of course not easy to use from LLVM bitcode.
Executing Models
----------------

-The TVM compiler traditionally outputs 3 pieces:
-1. Model operator implementations, as discussed above.
-2. A model execution graph, encoded as JSON
-3. Simplified parameters
+The TVM compiler traditionally outputs three pieces:
+
+1. Model operator implementations, as discussed above;
+2. A model execution graph, encoded as JSON; and
+3. Simplified parameters.

To correctly execute the model, a Graph Runtime needs to reconstruct the graph in memory, load the
parameters, and then invoke the operator implementations in the correct order.
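
For comparison (this example is an editor's addition, not text from the design doc), the traditional host-side flow that microTVM adapts looks roughly like the sketch below; the tiny Relay model and the tuple-style return of ``tvm.relay.build`` reflect the TVM API of roughly this era and may differ in newer releases.

.. code-block:: python

    # Illustrative sketch of the traditional (non-micro) flow: build produces the
    # three pieces, and a GraphRuntime on the host reconstructs and runs the graph.
    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_runtime

    # A tiny placeholder model; any Relay module would do.
    x = relay.var("x", shape=(1, 8), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

    # Tuple-style return: graph JSON, operator library, simplified parameters.
    graph_json, lib, params = relay.build(mod, target="llvm")

    ctx = tvm.cpu(0)
    module = graph_runtime.create(graph_json, lib, ctx)  # reconstruct the graph in memory
    module.set_input(**params)                           # load the parameters
    module.set_input("x", np.ones((1, 8), dtype="float32"))
    module.run()                                         # invoke operators in order
    print(module.get_output(0).asnumpy())
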
@@ -206,11 +207,11 @@ Host-Driven Execution

In Host-Driven execution, the firmware binary is the following:

-1. Generated operator implementations from TVM
-2. The TVM C runtime
+1. Generated operator implementations from TVM.
+2. The TVM C runtime.
3. SoC-specific initialization.
4. The TVM RPC server.
-5. (optional) Simplified Parameters
+5. (optional) Simplified Parameters.

This firmware image is flashed onto the device and a GraphRuntime instance is created on the host.
The GraphRuntime drives execution by sending RPC commands over a UART:
@@ -270,7 +271,7 @@ For Standalone model execution, firmware also needs:
5. The remaining compiler outputs (Simplified Parameters and Graph JSON).

The Automated Build Flow
--------------------------
+------------------------

Once code generation is complete, ``tvm.relay.build`` returns a ``tvm.runtime.Module`` and the
user can save the generated C source or binary object to a ``.c`` or ``.o`` file. From this point, TVM
@@ -287,12 +288,12 @@ However, for AutoTVM, TVM needs some automated flow to handle the following task
At present, TVM expects the user to supply an implementation of the ``tvm.micro.Compiler``,
``tvm.micro.Flasher``, and ``tvm.micro.Transport`` interfaces. TVM then:

-1. Builds each piece separately as a library
+1. Builds each piece separately as a library.
2. Builds the libraries into a binary firmware image.
3. Programs the firmware image onto an attached device.
4. Opens a serial port to serve as the RPC server transport.
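
A user-supplied platform integration might therefore look roughly like the skeleton below. Only the ``tvm.micro.Compiler``, ``tvm.micro.Flasher``, and ``tvm.micro.Transport`` interface names come from this document; the method names and signatures are the editor's assumptions about this era of the API and should be checked against the base classes in your TVM checkout.

.. code-block:: python

    # Skeleton only: method names/signatures are assumptions; consult the
    # ``tvm.micro`` base classes in your TVM tree before implementing them.
    import tvm.micro


    class MyBoardCompiler(tvm.micro.Compiler):
        def library(self, output, sources, options=None):
            """Step 1: compile one component's sources into a static library."""
            raise NotImplementedError

        def binary(self, output, objects, options=None):
            """Step 2: link the libraries into a firmware image."""
            raise NotImplementedError


    class MyBoardFlasher(tvm.micro.Flasher):
        def flash(self, micro_binary):
            """Step 3: program the image and return a transport to the board."""
            raise NotImplementedError


    class MyBoardTransport(tvm.micro.Transport):
        """Step 4: a serial (UART) byte stream used as the RPC server transport."""

        def open(self):
            raise NotImplementedError

        def close(self):
            raise NotImplementedError

        def read(self, n, timeout_sec):
            raise NotImplementedError

        def write(self, data, timeout_sec):
            raise NotImplementedError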

-This design was chosen to reduce build times for microTVM (the common libraries need to be build
+This design was chosen to reduce build times for microTVM (the common libraries need to be built
only once per candidate operator implementation). In practice, these projects are extremely small
and compile relatively quickly. Compared with the added complexity of this tighter build integration
with TVM, the performance gains are likely not worth it. A future design will consolidate the build
@@ -303,7 +304,7 @@ Measuring operator performance

The TVM C runtime depends on user-supplied functions to measure time on-device. Users should implement
``TVMPlatformTimerStart`` and ``TVMPlatformTimerStop``. These functions should measure wall clock time, so there
-are some pitfalls in implementing this function:
+are some pitfalls in implementing these functions:

1. If the CPU could halt or sleep during a computation (i.e. if it is being done on an accelerator),
a cycle counter should likely not be used as these tend to stop counting while the CPU is asleep.
@@ -313,7 +314,7 @@ are some pitfalls in implementing this function:
4. The timer should not interrupt computation unless absolutely necessary. Doing so may affect the
accuracy of the results.
5. Calibrating the output against a wall clock is ideal, but it will likely be too cumbersome. A
-future PR could enable some characterization of the platform timer by e.g. measuring the internal
+future PR could enable some characterization of the platform timer by, e.g., measuring the internal
oscillator against a reference such as an external crystal.

Future Work
@@ -339,7 +340,7 @@ peak memory usage.
Heterogeneous Execution
-----------------------

-Newer Cortex-M SoC can contain multiple CPUs and onboard ML accelerators.
+Newer Cortex-M SoCs can contain multiple CPUs and onboard ML accelerators.


Autotuning Target