Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TIR][Analysis] Implement IdentifyMemCpy analysis function #13947

Merged
merged 9 commits into from
Mar 4, 2023

Conversation

Lunderberg
Copy link
Contributor

This commit adds a utility function tir::IdentifyMemCpy, which analyzes a tir::For loop and determines whether it is equivalent to a memcpy. If it is equivalent, it will additionally return the source and destination regions of the memcpy.

This utility is initially intended for use in the LowerAsyncDMA pass, but may be useful in other areas in the
future. (e.g. Identifying an entire TIR function as equivalent to memcpy would allow calls to that TIR function to be removed from end-to-end models.)

This commit adds a utility function `tir::IdentifyMemCpy`, which
analyzes a `tir::For` loop and determines whether it is equivalent to
a memcpy.  If it is equivalent, it will additionally return the source
and destination regions of the memcpy.

This utility is initially intended for use in the `LowerAsyncDMA`
pass, but may be useful in other areas in the
future.  (e.g. Identifying an entire TIR function as equivalent to
memcpy would allow calls to that TIR function to be removed from
end-to-end models.)
@tvm-bot
Copy link
Collaborator

tvm-bot commented Feb 10, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

Lunderberg and others added 6 commits February 10, 2023 13:37
Looks like there's some confusion over whether the LWG1203 proposals
should be retroactively applied to C++11 onward.  GCC 11 implements it
for C++17, but GCC 9 does not, so this failed to compile in CI.

https://stackoverflow.com/q/69325059/2689797
namespace tvm {
namespace tir {

std::variant<MemCpyDetails, std::string> IdentifyMemCpyImpl(const For& loop,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit tricky to return the result and error message at the same time. An alternate way is to raise a specific exception and catch from outside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, though I think the std::variant avoids most of the usual issues. Rather than returning both the result and the error message at the same time, it returns either the result or the error message, but never both. I didn't want to throw an exception, as I expect the "error" case to be much more common, which is why the public-facing API exposes a std::optional<MemCpyDetails> instead of the internal variant.

Copy link
Member

@Hzfengsy Hzfengsy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise looks good to me. Please update and fix the CI and we can get this in

src/tir/analysis/identify_memcpy.cc Outdated Show resolved Hide resolved
tests/python/unittest/test_tir_analysis_identify_memcpy.py Outdated Show resolved Hide resolved
Copy link
Contributor

@adstraw adstraw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Hzfengsy Hzfengsy merged commit c0f148a into apache:main Mar 4, 2023
@Lunderberg Lunderberg deleted the identify_memcpy branch March 6, 2023 15:22
vinx13 added a commit to vinx13/tvm that referenced this pull request Mar 27, 2023
* [microTVM] Fix tvmc tutorial (#14076)

This PR applies appropriate changes to make sure the CI fails if micro_tvmc.sh tutorial fails. This issue was captured in #14074.
This PR also makes changes to avoid this breakage in bash script tutorials in future. In addition, this PR fixes the bug in running TVMC tutorial which happened due to renaming zephyr_board to board.

* [MetaSchedule] Introduce Async Pipeline in MultiLevelTiling (#14009)

This PR introduces async pipeline in the current TVM's MultiLevelTiling Rules. This PR is based on apache/tvm#13966, which is already merged. This is because some conv2d workload will use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statement.

1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotate async copy for mlt in supported arch (>= sm80).

In CUDA Core, this PR has a perf boost of around 1T GFLOP/s in most Conv2d test cases and 1T ~ 2T in most GEMM test cases.

All generated codes, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark.

Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` in conv2d NHWC workload, with a RTX 3080 GPU.

**Notice: given the stochastic nature of evolutionary search, perfromance might become worse if enable this PR.**

Workload: Conv2d NHWC

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|

Workload: GEMM NN

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|

* [TVMScript] Use op attribute to control whether to print dtype in TVMScript (#14111)

This PR adds an op attribute `TScriptDtypePrintLocation`, and modifies the dtype printing logic of the builtin op to check this attribute. So that user defined operators can use it to specify how there dtype argument are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.

* [Fix][TVMScript] Fix index of metadata in printed script (#14130)

Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced for more than one times in an IRModule, each reference will have different indices of the metadata array. For example, this code

```
str_imm = T.StringImm("aaa\nbbb\n")
@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm
Module.show()
```

where `str_imm` is referenced three times, will generate such output:

```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```

Each time has a different metadata index. 

This PR fixes this problem by detecting duplicate item in `IRDocsifierNode::AddMetadata`.

* [Pytorch] frontend full_impl fix (#14122)

Minor fix in pytorch frontend to compile gpt2 model, reproduce script.
torch_version = 1.13.1
transformers_version = 4.26.1

```
from transformers import GPT2LMHeadModel
import torch
import tvm
from tvm import relay

inp = torch.ones((1, 128)).to(torch.int64)
input_shapes = [("input_ids", ((1, 128), "int64"))]

model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)
trace_model = torch.jit.trace(model, inp, strict=False)
outputs = trace_model(inp)

mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target='llvm', params=params)

runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0)))
runtime.set_input("input_ids", inp.numpy())
runtime.run()
out = runtime.get_output(0).numpy()
print(out)
print('Done...')
```

Before the fix, the error message
```
Traceback (most recent call last):
  File "gpt2_compile.py", line 13, in <module>
    mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators
    relay_out = relay_op(
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full
    return self.full_impl(data, fill_value, dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl
    fill_value = _expr.const(fill_value, dtype=dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray
```

because `fill_value` is
```
%0 = cast(64, dtype="float32");
power(%0, 0.5f)
```

* [DOCKER] Configurable NDK version support (#14000)

Let the Android NDK version configurable as a command line argument

* [Fix][TIR] SampleCategorical apply-to-schedule (#14133)

This PR is another way to fix the issue described in #14118.

Since we do not have a standard for json file on the format of float
numbers (for example, we cannot require a json file producer to print
the "integer" float numbers with at least one decimal), and the json
parser is not responsible for determining if an integer in a json file
should be parsed to a float or an int, the most convenient way of fixing
the SampleCategorical issue will be allowing both FloatImms and IntImms
as input, and converting all IntImms to FloatImms accordingly.

This PR fixes the issue in this way.

* [Arith] ConstIntBound was incorrectly assuming bounds were over int64… (#13918)

[Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range

This commit improved the floormod and floordiv conversion check to be
simpler for the negative range and adds a  test to cover all integer data types.

* [CMSIS-NN] Reduction in code size of AOT test runner binary (#13815)

* [CMSIS-NN] Reduction in code size of AOT test runner binary


Co-authored-by: Ashutosh Parkhi <ashutosh.parkhi@arm.com>

* [CMSIS-NN] Add a runtime error message (#13643)

[CMSIS-NN] Add a runtime error message

APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN
errors caught in the backend. AOT test runner was improved to observe the contents
of this global variable. A test was added to check for the last set error as part of this
commit.

* [CRT]Cleanup unused macros in crt_config.h.template (#14125)

This PR removes old macros in crt_config.h.template.

* [Fix][Relay] Fix axis transformation in squeeze shape function (#14135)

* fix squeeze shape function issue and add testcase.

* fix lint

* [Unittest] merge test_cp_async_in_if_then_else into test_tir_transform_inject_ptx_async_copy (#14138)

This PR merge two related unittests into one.

* [Frontend][TFLite] Fix conv2d import bug (#14124)

* Fix TFLite frontend bug and add test

* lint

* [ONNX][TORCH] Replace scatter op by scatter_elements (#14019)

* remove scatter attr class

* update pytorch: scatter was replaced by scatter_elements

* remove scatter compute and strategy registration

* remove scatter attrs registration

* update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks

* update oneflow front-end

* update paddlepaddle front-end

* update pytorch utils

* remove front-end scatter definition

* fix scatter strategy for rocm

* small update

* remove scatter definition in back-end

* remove scatter strategy for cuda, gpu. transfer special case to scatter_elements

* fix test

* small fix

* upstream scatter with torch description

* last upstream of scatter in pytorch front-end

* fix reduction attribute in cuda strategy

* set scalar to test instead of tensor. update check for dynamic dim

* skip scalar source check in tests for scatter due to issue on torch side

* remove scatter op implementation from topi/cuda

* remove scatter op implementation from topi. small clean code

---------

Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>

* [TVMScript][Printer] Remove relax prefix for now (#14140)

Remove relax prefix for now

This PR cleans up relax prefix in printer for now.
While these setups are useful and do not cause any
technical debts in the codebase. We remove it given requests.
They can be added back to unity branch and later as part
of upstream

* [microNPU] Sum legalization support (#13997)

Supports legalizing a relay sum operation to an equivalent series of NPU operations. It supports case with int8 output type and channel axis.

* [Fix][MetaSchedule] Fix redundant stages in async pipeline for mlt (#14143)

This PR fixes redundant stages if visiting `InitializeWithTuneContext`
multiple times.

* [COMMUNITY] Cheng Wen -> Reviewer (#14153)

Please join me @chengven027-intellif as a new Reviewer in TVM.
Cheng has contributed to ONNX/PyTorch frontend and Relay passes, making TVM support more input models.

- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)

* [Runtime] Fix high RAM usage when saving / loading paramters of big models   (#14147)

* add load_params_from_file

* add save_params_to_file

* avoid making another copy in save_params

* black

* add test

* update doc

* [Relay][Frontend] Span Filling PyTorch (#14050)

* [Relay][Frontend] Span Filling PyTorch

- Construct debug name of C graph instruction as the source name of span for pytorch model.
- To get the reference of renamed nodes. Add a function to export the converted C graph after conversion.
- Add structural_equal comparisons with and without set_span to the existing test cases.
- Add span test cases for frequent conversions.
- Add span test case for exporting model parameter.

* [SpanFillingPyTorch]

- Return TupleGetItem expr from TupleWrapper with the span of its Tuple.
- Add None type symbol in set sapn for certain conversion.
- Add current_op member varible to PyTorchOpConverter to track which op
  is converting for pytorch frontend.

* [SpanFillingPyTorch]

- Fix the error caused by the quantized params not found after renaming
  the debug name of C graph.

---------

Co-authored-by: Joey Tsai <chunit@qti.qualcomm.com>

* [TRT][BYOC] allow strided_slice ops on selected dimensions (#14142) (#14144)

* [ONNX][TOPI] Add `DFT` operator (#13999)

* init convertor for DFT

* init test for DFT

* init DFT operator in Relay

* update topi implementation for DFT

* clean up

* update ONNX frontend

* support  attribute

* fix error: Expected Array[Tensor], but got Array[index 0: Array]

* support inverse, onsided, dft_lenght

* update tests for DFT

* update TOPI test for DFT

* add documentation

* fix pylint

* fix cpplint

* fix cpplint

* fix threshold for FP16 (ARM)

* add CUDA compute

* fix pylint

* fix doc string

* code review fixes for ONNX front-end

* code review fixes for TOPI

* rename: stft.py -> signal.py

* pass input_shape and output_shape to verify_dft

* [CRT][microTVM] Enable USMP by default for AoTExecutor + CRT runtime (#14107)

This PR enables USMP by default when AoTExecutor and CRT runtime are selected. Check forum discussion about this change: https://discuss.tvm.apache.org/t/enable-usmp-by-default-in-aot-executor-with-runtime-crt/14406

As a result, the workspace memory in mlperftiny project type is removed since memory allocation is not required. If we keep this workspace, the model doesn't fit since some of the memory is allocated twice.

* [Android] Fix using system libraries in Android apps (#14145)

- Starting from API 31, using `uses-native-library` is required if we
  want to open system library:
  https://developer.android.com/about/versions/12/reference/compat-framework-changes#enforce_native_shared_library_dependencies

  We should specify OpenCL library in `user-native-library` in all
  applications where OpenCL backend might be used.

- Updated README files and describe how to fix synchronization issues
  in Android Studio.

* [microTVM]Enable TVMC micro with AoT Executor  (#14077)

This PR enables AoT Executor for tvmc micro compilation.

* [bugfix] Fix the write buffer scope of `mma_store_impl` (#14174)

fix

* [Relay] Enhance EliminateCommonSubexpr to support Tuple argument (#14169)

If an argument of a call is a Tuple, we should check its fields.
Different tuples with the same fields should be treated as same inputs

* [TIR] Fix typo in doc (#14178)

* [microTVM] Use QNN schedules to give SOTA performance (#13752)

In #13242, I rewrote microTVM's convolution schedules to give a major improvement in performance. While I demonstrated in tests that my changes worked, they could not be used with relay.build.

This pull request expands the functionality of #13242 and adds new legalize and alter_op passes to take advantage of the quantized schedules. This dramatically improves performance on some models, dramatically cuts RAM usage, and removes the need for autotuning on microTVM. More specifically, for the vww model from MLPerf Tiny running on the nucleo_l4r5zi, this pull request:

- Improves untuned performance from 1741 ms to 137 ms - a 6.8x improvement!
- Improves tuned performance from 337 ms to 137 ms.
- Sets a new state-of-the-art for MLPerf Tiny, beating Plumerai's previous 208 ms record
- Reduces RAM consumption by 73 KB (a large amount on microcontrollers!) by eliminating intermediate buffers.
- Reduces flash consumption for model weights by 5x
- Slightly improves accuracy

@mehrdadh has kindly tested these changes himself, and has confirmed my 137 ms figure.

To enable the schedules that grant these performance improvements, this pull request:

1. Adds out_layout support to the regular and depthwise conv2d schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242.
2. Generalizes the schedules from [microTVM] Modernize Arm Cortex-M convolution schedules #13242 to be more widely applicable.
3. Adds a layout alternation pass to ensure regular and depthwise conv2d schedules always get their desired input formats.
4. Adds a conv2d -> depthwise conv2d -> unpadded conv2d rewrite step to remove empty channels from conv2d operators.
5. Adds a conv2d -> average pool -> dense rewrite step to remove empty channels from conv2d operators.
6. Adds an alter_op pass to fold padding into a separate Relay operator.

* Add v0.11.0 docs link to site (#14181)

Update the version menu in TVM documentation to add a specific v0.11.0 release docs link.

* [TIR] Allow TransformLayout with non-inversible index map (#14095)

* [TIR] Allow TransformLayout with non-inversible index map

TransformLayout requires the index map to have inverse map that can be
calculated by the analyzer in order to check whether padding is added.
However, such check doesn't always work for all cases because of
limitation of the affine analysis that can only handle a set of
supported patterns. In some cases, even if the index map doesn't
introduce padding, the schedule primitive throws `TransformationIntroducesPaddingError` because it
fails to calculate the inverse index map.

It is safe to allow buffer being padded without providing pad_value
because the original loop extent is not changed and the padded region is not accessed.
This PR changes the behavior of `TransformLayout` to allow
non-inversible index map.

Previous discussion:
https://discuss.tvm.apache.org/t/conflict-free-shared-memory-permutation-in-tensorir/13959/9

* add assume_injective_transform option

* Apply suggestions from code review

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

---------

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [TIR][Analysis] Implement IdentifyMemCpy analysis function (#13947)

* [HotFix][MetaSchedule] Turn off database shash check (#14188)

At this moment, the structural hash values of IR in TVM is platform
dependent (e.g., the hash values of a String may differ on different
platforms). In our recent practice, we found this an obstacle for us
to apply one existing database on different platforms (say we tune
an IRModule with MetaSchedule on Metal, and then apply the database
on CUDA, etc.)

To clear this obstacle, we decide to remove the shash value check. The
purpose of that check is mainly to ensure safety, and thus turning it
off will make no difference in terms of using MetaSchedule in most of
the cases that we can imagine.

Meanwhile, it is equally important that we need to make our structural
hash platform independent. There are plans ongoing for this target.

* [TOPI] Batch Norm Training Mode (#14190)

Prior to this PR, TOPI batch_norm only supports inference.

This PR adds training: bool flag and momentum: float argument to support training mode (update moving_mean / var and return), which aligns with torch.nn.functional.batch_norm.

* [TOPI] Group normalization (#14193)

As more and more ML models nowadays contain the group normalization
computation, we find it beneficial to introduce this op to TOPI level.
It will enable us to optimize the group normalization operation as a
whole in a more convenient way.

This PR introduces the group normalization op to TOPI. The group norm
operation was introduced in https://arxiv.org/abs/1803.08494. The
implementation uses tuple reduction, same as the implementation of layer
norm. Implemented with tuple reduction, the corresponding generated TIR
function can be optimized by cross-thread reduction or rfactor through
MetaSchedule.


Co-authored-by: Bohan Hou <spectrometerh@gmail.com>

* [Fix][TIR] LowerCrossThreadReduction with write-back predicate (#14199)

Prior to this PR, the cross-thread reduction lowering pass does not
add a store predicate to the write-back block. This is in consideration
that for a certain write-back buffer position, all values being stored
(by all the threads) in the write-back block are the same. Since all
threads are writing the same value, we were assuming that not having a
write-back block predicate is fine, because the result will not be wrong
in any way.

However, recently we noticed that some GPU backend compiler will capture
this behavior (multiple threads writing a same position) as a race
condition and thus throw compilation error. The compiler does not take
the fact that all values being stored are the same, and insist on
complaining.

This means that we will still need the write-back block predicate to
make things work. And this PR does this change. I have done integration
tests locally to make sure that the generated kernels is right and
produces the right results numerically.

* [Unity] Relax VM (#13878)

This PR implements a flexible register-based VM to execute relax programs with dynamic shape and control flow. Design: https://github.com/tlc-pack/relax/wiki/Relax-VM-Design.

Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Steven S. Lyubomirsky <slyubomirsky@octoml.ai>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>

* [Unity] Relax expressions and types (#13901)

* [Unity][IR] First-class StructInfo (#13907)

* [Unity][IR] First-class StructInfo

Relax tracks structural information (such as tensor shape) via `StructInfo` about the values in Relax.

* Fix rust build

---------

Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][CI] Unity specific jenkins setup (do not upstream to main) (#13910)

This PR setup a unity specific jenkins with minimum jenkinsfile
without sharding and disables most of the tests to reduce overall
cost. We can add tests of unty branch by configuring the specific
groovy file.

* [Unity] Basic StructInfo Analysis and Expr construction (#13916)

[Unity] Basic StructInfo Analysis and Expr construction.

This PR adds struct info analysis and expr support.
These are logics to construct the IR node and perform
struct info related analysis.

Testcases are added to cover the IR node construction
and related struct info analysis checks.

Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Altan Haan <altanh@cs.washington.edu>
Co-authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: masahi <masahi129@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Steven S. <Lyubomirsky slyubomirsky@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity] Relax BlockBuilder and ExprMutator (#13926)

This PR adds BlockBuilder: the core data structure to construct Relax AST, and ExprMutator: performs AST mutation for implementing transformation passes.

Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: masahi <masahi129@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. <Lyubomirsky slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity] Relax TVMScript Parser. (#13932)

This PR adds the TVMScript parser/ir_builder support based on the blockbuilder.

Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Junru Shao <junrushao1994@gmail.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>

* [Unity] Relax TVMScript Printer (#13944)

This PR introduces Relax as a dialect supported by the TVMScript
Printer. Some caveats:
- Needs to rebase to mainline before merging.
- Some tests are skiped because some operators are not upstreamed to
  the unity branch yet.

Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-authored-by: Hongyi Jin <3231950289@qq.com>
Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity] Relax VM codegen (#13954)

* [Unity] Relax VM shape lowering pass (#13956)

This PR introduces Relax `FunctionPass` and `DataflowBlockPass` API, and the `VMShapeLower` pass to lower the shape expression in Relax to TIR functions and VM shape heap builtin functions.

Co-Authored-by: Ziheng Jiang <ziheng@apache.org>
Co-Authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. <Lyubomirsky slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>

* [Unity] e2e Relax minimum build flow (#13961)

This PR introduces the e2e Relax lowering flow (`relax.vm.build`). Tests for each pass in the flow are added.

Co-Authored-by: Altan Haan <altanh@cs.washington.edu>
Co-Authored-by: Andrew Liu <andrewlliu@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Jiawei Liu <jaway.liu@gmail.com>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Steven S. <Lyubomirsky slyubomirsky@octoml.ai>
Co-Authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-Authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Ziheng Jiang <ziheng@apache.org>

* [Unity][TVMScript] Use explicit `R.shape` in TVMScript (#13979)

As we've introduced `arg_sinfo` in CallNode, implicit shape constructor
is not widely used in TVMScript. This PR removes the implicit shape since
it may cause confusion between shape and tuple.

* [Unity] Relax op: index (#13987)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the tensor indexing operators.

* [Unity] Relax op: datatype (#13986)

* [Unity] Relax op: set (#13990)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the set operators.

Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity] Relax op: image (#13994)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the image operators.

* [Unity] Relax op: arithmetic, comparison (#13983)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the unary, binary and ternary arithmetic and
comparison operators.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Chaofan Lin <1713833595@qq.com>

* [Unity] Relax op: statistical (#13991)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the statistical operators.

* [Unity] Relax op: neural networks (#13993)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the neural network operators.

* [Unity] Relax op: creation (#13984)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the tensor creation operators.

* [Unity] Relax op: linear algebra (#13988)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the linear algebra operators.

Co-authored-by: Siyuan Fneg <Hzfengsy@sjtu.edu.cn>

* [Unity] Relax op: search (#13992)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the search operators.

* [Unity] Relax op: manipulation (#13989)

This PR is about the high-level tensor computation operators in Relax.

This PR includes the tensor manipulation operators.

Co-authored-by: Prakalp Srivastava <prakalp@octoml.ai>

* [Unity] NestedMsg Support utility (#13995)

This PR introduce NestedMsg to robustly handle nested-tuple analysis.

Relax support nested tuple structures in the IR.
Nested tuple structure is important to support advanced groupings in
cases such as gradient calculation and other scenarios.

The possible presence of nested tuple does mean that we need to to
robustly handle analysis that contains nested tuple structures in a dataflow graph.

This PR introduces a NestedMsg<T> class that corresponds to a possibly
nested message tuple for a given leaf message class T.
We also introduces various helper functions to compose and decompose messages.

Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>

* [Unity][Pass] Operator Fusion Passes (#14001)

[Unity][Pass] Operator fusion passes

This PR introduces three passes for operator fusion:
1. AnnotateTIROpPattern: analysis the operator kind from PrimFunc.
2. FuseOps: fuse operators for Relax functions, which adds a new fused
relax primitive function.
3. FuseTIR: fuse corresponding TIR PrimFuncs for the fused relax.

* [Unity][Pass] LambdaLift pass (#14012)

* [Unity][VM] Supporting "compiled" exec mode. (#14015)

[VM] Supporting "compiled" exec mode.

This PR adds support of "compiled" mode to the VM. The compiled mode translate
the relax function into TIR function and drive it through the TIR function.

It is different from the micro AOT codegen, which generate TIR code that targets
the micro C runtime environment and useful for resource limited settings with
smaller set of features. Both leverages the low-level TIR build that is also shared with TensorIR.

The current implementation targets full TVM (VM) runtime, that comes with PackedFunc,
object, tuple, closure and all kinds of rich structure support. This also mean that
we can leverage the full runtime support to handle things like allocation, dynamic shape,
easy plugins and python interaction, which are not available in more limited runtime.

The user directly use the same API to load the generated code regardless of
compiled mode or bytecode. And just need to change one line

```python
ex = relax.vm.build(mod, target, exec_mode="compiled")
```

The simplicity is thanks to the TVM runtime archiecture that allows us to compose things together in objects.
The only difference is how the PackedFunc of high-level driving is being provided. In the case of bytecode
it is normal interpretation and in the case of compiled mode it is TIR.

It is a complete implementation Unit-testcases are added. All codegen build tests are updated to include two
exec_modes and have passed locally.

Co-authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][Pass] BindParams pass, FoldConstant pass (#14016)

This PR introduces FoldConstant/BindParam passes.

Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Hongyi Jin <3231950289@qq.com>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass][TuningAPI] Introduce TuningAPI and MetaSchedule pass (#14014)

Add TuningAPI and MetaSchedule tuning pass

* [Unity] Relay -> Relax translator  (#14026)

This PR implements a Relay to Relax translator, which allows us to import Relay workloads to Relax for benchmarking and development purposes (tests and examples are added).

* [Unity][Pass] Normalize Pass (#14031)

This PR implements relax `Normalize` Pass, which allows users to transform Relax IR to normal form, i.e., the expressions are normalized (no nesting and hence the AST is in ANF), and all `checked_type_` and `shape_` of expressions are available. (tests are added).

Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>

* [Unity][BlockBuilder] CallTE convert PrimValue args  (#14028)

Prior to this PR, the `call_te` of BlockBuilder is not capable of converting PrimValue arguments and directly rejects PrimValues instead. This PR fixes this behavior with PrimValue conversion support and one regression test.

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass] Wellformed Analysis (#14032)

This PR implements relax wellformed analysis, which checks if the IRModule is well-formed. (tests and examples are added).

Co-Authored-by: Ruihang Lai <ruihangl@cs.cmu.edu>
Co-Authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-Authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>
Co-authored-by: Yong Wu <yongcale@gmail.com>
Co-Authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-Authored-by: Yixin Dong <ubospica@gmail.com>
Co-Authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-Authored-by: Prakalp Srivastava <prakalp@octoml.ai>
Co-Authored-by: Junru Shao <junrushao1994@gmail.com>

* [Unity][TVMScript] Move tir/relax import in script out of __init__.py (#14033)

Prior to this PR, `python/tvm/script/__init__.py` imports both tir and relax
submodules. This leads to the phenomenum that when people does
```python
from tvm.script import tir as T
```
, the relax submodule will be implicitly visited by `__init__.py` as well.

Since TIR does not rely on Relax, it is good not to import both of them
at the same time. (This can prevent cyclic imports sometimes.)

This PR does this decoupling by introducing two files

* `python/tvm/script/relax.py`
* `python/tvm/script/tir.py`

and removing the imports from `python/tvm/script/__init__.py` and
`python/tvm/script/parser/__init__.py`. With this change, we force people to
manually do `from tvm.script import tir` and `from tvm.script import relax`
to use TVMScript parser, which is right our conventional way.

* [Unity][Pass] Operator legalization (#14029)

This PR is the operator legalization pass, which transforms high-level
operator calls to `call_tir`s of corresponding low-level TIR PrimFuncs.

- The legalization pass provides customizability, which enables people
to pass in a customized legalization map to override the default
legalization method.

- The legalization supports symbolic shape. (At this moment only pooling
does not support symbolic shape, as TOPI pooling does not support. This
needs to be fixed in followup PRs.)


Co-authored-by: Chaofan Lin <siriusneo@sjtu.edu.cn>
Co-authored-by: Yixin Dong <ubospica@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Op] Add ShapeExpr Tests for Reshape Op (#14035)

This PR specially checks the relax.reshape operator when the input is a ShapeExpr.

* [Unity] Initial PyTorch Frontend (#14037)

[Unity] Initial PyTorch Frontend

This PR introduces initial pytorch frontend components of Relax, including
- a FX translator that translates a Torch FX graph module to an TVM IRModule,
- a Relax-backend of Torch Dynamo, which brings the mechanism to build PyTorch model using Relax compilation pipeline,
- a pipeline prototype that contains the collection of pre-defined pipelines that optimizes and lower IRModule before passing to minimum build.

Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
Co-authored-by: Tianqi Chen <tianqi.tchen@gmail.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][Pass] Block-level static memory planning (#14038)

This PR introduces the static memory planning pass on binding block level, as well as an analysis function that estimate the memory usage after the memory planning pass. It supports the following features: nested-tuples,  reuse memory of the input of reshape ops, an estimator that returns total memory size needed to be allocated before and after memory planning, as well as the number of tensors / memory blocks to be allocated before and after memory planning.

The estimation is static -- it does not consider control flows (such as “if” and cross-function calls). It simply accumulates the size of every alloc_tensor and alloc_storage.

We will produce “`relax.memory.alloc_tensor/storage`” as the results produced by memory planning.

* [Unity] Disallow inline prim_func in relax IR (#14040)

Disallow inline prim_func in relax IR

* [Unity] Update tests to adapt to latest TVMScript syntax (#14039)

Given that some latest changes of TVMScript syntax have been merged,
some test files are now containing deprecated uses of TVMScript syntax.
This PR updates the test files with latest TVMScript syntax so that
running the tests will not trigger deprecation warnings.

Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>

* [Unity] Relax dataflow pattern language (matching) (#14041)

The dataflow pattern language for Relax (originally from https://github.com/tlc-pack/relax/pull/163).

The implementation splits patterns into two parts:

- Match an Expression: match an expression syntactically (MatchExprPattern, i.e., DFPatternMatcher);
- Match a Graph: match a graph (cross multiple VarBinding) topologically (MatchGraphPattern);

* [Unity] Statement rewriter for DataflowBlock (#14043)

This PR implements a few APIs to quickly perform statement-level mutation:
`add`/`remove_unused`/`remove_all_unused`/`replace_all_uses`.
It also implements `remove_all_unused` to remove dead statements inside `DataflowBlock`.

* [Unity][Pass] FuseOps FuseTIR fixes (#14044)

This PR fixes two bugs of FuseOps and FuseTIR:

It fixes FuseOps who only rewrites the "main" function of the
IRModule. After the fix, FuseOps now goes through each non-primitive
Relax function. Test cases for both FuseOps and FuseTIR sides are added
so ensure that both of the two passes work for cases of multiple Relax
functions.

It also fixes FuseOps and FuseTIR who did not take "call_dps_packed" style
"call_tir" into account. The previous behavior will directly downcast
the first argument of "call_tir" to GlobalVar, which is not right when
the "call_tir" is in "call_dps_packed" stype and the first argument is
a PackedFunc. With this fix, FuseOps and FuseTIR will skip such
"call_tir"s. Tests for both CallTIR and CallOps are added accordingly.

* [Unity][TVMScript] Overload `__neg__` for relax expr (#14045)

This PR overloads `__neg__` given that `relax.negative` is now supported. Besides, it adds `test_op_misc.py` and brings tests for calling overloaded operators.

* [Unity][VM] Add per-op profiling support  (#14053)

Adds per-op profiling support to Relax VM, in a way similar to how Relay VM is instrumented via the common profiling infra in the runtime. Profiling over RPC is supported.

Example output:
```
Name                          Duration (us)  Percent    Device  Count                                 Argument Shapes  
conv2d1                          705,779.00    51.22  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 54, 54]  
conv2d                           669,589.00    48.60  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]  
relu                                 683.00     0.05  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]  
relu1                                679.00     0.05  hexagon0      1  float32[1, 64, 54, 54], float32[1, 64, 54, 54]  
vm.builtin.check_tensor_info          28.00     0.00  hexagon0      1                          float32[1, 64, 56, 56]  
vm.builtin.match_shape                25.00     0.00  hexagon0      1                          float32[1, 64, 56, 56]  
----------                                                                                                             
Sum                            1,376,783.00    99.93                6                                                  
Total                                  0.00               cpu0      1                                                  
Total                          1,377,809.00           hexagon0      1                                                  

Configuration
-------------
Number of threads: 4
Executor: VM
```


The original PR: https://github.com/tlc-pack/relax/pull/422

* [Unity][BYOC] Add pattern-based partitioning pass (#14054)

This adds a new pass, FuseOpsByPattern, which applies pattern matching to each function in the given module, and groups matched expressions into a new function. The end result is similar to FuseOps, but fusion is driven completely by
the provided patterns. The implementation also reuses OperatorFusor used by FuseOps to create grouped functions from partitioned groups, further illustrating the similarity between the two passes.

The new pass will serve the same role the MergeComposite pass plays in Relay BYOC - grouped functions are annotated with the "composite" attribute to denote what operations a given function consists of, and offloaded to external backends. But it can be also useful in non-BYOC settings, for example to support advanced fusion that the op-kind based one doesn't handle (fused MHA, conv2d / gemm + reduction fusion, etc).

The original PR: https://github.com/tlc-pack/relax/pull/366

* [Unity] Relax op: collapse sum (#14059)

This PR brings high-level operators `relax.collapse_sum_like` and `relax.collapse_sum_to` which is useful when doing AD in Relax. To achieve this, it exposes the interface of `topi.collapse_sum`. Moreover, this PR also implements the legalization of these op and adds corresponding tests.

* [Unity][Fix][Pass] Fix FuseOps for lack graph edges (#14058)

This PR fixes a mistake of #14044. In #14044, in VisitLeaf of graph
construction of FuseOps, we first check if the input node is Leaf and
then check if it is Tuple. This is not right: as Tuple is not
categorized as one leaf node, when the input node is a Tuple, the
function will return since the input is not a LeafNode. And the check
for Tuple will thereby never holds.

It is quite interesting that our existing unit tests fail to filter this
mistake out. I add a regression test for this case, which can ensure
that the tuple is always visited.

* [Unity][Pass] Remove Unused Function (#14061)

This PR implements a pass to clean up unused functions. 

Co-authored-by: masahi <masahi129@gmail.com>

* [Unity][BYOC] Add pass to merge composite functions to offload large subgraphs (#14062)

This PR adds a pass that merges neighboring calls to composite functions offloaded to the same external backend into one function. This is important for backends that want to receive as large subgraph as possible, for example TensorRT. It plays the same role as the MergeCompilerRegion pass in Relay BYOC does, and the algorithm follows the same idea described in https://discuss.tvm.apache.org/t/relay-improved-graph-partitioning-algorithm/5830. 

Original PR
https://github.com/tlc-pack/relax/pull/372

Substantial improvement by @yelite 
https://github.com/tlc-pack/relax/pull/411

Related fix PR by @yelite 
https://github.com/tlc-pack/relax/pull/406

Co-authored-by: Lite Ye  <yelite958@gmail.com>

* [Unity][Frontend] Annotate number of non-static input of FX function (#14067)

* [Unity][Transform] Add LiftTransformParams pass (#14069)

This PR added a pass `LiftTransformParams`. It allows to compile the
end-to-end model without weights provided. The idea is annotate the
input parameters that are weights, and identify and lift the
transformations to weights, and compile it to a separate function
`transform_params` that can be executed in runtime. Users can run
`transform_params` with weights to get the weights for the optimized
model as a prep step before the deployment. In this way, we perform the
same optimizations and defer the weight transformations to the user
side, while the overhead of the deferred weight transformation can be
ignored as it only need to be run once.

This pass is integrated with the default `vm.build`. It is optional and
only necessary when the parameters are kept as inputs when importing the
model from the frontend.

* [Unity][BYOC][Pass] RunCodegen and TensorRT  (#14078)

This PR introduces the fundamental workflow for BYOC and integrate TensorRT as a demonstration.

* [Unity][Pass] Canonicalize Bindings (#14079)

It may be useful for some passes to collapse chains of definitions, particularly after other compiler transformations that may reduce or simplify some expressions.

This pass will take chains of definitions and replace references to later definitions to the original one. It works by checking `LookupBinding` for each var use-site and replacing the var with its definition if the definition was another var. Additionally, `MatchCast` bindings where the LHS and the RHS are guaranteed to match at compile time are canonicalized into ordinary `VarBinding`s.

Example:
```python
y = x
z = y
w = z
o = w
p = o
```
Will be replaced with
```python
y = x
z = x
w = x
o = x
p = x
```

Original PR: https://github.com/tlc-pack/relax/pull/233

Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>

* [Unity] Add testcases for `expr_args_converter` (#14080)

This is a missing test file when we added the `expr_args_converter`. This
PR adds it back.

* [Unity][BYOC] Add CUTLASS backend (#14081)



Co-authored-by: Lite Ye  <yelite958@gmail.com>

* [Unity][BYOC] Add DNNL backend (#14082)

This PR adds dnnl backend to the unity flow.

* [Unity][Op] `log_softmax` and `cross_entropy_with_logits` (#14083)

This PR introduces two high-level operators log_softmax and cross_entropy_with_logits, which are important when we are calculating CrossEntropyLoss (in torch).

Co-authored-by: Yixin Dong <ubospica@gmail.com>

* [Unity][Analysis] TIR pattern kind analysis for multi-buffer write block (#14075)

This PR supports TIR pattern kind analysis for TIR blocks which write
to multiple buffers, which is helpful for normalization operators like
layernorm, groupnorm, etc.

Prior to this PR, the analyzer does not support a blocks which write to
multiple buffers. On seeing such a block, the analyzer simply sets the
analysis result to "opaque". With this PR, on seeing a block which
writes multiple buffers, the analyzer will check if all the BufferStores
have the same indices. And it will only set the result to "opaque" when
the BufferStores have different indices.

By doing this, the analysis works for common cases where a block may
write to multiple buffers, like layernorm or groupnorm.

Besides the unit test for the analysis itself, this PR also adds a unit
test for FuseOps pass, make sure that a "layernorm + relu" pattern can
be fused together.

* [Unity][Fix][Pass] FoldConstant with DCE in dataflow block (#14087)

The current FoldConstant pass does not support removing unused bindings
in the post-folding function. Therefore, for large real-world models,
the built executable will be overlarge because of the redundant unused
constants.

This PR removes the redundant unused constant bindings in FoldConstant
by using the analysis function "RemoveAllUnused".

Note that "RemoveAllUnused" only works at dataflow block level.
Therefore FoldConstant will not remove unused bindings outside of
dataflow block as well.

* [Unity] Refactor Relax Build JIT UX (#14088)

This PR refactors relax build so it get exposed at the opt-level.
We also introduces an explicit jit functionality to handle
live loading of compiled artifacts from cutlass.

We also move relax vm to runtime so it can be clearly isolated
from the rest of the compiler stack.

* [Unity][Relax] Set Shape Function to Be Host Function (#14090)

Set shape function to be host func.

* [Unity] Fix typo in the comment (#14096)

* [Unity] Lower `shape_of` to a builtin (#14093)

This PR lowers shape_of op to a Relax VM builtin, and changes a utility function to take StructInfo as input.

Co-authored-by: Steven S. Lyubomirsky <slyubomirsky@gmail.com>

* [Unity] Relax Recursive function (#14092)

This PR adds TVMScript local recursive function support. It also update lambda lifting pass. Removed CalledGlobalVars, it was not used anymore. It also updates well-form pass to allow un-defined vars for recursive call

* [Unity][Layout] Add layout transformation analysis for PrimFunc (#14066)

* [Layout] Add layout transformation analysis for PrimFunc.

This change adds a PrimFunc level analysis to suggest layout transformations to block and buffers in the PrimFunc based on the layout transformations to PrimFunc outputs.

* Add support for multiple blocks such as split op.

* Add negative tests and increase coverage.

* fix warning message

* fix lint

* remove unused header

* Address comments.
Moved some utility functions to support/array.h
improve doc

* fix deprecation warn T.var("int64") to T.int64()

* address comments

* [Unity] Remove attributes of relax.print, assert and unique (#14101)

Remove the attributes of operators assert, print and unique.
Use PrimValue as substitute.

Co-authored-by: Steven S. Lyubomirsky [slyubomirsky@gmail.com](mailto:slyubomirsky@gmail.com)
Co-authored-by: Prakalp Srivastava [prakalp@octoml.ai](mailto:prakalp@octoml.ai)

* [Unity][BYOC]Add relax backend pattern registry (#14106)

* Add relax backend pattern registry

* Add doc

* [Unity] Update tests again to adapt to latest TVMScript syntax (#14115)

* finished

* fix

* rollback merge_composite_functions

* [Unity][Fix] Fix bug in MergeCompositeFunctions (#14117)

Currently `MergeCompositeFunctions` will modify the map while iterating over it, and that makes tests/python/relax/test_transform_merge_composite_functions.py does not pass. This PR fixes this bug.

* [Unity][BlockBuilder] Add `name_hint` argument for `emit` and `emit_output` (#14126)

This PR adds `name_hint` argument for `emit` and `emit_output` API of Relax blockbuilder. The argument exists in the C++ side but not exposed to Python side (So user who use the Python bb.emit will let `name_hint` be `""` by default).

Co-authored-by: Yixin Dong <ubospica@gmail.com>

* [Unity][WEB] Relax vm on web runtime (#14131)

This PR brings initial relax vm support on web runtime

* [Unity] Add Global info (#14132)

* [Unity][BYOC] Add transposed matmul support to Relax CUTLASS BYOC (#14128)

Add transposed matmul support for Relax CUTLASS

* [Unity][TVMScript] emit_te sugar (#14123)

This PR adds R.emit_te meta-programming mechanism to emit a topi operator from TVMScript

* [Unity][BYOC] Assign group to unused bindings and ignroe PrimFunc (#14139)

* [Unity][BYOC] Assign group to unused bindings and ignroe PrimFunc

* Update fuse_ops.cc

* [Unity] Add callback to FuseOpsByPattern to check match result is accepted (#14109)

* [Unity] Add callback to FuseOpsByPattern to check match result is accepted

* add callnode to callback args

* update pattern registry

* fix

* [Unity][Legalize] Fix Scalar Constant Legalization (#14127)

This PR fixes the issue of loss of data type during Legalization. Previously, if we use a constant scalar in operators like `multiply`, it will automatically be converted to a python data type variable, which may lose its original data type. For example, `float16` may become python `float` and be interpreted as `float32` later.

This is now fixed by avoiding scalar value conversion. The conversion could be added back once we have better support for scalar prim value.

Co-authored-by: Sunghyun Park <49998730+sunggg@users.noreply.github.com>
Co-authored-by: Wuwei Lin <wuwei@apache.org>

* [Unity][Pass] Enhance constant folding to fold relax ops by evaluating them. (#14146)

* [Unity][Pass] Enhance constant folding to fold relax ops
by evaluating them.

This uses the registered legalization function attached to
the op to lower it to call_tir and uses the existing call_tir
folding mechanism to fold it.

This kind of op folding is only allowed within dataflow block
as ops could have side-effects.

Limitations:
* This currently does not support folding ops
that could lower to multiple call_tir bindings.
* Folding by evaluating ops is not always beneficial.
We need a heuristic to check if it is useful. This is
not implemented yet and folding is always allowed
by evaluating expressions.

* fix ci error

* fix doc

* fix bug

* [Unity][Debugging] AST printer (#14152)

This PR transfers over the AST printer from tlc-pack/relax. The AST printer is a debugging tool that prints out a Relax AST in a precise and human-readable format, which can be helpful for debugging the parser or various passes.

Co-authored-by: Yuchen Jin <yuchenj@cs.washington.edu>
Co-authored-by: Lesheng Jin <34279105+LeshengJin@users.noreply.github.com>
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Co-authored-by: Ruihang Lai <ruihangl@cd.cmu.edu>
Co-authored-by: Tianqi Chen <tqchen@users.noreply.github.com>

* [Unity][Pass] Support Symbolic Shape Deduction during BindParam (#14154)

`BindParam` replace function params to constant nodes. However, it will
drop the shape information of the params, considering the following case:

```python
@R.function
def main(
    x: R.Tensor(("batch", "m"), dtype="float32"),
    w0: R.Tensor(("n", "m"), dtype="float32"),
    b0: R.Tensor(("n",), dtype="float32"),
    w1: R.Tensor(("k", "n"), dtype="float32"),
    b1: R.Tensor(("k",), dtype="float32"),
) -> R.Tensor(("batch", "k"), dtype="float32"):
    batch = T.Var("batch", "int64")
    k = T.Var("k", "int64")
    m = T.Var("m", "int64")
    n = T.Var("n", "int64")
    with R.dataflow():
        lv0 = R.call_tir("linear0", (x, w0, b0), out_sinfo=R.Tensor((batch, n), dtype="float32"))
        out = R.call_tir("linear1", (lv0, w1, b1), out_sinfo=R.Tensor((batch, k), dtype="float32"))
        R.output(out)
    return out
```

The current pass will simply drop the symbolic var `n`, `k` and cause
undefined vars during build as
```python
@R.function
def main(x: R.Tensor((1, "m"), dtype="float32")) -> R.Tensor(dtype="float32", ndim=2):
    m = T.Var("m", "int64")
    n = T.Var("n", "int64")
    k = T.Var("k", "int64")
    with R.dataflow():
        lv0 = R.call_tir("linear0", (x, metadata["relax.expr.Constant"][0], metadata["relax.expr.Constant"][1]), out_sinfo=R.Tensor((1, n), dtype="float32"))
        out = R.call_tir("linear1", (lv0, metadata["relax.expr.Constant"][2], metadata["relax.expr.Constant"][3]), out_sinfo=R.Tensor((1, k), dtype="float32"))
        R.output(out)
    return out
```

This PR updates the pass to bind the symbolic shape during binding.

* [Unity][Analysis] Checking function return struct info in well-formed check (#14155)

The current well-formed misses the check of function return struct info,
which may mistakenly pass the check if there are undefined vars in the
function return struct info.

* [Unity][BYOC] Use Relax legalize + CPU build for reference in tests (#14162)

* clean dnnl test

* clean trt test

* clean cutlass test

* fix gelu legalize for fp16

* use memoize in dnnl and trt tests

* [Unity] Add bind_constants option to FuseOpsByPattern (#14151)

* [Unity] Add lift_constatns option to FuseOpsByPattern

* lift_constants -> bind_constants

* [Unity][Analysis] Analysis for detecting recursion in Relax (#14149)

* DFS based attempt to detect mutual recursion

* Use Johnson's circuit-detecting algorithm instead

* Fix control flow test

* Detect all recursion anyway

* Add new test cases for simple recursion

* Fix mistake in test case

* Include missing dependencies

* Remove trailing whitespace

* Dependencies are simply references, not necessarily calls

* More trailing whitespace

* Newline at end of file

* Fix spacing in docstring

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

---------

Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>

* [Unity][BYOC] Add batch matmul support to Relax CUTLASS BYOC (#14166)

* Add batch matmul support to Relax CUTLASS BYOC

* Allow more dtypes

* Fix tests

* Revert how to get batch attr

* [Unity][Op] Full support of Relax op `power` (#14171)

This PR provides a full support of `R.power` including op registering, legalization, overloading `__power__` for Expr and torch fx frontend.

* [Unity][Analysis] Restore Python bindings for var analyses (#14180)

Restore Python bindings for var analyses

* [Unity][OP] Add an operator for fused multi head attention (#14150)

* [Unity][OP] Add an operator for fused multi head attention

This PR introduces the new relax operator `R.nn.attention` for fused multi head attention, and the support of fused multi head attention to relax cutlass BYOC. The input of the operator are query, key and value tensor, with `BSNH` layout, namely `[batch size, sequence length, number of heads, dimension of heads]`. And the output shares the same layout with all input tensor.

* remove useless codes, remove attrs and add memoize

* add more dispatches

* nit and fix rebase

* fix linter

* add support for bias

* fix lint

* BNSS layout for bias

* update doc

* fix typo

* support bias broadcast

* [Unity][WEBGPU] Codegen improvements and WebRuntime (#14187)

This PR makes various improvements web codegen in relax web runtime.

Correct support of shift operators.
Update relax vm to make most use of internal allocators.
Update the webgpu API to the latest spec.

* [Unity][Transform] LiftTransformParams handling multiple functions (#14192)

Previously, the LiftTransformParams pass only works on function
`"main"`. This is a bit restrictive as in our recent practice on stable
diffusion, there are cases where multiple Relax functions inside an
IRModule all need to be transformed.

Therefore, this PR enhances the LiftTransformParams pass, so that it
will now transform **all** functions **with attribute `num_input`**. For
functions without this attribute, the pass will simply skip them.

* [Unity][Op] Group normalization (#14194)

* [TOPI] Group normalization

As more and more ML models nowadays contain the group normalization
computation, we find it beneficial to introduce this op to TOPI level.
It will enable us to optimize the group normalization operation as a
whole in a more convenient way.

This PR introduces the group normalization op to TOPI. The group norm
operation was introduced in https://arxiv.org/abs/1803.08494. The
implementation uses tuple reduction, same as the implementation of layer
norm. Implemented with tuple reduction, the corresponding generated TIR
function can be optimized by cross-thread reduction or rfactor through
MetaSchedule.

Prior to this PR, the group normalization operations in frontend models
are translated to a series of operations, which brings inconvenience
when we want to optimize the group norm op as a whole.

With the TOPI implementation of group norm being introduced by #14193,
we can now use it to legalize the high-level group norm op and optimize
it using cross-thread reduction or rfactor via MetaSchedule.


Co-authored-by: Bohan Hou <spectrometerh@gmail.com>

* [Unity][Op] Argmax and argmin (#14195)

This PR introduces full support to the argmax and argmin op to the unity
branch, including the structure info inference, the legalization, and
the translation from Torch FX.

* [Unity][Op] Legalize `round`, `floor`, `ceil`, `sign` (#14198)

This PR implements the legalization for four unary operators:
* round,
* floor,
* ceil,
* sign.

Unit tests are provided accordingly.

* [Unity][Frontend] FX translator supporting more ops (#14196)

This PR improves the torch FX translator in the following perspectives:
* support unary op `sigmoid` and `round`,
* support in-place `fill`, `triu` and `tril`,
* support `tensor`, `arange`, `empty`,
* support `bmm` (batch matrix multiplication),
* support `astype`,
* support `chunk` and `squeeze`.

This PR also fixes `Embedding`. Previously the translation assumes that
the input to Embedding will only be 1-dimensional, and will throw
exception when the input has more than one dimension (i.e., batched).
This PR brings the support.

* [Unity][Frontend] FX translator returning weights with `keep_params_as_input` (#14197)

PR #14067 introduces the flag `keep_params_as_input` to the FX
translator, in the purpose to handle to model weights outside of the
translated Relax function.

This PR takes a further step, by returning the model weights as
NDArrays when the flag `keep_params_as_input` is true. With this PR, the
translator now can return back the weights upon requested. Otherwise,
after the import we will lose the model weights in the given PyTorch
model.

* [Unity][Fix] FX translating dtype (#14201)

This PR fixes a bug of the current FX translator when dealing with
dtype.

Previously, the translator does not take the cases
```python
dtype = x.getattr("dtype")
```
into consideration. In this case, the dtype will be a fx.Node object,
while the translator assumes that the dtype is either a string or
a torch native datatype (e.g., torch.float32).

This PR fixes this by doing an environment table lookup before for all
dtypes.

* [Unity][TIR][Pass] ForceNarrowIndexToInt32 (#14203)

[TIR][Pass] ForceNarrowIndexToInt32

This PR introduces a pass which forces every index expression in a
PrimFunc to have dtype int32. Meanwhile, it also checks if all integer
buffers in the PrimFunc have int32 dtype, and report error if some
integer buffer has dtype other than int32.

In terms of implementation, this pass leverages the
IndexDataTypeNormalizer, with the target dtype being int32.

This PR contains a few basic tests that come from
`test_tir_transform_narrow_datatype.py`, and contains some negative
tests as well.

* [Unity][Frontend] FX translator support torch.baddbmm (#14202)

This PR brings the support of translating `torch.baddbmm` into
combination of operators (matmul, add, multiply). Unit tests
are provided accordingly.

This PR also fixes the kwarg fetching issue of `torch.interpolate`.

* [CI] Point cpu ci to dep with onnx (#40)

Point cpu ci to dep with onnx

* [Unity] Introduce Default GPU Schedule Pass (#14182)

* Implement default schedule.

* Add test.

* Add tests.

* Fix linting.

* Skip scheduled blocks.

* Address issues.

* Use target current.

* Minor fixes.

* Remove Mutator.

* Move pas…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants