
[TE] Optimized version of concatenation layer #11341

Merged: 21 commits merged into apache:main on Jun 1, 2022

Conversation

@shtinsa (Contributor) commented on May 17, 2022:

 1. Concat implemented using extern_op.
 2. New tests added.
 3. Workaround to allow inlining extern_ops with other layers.


@shtinsa (Contributor Author) commented on May 18, 2022:

Hello @masahi, could you please review this PR?

@masahi (Member) commented on May 18, 2022:

Can you describe the motivation for the new implementation (I'm fully aware, but for others), and share some perf data on real-world models?

cc @tkonolige @altanh

masahi self-assigned this on May 18, 2022
@tkonolige (Contributor) left a comment:

Thanks for this PR! Better concatenate performance would be great. And being able to fuse extern ops is even better!

I'm not sure I understand why/how the new inlining works. Could you explain (and put that explanation in a comment at the place where it happens)?

@shtinsa (Contributor Author) commented on May 19, 2022:

The motivation for this implementation comes from DLRM model performance measurements, where the concatenation layer takes about 20-25% of the time budget in single-threaded execution. The model concatenates 27 tensors, and the bottleneck is the source-tensor lookup: the current TE implementation needs O(N) operations to determine which input to copy from. The resulting TIR lookup expression has the following form (this is the inner part of the loop):
@tir.if_then_else((((((((((((((((((((((((((((ax1: int32 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0: int32, ((((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], 
@tir.if_then_else((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((ax1 - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((ax1 - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((ax1 - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((ax1 - 64) - 64) - 64) - 64)], @tir.if_then_else(((((ax1 - 64) - 64) - 64) >= 0), placeholder[ax0, (((ax1 - 64) - 64) - 64)], @tir.if_then_else((((ax1 - 64) - 64) >= 0), placeholder[ax0, ((ax1 - 64) - 64)], @tir.if_then_else(((ax1 - 64) >= 0), placeholder[ax0, (ax1 - 64)], placeholder[ax0, ax1], dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32)
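For readers less familiar with the approach, here is a rough sketch of the extern_op idea, reduced to the 1-D case. The names and structure are illustrative only and are not the PR's actual topi code (which handles the general case via gen_ir / gen_ir_1d); the point is that per-input offsets are computed while the IR is being built, so the generated TIR contains one plain copy loop per input instead of the O(N) if_then_else chain above.

import tvm
from tvm import te


def concat_1d_extern(inputs):
    """Concatenate 1-D tensors with te.extern + ir_builder (illustrative)."""
    sizes = [int(t.shape[0]) for t in inputs]  # static shapes assumed
    out_size = sum(sizes)

    def gen_ir(ins, outs):
        ib = tvm.tir.ir_builder.create()
        out = ib.buffer_ptr(outs[0])
        offset = 0
        for buf, n in zip(ins, sizes):
            src = ib.buffer_ptr(buf)
            with ib.for_range(0, n, name="j") as j:
                out[offset + j] = src[j]  # plain copy, no per-element search
            offset += n  # folded into the IR as a constant
        return ib.get()

    return te.extern(
        [(out_size,)],
        list(inputs),
        gen_ir,
        name="concatenate_ext",
        dtype=inputs[0].dtype,
    )


# hypothetical usage:
# a = te.placeholder((64,), name="a")
# b = te.placeholder((64,), name="b")
# c = concat_1d_extern([a, b])  # TIR: two copy loops, no if_then_else chain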

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch from b4853f5 to 2e3df18 on May 20, 2022
@tkonolige (Contributor) left a comment:

Looks good to me. But I am unsure whether the changes to inlining will cause any problems. Maybe @tqchen or @masahi can verify that it is OK.

@shtinsa (Contributor Author) commented on May 23, 2022:

I prepared an environment for performance testing and expect to provide latency/throughput curves for different configurations tomorrow.

[chart: res_DLRM 100G]

The chart above shows the performance improvement of concatenation for the DLRM model. The max throughput / min latency values are as follows:
Main branch: max throughput 4481.75 (i:95, t:1); min latency 1.15 (i:1, t:20)
New concat branch, concatenation inlined: max throughput 5471.46 (i:95, t:1); min latency 1.1 (i:1, t:17)
New concat branch, concatenation opaque: max throughput 5681.72 (i:92, t:1); min latency 1.0 (i:1, t:12)

I'm trying to identify the root cause of the performance drop for inlined concatenation, and it looks like the problem is connected with the reshape layer: in the inlined version this layer is not removed from the pipeline, which leads to the performance drop.

@masahi (Member) commented on May 24, 2022:

@shtinsa you need to fix the conflict.

strategy = _op.OpStrategy()
strategy.add_implementation(
    wrap_compute_concat(topi.concatenate),
    wrap_topi_schedule(topi.generic.schedule_extern),
Comment (Member):

I think this shouldn't be schedule_extern.

@shtinsa (Contributor Author) replied:

Should I use schedule_injective? But it works only with the llvm codegen.

Comment (Member):

Yes, it should be schedule_injective. Things in generic.py are only used by CPU targets (CUDA has its own strategies for concat etc.).

Comment (Member):

Yeah, I think you also need to define concatenate_strategy in strategy/cuda.py and use topi.cuda.schedule_injective there. That should allow using schedule_injective here.

In general, if you add a new strategy in generic.py, you should also update the GPU strategy in strategy/cuda.py.
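A rough sketch of that pairing, following TVM's strategy conventions (hedged: wrap_compute_concat is the helper this PR introduces, and the exact names and placement in the merged code may differ):

from tvm import topi
from tvm.target import override_native_generic_func
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import wrap_topi_schedule


# python/tvm/relay/op/strategy/generic.py: generic (CPU) strategy
@override_native_generic_func("concatenate_strategy")
def concatenate_strategy(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_concat(topi.concatenate),  # compute wrapper added by this PR
        wrap_topi_schedule(topi.generic.schedule_injective),
        name="concatenate.generic",
    )
    return strategy


# python/tvm/relay/op/strategy/cuda.py: GPU override of the same strategy
@concatenate_strategy.register(["cuda", "gpu"])
def concatenate_strategy_cuda(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_concat(topi.concatenate),
        wrap_topi_schedule(topi.cuda.schedule_injective),
        name="concatenate.cuda",
    )
    return strategy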

@shtinsa (Contributor Author) replied:

Done

 from ..utils import is_empty_shape


-def schedule_injective_from_existing(sch, out):
+def schedule_injective_from_existing_ref(sch, out):
Comment (Member):

This change and from tvm.topi.generic.injective import schedule_injective_from_existing above don't seem necessary.

@shtinsa (Contributor Author) replied:

Unfortunately, reverting the code affects the test_pass_partition_graph.py::test_multi_node_compiler test.

Comment (Member):

Why is that? The schedule_injective_from_existing that you import at L22 doesn't seem to be used, and the other change is just a rename.

@shtinsa (Contributor Author) replied:

I tried to remove it; the test fails with a crash:

File "/home/sshtin/Dev/tvm/python/tvm/topi/x86/__init__.py", line 30, in <module>
  from .reduction import *
File "/home/sshtin/Dev/tvm/python/tvm/topi/x86/reduction.py", line 21, in <module>
  from .injective import schedule_injective_from_existing
ImportError: Error importing plugin "tvm.testing.plugin": cannot import name 'schedule_injective_from_existing' from 'tvm.topi.x86.injective' (/home/sshtin/Dev/tvm/python/tvm/topi/x86/in

@masahi (Member) replied on May 30, 2022:

I don't know what that error means, but you shouldn't be making this change anyway. If the old schedule_injective_from_existing is used in other places, this change will make a different schedule_injective_from_existing (the one in tvm.topi.generic.injective) be called instead.

Why can't you use the old schedule_injective_from_existing at L150? And since tvm.topi.generic.injective.schedule_injective_from_existing is just a one-line schedule,

sch[out].fuse(*sch[out].op.axis)

you can copy it directly into your concat schedule to avoid the import and the name-conflict issue.
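A minimal sketch of what copying that one line into the concat schedule could look like (function name assumed for illustration; not the PR's final x86 schedule):

from tvm import te


def schedule_concatenate(outs):
    """Fuse the axes of compute outputs; extern stages are left as-is."""
    outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
    sch = te.create_schedule([x.op for x in outs])
    for out in outs:
        if isinstance(out.op, te.ComputeOp):
            # the one-line injective schedule, inlined to avoid the import
            sch[out].fuse(*sch[out].op.axis)
    return sch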

@shtinsa (Contributor Author) replied:

Resolved

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch 5 times, most recently from 8cf733a to 281cf28, on May 25, 2022
@shtinsa (Contributor Author) commented on May 25, 2022:

> @shtinsa need to fix the conflict.

Resolved

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch 3 times, most recently from a70e187 to 85c0c06, on May 27, 2022
@masahi (Member) commented on May 30, 2022:

@shtinsa If the opaque one is faster, can we make your new concat always opaque and avoid the hack in schedule_dataflow_rewrite.cc?

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch from 70052c2 to 1bc3b03 on May 30, 2022
Sergey Shtin added 8 commits on May 30, 2022
shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch from 0898dbf to 37250d3 on May 30, 2022
@shtinsa (Contributor Author) commented on May 30, 2022:

> If the opaque one is faster, can we make your new concat always opaque and avoid the hack in schedule_dataflow_rewrite.cc?

Formally yes, but it would be necessary to disable some tests in python/relay/test_pass_fuse_ops.py which check the ability to fuse the concatenate layer.

@masahi (Member) left a comment:

Please improve the writing of comments in schedule_dataflow_rewrite.cc.

@@ -60,14 +62,12 @@ def schedule_injective_from_existing(sch, out):


 def schedule_injective(outs):
-    """X86 schedule for injective op.
-
+    """X86 reference schedule for injective op.
Comment (Member):

Remove this diff

Parameters
----------
outs: Array of Tensor
The computation graph description of injective in the format
Comment (Member):

not injective

"device": 1,
"io_size_bytes": 4800,
"workspace_size_bytes": 1248,
}
Comment (Member):

Why does this change affect workspace_size_bytes?

@shtinsa (Contributor Author) replied:

The concat code has 2 arrays, and accessing these arrays requires 8 bytes per entry on x64 and 4 bytes on 32-bit systems. That should be the reason for the difference. Should I recheck it?
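For illustration only (the input count below is hypothetical), the scaling implied by that reasoning looks like this:

# two index tables (sizes and cumulative offsets), one entry per concatenated input
num_inputs = 3
extra_ws_64bit = 2 * num_inputs * 8  # 8-byte entries on a 64-bit target -> 48 bytes
extra_ws_32bit = 2 * num_inputs * 4  # 4-byte entries on a 32-bit target -> 24 bytes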

@@ -525,8 +548,13 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
    for (auto iv : compute->axis) {
      args.push_back(iv->var);
    }
    if (ext_ops.find(stage->op) != ext_ops.end()) {
      // sshtin: The extern op can try to get access to the input tensors as a row data,
      // that can lead to error in TE scripts.
Comment (Member):

What do you mean by TE scripts?

Comment (Member):

Maybe here you mean "IR builder" specifically?

@@ -511,6 +511,29 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
  std::vector<bool> changed(sch->stages.size(), false);
  std::vector<Stmt> new_hybrid_body(sch->stages.size());
  std::vector<bool> hybrid_changed(sch->stages.size(), false);
  // (sshtin): this workaround allows to inline extern ops.
  // All inputs for extern op should not be inlined because inlining happens
  // before generation of TE script for particular extern op. That may lead to
Comment (Member):

What do you mean by TE scripts?

@shtinsa (Contributor Author) replied:

These are the gen_ir() or gen_ir_1d() scripts. The invocation of these scripts may happen after the inlining pass.

Comment (Member):

How about simply dropping "script"? I don't think "TE script" is a common term, and "generation of TE" sounds OK.

// before generation of TE script for particular extern op. That may lead to
// crash during lowering or building stages.
// The problem description:
// In case of operations fuzing arguments inlining
Comment (Member):

fusing

// The problem description:
// In case of operations fuzing arguments inlining
// prevents creation of ProducerNode for extern operation.
// Instead of the creation it supposed to use operation argument as inlined buffer
Comment (Member):

it is supposed to

@@ -511,6 +511,29 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
  std::vector<bool> changed(sch->stages.size(), false);
  std::vector<Stmt> new_hybrid_body(sch->stages.size());
  std::vector<bool> hybrid_changed(sch->stages.size(), false);
  // (sshtin): this workaround allows to inline extern ops.
Comment (Member):

Is this referring to "inline into extern ops", or "inline extern ops into their consumer"?

@@ -525,8 +548,13 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
    for (auto iv : compute->axis) {
      args.push_back(iv->var);
    }
    if (ext_ops.find(stage->op) != ext_ops.end()) {
      // sshtin: The extern op can try to get access to the input tensors as a row data,
Comment (Member):

Sorry, one more comment: do you mean "raw data" here, or what is "row data" otherwise?

@shtinsa (Contributor Author) replied:

Oh yes :) it looks like a copy-paste phantom. Fixed.

@masahi (Member) commented on Jun 1, 2022:

@shtinsa Need to run another job.

masahi merged commit e84f163 into apache:main on Jun 1, 2022
@altanh (Contributor) commented on Jun 17, 2022:

> [quoting the DLRM performance results and reshape-layer analysis from the May 23 comment above]

Is the performance testing script available? I think this concat change might have caused some regressions on some vision models, so I just wanted to see if I can replicate the results locally.

@altanh (Contributor) commented on Jun 17, 2022:

[chart: concat_A_B]

I extracted some concat workloads from a few vision models we test and ran them locally on a 5900X, and I got these results. Blue ("A") is the commit before this PR, and orange ("B") is this PR. Perhaps the sizes are too small to benefit from the handwritten kernel?

workloads: https://gist.github.com/altanh/bccac6bf04393bbdae0f8440c59567e6

@shtinsa (Contributor Author) commented on Jun 20, 2022:

There can be several reasons for the issue (for example, the copied data block not being aligned to the SIMD width), so I'll check the performance of the kernels from the json.

dtype = data[0].dtype
out_shape = data[0].shape[:axis] + [join_size] + data[0].shape[axis + 1 :]
in_outers_tensor = const_vector(in_outers)
in_cumsum_tensor = const_vector(in_outers_cumsum, name="cumsum")
@DzAvril (Contributor) commented:

Hi shtinsa, why make in_outers_tensor and in_cumsum_tensor te.tensor.Tensors here? The const_vector function brings select into the lowered TIR. In my test, I kept them as lists of ints and passed them to the callback function; the select was gone, and it was faster than with te.tensor.Tensor.
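(A small, hedged illustration of where the select comes from, assuming the const_vector helper from tvm.topi.utils that the PR uses: its compute body is roughly a nested Select over the index, and that expression survives into the lowered TIR when the offsets are read from a tensor instead of being folded in as Python ints.)

import tvm
from tvm import te
from tvm.topi.utils import const_vector

# cumulative offsets as a small constant tensor, as in the PR's concat compute
cumsum = const_vector([0, 4], name="cumsum")

s = te.create_schedule(cumsum.op)
print(tvm.lower(s, [cumsum], simple_mode=True))
# The body is roughly cumsum[i] = select(i == 1, 4, select(i == 0, 0, 0)),
# matching the const_vector_let / cumsum_let loops in the C code shown later
# in this thread.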

@shtinsa (Contributor Author) replied:

Hello @DzAvril, I analyzed the compiled .so files and the disassembled code; the code block for one concatenation looks like this:

  v384 = v311 & 0xFFFFFFFFFFFFFFF0LL;
     _RSI = v304 + 4 * (v276 + v305);
     _RDX = 0LL;
     do
     {
       __asm
       {
         vmovups ymm0, ymmword ptr [rax+rdx*4-20h]
         vmovups ymm1, ymmword ptr [rax+rdx*4]
         vmovups ymmword ptr [rsi+rdx*4-20h], ymm0
         vmovups ymmword ptr [rsi+rdx*4], ymm1
       }
       _RDX += 16LL;
     }
     while ( v384 != _RDX );
     _RSI = v311 & 0xFFFFFFFFFFFFFFF0LL;
     if ( v311 != v384 )
       goto LABEL_209;
LABEL_211:
     if ( v310 <= 0 )
       goto LABEL_219;
     v285 = v306[25];
     if ( (unsigned __int64)v310 < 0x10 )
     {
       _RSI = 0LL;
LABEL_217:
       _RDX = v302 + 4 * v285;
       do
       {
         _RCX = v356;
         __asm
         {
           vmovss  xmm0, dword ptr [rcx+rsi*4]
           vmovss  dword ptr [rdx+rsi*4], xmm0
         }
         ++_RSI;
       }
       while ( v310 != _RSI );
       goto LABEL_219;
     }

So formally I would add some unrolling to the copy loop and remove the tail-loop handling for data blocks whose size is a multiple of the SIMD width. But that is a very small improvement which should be implemented on the codegen side. Anyway, I'm going to check the performance of your proposals.

Comment (Contributor):

I can confirm this. We are currently working on a PR to change the behavior here.
Just as a reference, here is a comparison of the resulting C code with a plain list of ints:

 for (int32_t j = 0; j < 4; ++j) {
    concatenate_ext[j] = placeholder[j];
  }
  for (int32_t j1 = 0; j1 < 4; ++j1) {
    concatenate_ext[(j1 + 4)] = placeholder1[j1];
  }
  return 0;

and with te.tensor.Tensor:

  void* const_vector_let = (&(global_workspace_6_var[64]));
  void* cumsum_let = (&(global_workspace_6_var[48]));
  for (int32_t i = 0; i < 2; ++i) {
    ((int64_t*)const_vector_let)[i] = ((i == 1) ? (int64_t)4 : ((i == 0) ? (int64_t)4 : (int64_t)0));
  }
  for (int32_t i1 = 0; i1 < 2; ++i1) {
    ((int64_t*)cumsum_let)[i1] = ((i1 == 1) ? (int64_t)4 : (int64_t)0);
  }

  int64_t cumsum_let[2] = {0, 4};

  for (int64_t j = 0; j < ((int64_t*)const_vector_let)[0]; ++j) {
    concatenate_ext[(((int64_t*)cumsum_let)[0] + j)] = placeholder[j];
  }
  for (int64_t j1 = 0; j1 < ((int64_t*)const_vector_let)[1]; ++j1) {
    concatenate_ext[(((int64_t*)cumsum_let)[1] + j1)] = placeholder1[j1];
  }
  return 0;

@UlrikHjort-Bosch @vdkhoi @MichaelJKlaiber

@shtinsa (Contributor Author) replied on Jun 21, 2022:

I see. That C code looks better, but I tested the "llvm" target, so there may be a difference in the output. At the same time, I should note that the select operator is only used for filling up the index tables, and this code can be excluded from the execution pipeline in the case of static shapes, i.e. these tensors can be implemented as const buffers pre-allocated in the data section. For dynamic shapes, however, this improvement may have an effect, especially for small data blocks.

Comment (Contributor):

How about I implement the other version and we discuss what is best for all purposes then?

@shtinsa (Contributor Author) replied:

I added a comment to #11800 (comment).

@DzAvril (Contributor) commented on Jun 23, 2022:

Hi @shtinsa, the idea of pre-computing the coordinate mapping between the output and the inputs is great. But it is a pity that the extern_op cannot be inlined into the fused subgraph, which brings extra load and store effort. My question is: why did you implement this with IRBuilder instead of TE?

@shtinsa (Contributor Author) commented on Jun 23, 2022:

Hello @DzAvril. The problem is related to splitting the TE expression into sub-areas: with f(x1, ..., xk) as the output, it is a bit problematic to determine the particular input tensor from the x1, ..., xk coordinates. Using a hashing function over the axis coordinates may require more intermediate memory.
Frankly speaking, the cost of concatenation should be zero in many cases, but it would take a lot of effort to implement that solution because it would be necessary to update the whole SW stack within TVM.
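A hedged illustration of that difficulty: expressed purely in TE, the concat compute has to pick its source with an expression over the output coordinates, which is exactly the nested if_then_else chain shown at the top of this thread; the IRBuilder version sidesteps this by emitting one copy loop per input while the IR is built.

import tvm
from tvm import te

a = te.placeholder((8, 64), name="a")
b = te.placeholder((8, 64), name="b")

# pure-TE concat along axis 1: the source choice is a per-element expression
out = te.compute(
    (8, 128),
    lambda i, j: tvm.tir.if_then_else(j >= 64, b[i, j - 64], a[i, j]),
    name="concat_te",
)
# With N inputs this nests N-1 if_then_else branches per output element,
# whereas the extern_op version walks the inputs at IR-build time instead.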
