
[TE] Optimized version of concatenation layer #11341

Merged: 21 commits merged into apache:main on Jun 1, 2022

Conversation

@shtinsa (Contributor) commented on May 17, 2022:

 1. Concat implemented using extern_op.
 2. New tests added.
 3. Workaround to allow inlining extern_ops with other layers.


@shtinsa (Contributor Author) commented on May 18, 2022:

Hello @masahi, could you please review this PR?

@masahi (Member) commented on May 18, 2022:

Can you describe the motivation for the new implementation (I'm fully aware, but for others), and share some perf data on real-world models?

cc @tkonolige @altanh

masahi self-assigned this on May 18, 2022
@tkonolige (Contributor) left a comment:

Thanks for this PR! Better concatenate performance would be great. And being able to fuse extern ops is even better!

I'm not sure I understand why/how the new inlining works. Could you explain (and put that explanation in a comment at the place where it happens)?

@shtinsa (Contributor Author) commented on May 19, 2022:

The motivation for this implementation comes from DLRM model performance measurements, where the concatenation layer takes about 20-25% of the time budget in single-threaded execution. The model concatenates 27 tensors, and the bottleneck is the source-tensor lookup: the current TE implementation needs O(N) operations to determine which input to copy from. The resulting TIR lookup expression has the following form (this is the inner part of the loop):
@tir.if_then_else((((((((((((((((((((((((((((ax1: int32 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0: int32, ((((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], 
@tir.if_then_else((((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((((ax1 - 64) - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else(((((((ax1 - 64) - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, (((((ax1 - 64) - 64) - 64) - 64) - 64)], @tir.if_then_else((((((ax1 - 64) - 64) - 64) - 64) >= 0), placeholder[ax0, ((((ax1 - 64) - 64) - 64) - 64)], @tir.if_then_else(((((ax1 - 64) - 64) - 64) >= 0), placeholder[ax0, (((ax1 - 64) - 64) - 64)], @tir.if_then_else((((ax1 - 64) - 64) >= 0), placeholder[ax0, ((ax1 - 64) - 64)], @tir.if_then_else(((ax1 - 64) >= 0), placeholder[ax0, (ax1 - 64)], placeholder[ax0, ax1], dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32), dtype=float32)
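For readers less familiar with the approach, here is a rough sketch of the extern_op idea, reduced to the 1-D case. The names and structure are illustrative only and are not the PR's actual topi code (which handles the general case via gen_ir / gen_ir_1d); the point is that per-input offsets are computed while the IR is being built, so the generated TIR contains one plain copy loop per input instead of the O(N) if_then_else chain above.

import tvm
from tvm import te


def concat_1d_extern(inputs):
    """Concatenate 1-D tensors with te.extern + ir_builder (illustrative)."""
    sizes = [int(t.shape[0]) for t in inputs]  # static shapes assumed
    out_size = sum(sizes)

    def gen_ir(ins, outs):
        ib = tvm.tir.ir_builder.create()
        out = ib.buffer_ptr(outs[0])
        offset = 0
        for buf, n in zip(ins, sizes):
            src = ib.buffer_ptr(buf)
            with ib.for_range(0, n, name="j") as j:
                out[offset + j] = src[j]  # plain copy, no per-element search
            offset += n  # folded into the IR as a constant
        return ib.get()

    return te.extern(
        [(out_size,)],
        list(inputs),
        gen_ir,
        name="concatenate_ext",
        dtype=inputs[0].dtype,
    )


# hypothetical usage:
# a = te.placeholder((64,), name="a")
# b = te.placeholder((64,), name="b")
# c = concat_1d_extern([a, b])  # TIR: two copy loops, no if_then_else chain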

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch from b4853f5 to 2e3df18 on May 20, 2022
@tkonolige (Contributor) left a comment:

Looks good to me. But I am unsure whether the changes to inlining will cause any problems. Maybe @tqchen or @masahi can verify that it is OK.

@shtinsa (Contributor Author) commented on May 23, 2022:

I prepared an environment for performance testing and expect to provide latency/throughput curves for different configurations tomorrow.

[chart: res_DLRM 100G]

The chart above shows the performance improvement of concatenation for the DLRM model. The max throughput / min latency values are as follows:
Main branch: max throughput 4481.75 (i:95, t:1); min latency 1.15 (i:1, t:20)
New concat branch, concatenation inlined: max throughput 5471.46 (i:95, t:1); min latency 1.1 (i:1, t:17)
New concat branch, concatenation opaque: max throughput 5681.72 (i:92, t:1); min latency 1.0 (i:1, t:12)

I'm trying to identify the root cause of the performance drop for inlined concatenation, and it looks like the problem is connected with the reshape layer: in the inlined version this layer is not removed from the pipeline, which leads to the performance drop.

@masahi (Member) commented on May 24, 2022:

@shtinsa you need to fix the conflict.

strategy = _op.OpStrategy()
strategy.add_implementation(
    wrap_compute_concat(topi.concatenate),
    wrap_topi_schedule(topi.generic.schedule_extern),
Comment (Member):

I think this shouldn't be schedule_extern.

@shtinsa (Contributor Author) replied:

Should I use schedule_injective? But it works only with the llvm codegen.

Comment (Member):

Yes, it should be schedule_injective. Things in generic.py are only used by CPU targets (CUDA has its own strategies for concat etc.).

Comment (Member):

Yeah, I think you also need to define concatenate_strategy in strategy/cuda.py and use topi.cuda.schedule_injective there. That should allow using schedule_injective here.

In general, if you add a new strategy in generic.py, you should also update the GPU strategy in strategy/cuda.py.
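A rough sketch of that pairing, following TVM's strategy conventions (hedged: wrap_compute_concat is the helper this PR introduces, and the exact names and placement in the merged code may differ):

from tvm import topi
from tvm.target import override_native_generic_func
from tvm.relay.op import op as _op
from tvm.relay.op.strategy.generic import wrap_topi_schedule


# python/tvm/relay/op/strategy/generic.py: generic (CPU) strategy
@override_native_generic_func("concatenate_strategy")
def concatenate_strategy(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_concat(topi.concatenate),  # compute wrapper added by this PR
        wrap_topi_schedule(topi.generic.schedule_injective),
        name="concatenate.generic",
    )
    return strategy


# python/tvm/relay/op/strategy/cuda.py: GPU override of the same strategy
@concatenate_strategy.register(["cuda", "gpu"])
def concatenate_strategy_cuda(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_concat(topi.concatenate),
        wrap_topi_schedule(topi.cuda.schedule_injective),
        name="concatenate.cuda",
    )
    return strategy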

@shtinsa (Contributor Author) replied:

Done

 from ..utils import is_empty_shape


-def schedule_injective_from_existing(sch, out):
+def schedule_injective_from_existing_ref(sch, out):
Comment (Member):

This change and from tvm.topi.generic.injective import schedule_injective_from_existing above don't seem necessary.

@shtinsa (Contributor Author) replied:

Unfortunately, reverting the code affects the test_pass_partition_graph.py::test_multi_node_compiler test.

Comment (Member):

Why is that? The schedule_injective_from_existing that you import at L22 doesn't seem to be used, and the other change is just a rename.

@shtinsa (Contributor Author) replied:

I tried to remove it; the test fails with a crash:

File "/home/sshtin/Dev/tvm/python/tvm/topi/x86/__init__.py", line 30, in <module>
  from .reduction import *
File "/home/sshtin/Dev/tvm/python/tvm/topi/x86/reduction.py", line 21, in <module>
  from .injective import schedule_injective_from_existing
ImportError: Error importing plugin "tvm.testing.plugin": cannot import name 'schedule_injective_from_existing' from 'tvm.topi.x86.injective' (/home/sshtin/Dev/tvm/python/tvm/topi/x86/in

@masahi (Member) replied on May 30, 2022:

I don't know what that error means, but you shouldn't be making this change anyway. If the old schedule_injective_from_existing is used in other places, this change will make a different schedule_injective_from_existing (the one in tvm.topi.generic.injective) be called instead.

Why can't you use the old schedule_injective_from_existing at L150? And since tvm.topi.generic.injective.schedule_injective_from_existing is just a one-line schedule,

sch[out].fuse(*sch[out].op.axis)

you can copy it directly into your concat schedule to avoid the import and the name-conflict issue.
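A minimal sketch of what copying that one line into the concat schedule could look like (function name assumed for illustration; not the PR's final x86 schedule):

from tvm import te


def schedule_concatenate(outs):
    """Fuse the axes of compute outputs; extern stages are left as-is."""
    outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
    sch = te.create_schedule([x.op for x in outs])
    for out in outs:
        if isinstance(out.op, te.ComputeOp):
            # the one-line injective schedule, inlined to avoid the import
            sch[out].fuse(*sch[out].op.axis)
    return sch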

@shtinsa (Contributor Author) replied:

Resolved

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch 5 times, most recently from 8cf733a to 281cf28, on May 25, 2022
@shtinsa (Contributor Author) commented on May 25, 2022:

> @shtinsa need to fix the conflict.

Resolved

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch 3 times, most recently from a70e187 to 85c0c06, on May 27, 2022
@masahi (Member) commented on May 30, 2022:

@shtinsa If the opaque one is faster, can we make your new concat always opaque and avoid the hack in schedule_dataflow_rewrite.cc?

shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch from 70052c2 to 1bc3b03 on May 30, 2022
Sergey Shtin added 8 commits on May 30, 2022
shtinsa force-pushed the sshtin/concat_optimization_for_DLRM branch from 0898dbf to 37250d3 on May 30, 2022
@shtinsa (Contributor Author) commented on May 30, 2022:

> If the opaque one is faster, can we make your new concat always opaque and avoid the hack in schedule_dataflow_rewrite.cc?

Formally yes, but it would be necessary to disable some tests in python/relay/test_pass_fuse_ops.py which check the ability to fuse the concatenate layer.

@masahi (Member) left a comment:

Please improve the writing of comments in schedule_dataflow_rewrite.cc.

@@ -60,14 +62,12 @@ def schedule_injective_from_existing(sch, out):


 def schedule_injective(outs):
-    """X86 schedule for injective op.
-
+    """X86 reference schedule for injective op.
Comment (Member):

Remove this diff

Parameters
----------
outs: Array of Tensor
The computation graph description of injective in the format
Comment (Member):

not injective

"device": 1,
"io_size_bytes": 4800,
"workspace_size_bytes": 1248,
}
Comment (Member):

Why does this change affect workspace_size_bytes?

@shtinsa (Contributor Author) replied:

The concat code has 2 arrays, and accessing these arrays requires 8 bytes per entry on x64 and 4 bytes on 32-bit systems. That should be the reason for the difference. Should I recheck it?
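For illustration only (the input count below is hypothetical), the scaling implied by that reasoning looks like this:

# two index tables (sizes and cumulative offsets), one entry per concatenated input
num_inputs = 3
extra_ws_64bit = 2 * num_inputs * 8  # 8-byte entries on a 64-bit target -> 48 bytes
extra_ws_32bit = 2 * num_inputs * 4  # 4-byte entries on a 32-bit target -> 24 bytes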

@@ -525,8 +548,13 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
    for (auto iv : compute->axis) {
      args.push_back(iv->var);
    }
    if (ext_ops.find(stage->op) != ext_ops.end()) {
      // sshtin: The extern op can try to get access to the input tensors as a row data,
      // that can lead to error in TE scripts.
Comment (Member):

What do you mean by TE scripts?

Comment (Member):

Maybe here you mean "IR builder" specifically?

@@ -511,6 +511,29 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
  std::vector<bool> changed(sch->stages.size(), false);
  std::vector<Stmt> new_hybrid_body(sch->stages.size());
  std::vector<bool> hybrid_changed(sch->stages.size(), false);
  // (sshtin): this workaround allows to inline extern ops.
  // All inputs for extern op should not be inlined because inlining happens
  // before generation of TE script for particular extern op. That may lead to
Comment (Member):

What do you mean by TE scripts?

@shtinsa (Contributor Author) replied:

These are the gen_ir() or gen_ir_1d() scripts. The invocation of these scripts may happen after the inlining pass.

Comment (Member):

How about simply dropping "script"? I don't think "TE script" is a common term, and "generation of TE" sounds OK.

// before generation of TE script for particular extern op. That may lead to
// crash during lowering or building stages.
// The problem description:
// In case of operations fuzing arguments inlining
Comment (Member):

fusing

// The problem description:
// In case of operations fuzing arguments inlining
// prevents creation of ProducerNode for extern operation.
// Instead of the creation it supposed to use operation argument as inlined buffer
Comment (Member):

it is supposed to

@@ -511,6 +511,29 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
  std::vector<bool> changed(sch->stages.size(), false);
  std::vector<Stmt> new_hybrid_body(sch->stages.size());
  std::vector<bool> hybrid_changed(sch->stages.size(), false);
  // (sshtin): this workaround allows to inline extern ops.
Comment (Member):

Is this referring to "inline into extern ops", or "inline extern ops into their consumer"?

@@ -525,8 +548,13 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
    for (auto iv : compute->axis) {
      args.push_back(iv->var);
    }
    if (ext_ops.find(stage->op) != ext_ops.end()) {
      // sshtin: The extern op can try to get access to the input tensors as a row data,
Comment (Member):

Sorry, one more comment: do you mean "raw data" here, or what is "row data" otherwise?

@shtinsa (Contributor Author) replied:

Oh yes :) it looks like a copy-paste phantom. Fixed.

@masahi (Member) commented on Jun 1, 2022:

@shtinsa Need to run another job.

masahi merged commit e84f163 into apache:main on Jun 1, 2022
@altanh (Contributor) commented on Jun 17, 2022:

> [quoting the DLRM performance results and reshape-layer analysis from the May 23 comment above]

Is the performance testing script available? I think this concat change might have caused some regressions on some vision models, so I just wanted to see if I can replicate the results locally.

@altanh (Contributor) commented on Jun 17, 2022:

[chart: concat_A_B]

I extracted some concat workloads from a few vision models we test and ran them locally on a 5900X, and I got these results. Blue ("A") is the commit before this PR, and orange ("B") is this PR. Perhaps the sizes are too small to benefit from the handwritten kernel?

workloads: https://gist.github.com/altanh/bccac6bf04393bbdae0f8440c59567e6

@shtinsa (Contributor Author) commented on Jun 20, 2022:

There can be several reasons for the issue (for example, the copied data block not being aligned to the SIMD width), so I'll check the performance of the kernels from the json.

dtype = data[0].dtype
out_shape = data[0].shape[:axis] + [join_size] + data[0].shape[axis + 1 :]
in_outers_tensor = const_vector(in_outers)
in_cumsum_tensor = const_vector(in_outers_cumsum, name="cumsum")
@DzAvril (Contributor) commented:

Hi shtinsa, why make in_outers_tensor and in_cumsum_tensor te.tensor.Tensors here? The const_vector function brings select into the lowered TIR. In my test, I kept them as lists of ints and passed them to the callback function; the select was gone, and it was faster than with te.tensor.Tensor.
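(A small, hedged illustration of where the select comes from, assuming the const_vector helper from tvm.topi.utils that the PR uses: its compute body is roughly a nested Select over the index, and that expression survives into the lowered TIR when the offsets are read from a tensor instead of being folded in as Python ints.)

import tvm
from tvm import te
from tvm.topi.utils import const_vector

# cumulative offsets as a small constant tensor, as in the PR's concat compute
cumsum = const_vector([0, 4], name="cumsum")

s = te.create_schedule(cumsum.op)
print(tvm.lower(s, [cumsum], simple_mode=True))
# The body is roughly cumsum[i] = select(i == 1, 4, select(i == 0, 0, 0)),
# matching the const_vector_let / cumsum_let loops in the C code shown later
# in this thread.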

@shtinsa (Contributor Author) replied:

Hello @DzAvril, I analyzed the compiled .so files and the disassembled code; the code block for one concatenation looks like this:

  v384 = v311 & 0xFFFFFFFFFFFFFFF0LL;
     _RSI = v304 + 4 * (v276 + v305);
     _RDX = 0LL;
     do
     {
       __asm
       {
         vmovups ymm0, ymmword ptr [rax+rdx*4-20h]
         vmovups ymm1, ymmword ptr [rax+rdx*4]
         vmovups ymmword ptr [rsi+rdx*4-20h], ymm0
         vmovups ymmword ptr [rsi+rdx*4], ymm1
       }
       _RDX += 16LL;
     }
     while ( v384 != _RDX );
     _RSI = v311 & 0xFFFFFFFFFFFFFFF0LL;
     if ( v311 != v384 )
       goto LABEL_209;
LABEL_211:
     if ( v310 <= 0 )
       goto LABEL_219;
     v285 = v306[25];
     if ( (unsigned __int64)v310 < 0x10 )
     {
       _RSI = 0LL;
LABEL_217:
       _RDX = v302 + 4 * v285;
       do
       {
         _RCX = v356;
         __asm
         {
           vmovss  xmm0, dword ptr [rcx+rsi*4]
           vmovss  dword ptr [rdx+rsi*4], xmm0
         }
         ++_RSI;
       }
       while ( v310 != _RSI );
       goto LABEL_219;
     }

So formally I would add some unrolling to the copy loop and remove the tail-loop handling for data blocks whose size is a multiple of the SIMD width. But that is a very small improvement which should be implemented on the codegen side. Anyway, I'm going to check the performance of your proposals.

Comment (Contributor):

I can confirm this. We are currently working on a PR to change the behavior here.
Just as a reference, here is a comparison of the resulting C code with a plain list of ints:

 for (int32_t j = 0; j < 4; ++j) {
    concatenate_ext[j] = placeholder[j];
  }
  for (int32_t j1 = 0; j1 < 4; ++j1) {
    concatenate_ext[(j1 + 4)] = placeholder1[j1];
  }
  return 0;

and with te.tensor.Tensor:

  void* const_vector_let = (&(global_workspace_6_var[64]));
  void* cumsum_let = (&(global_workspace_6_var[48]));
  for (int32_t i = 0; i < 2; ++i) {
    ((int64_t*)const_vector_let)[i] = ((i == 1) ? (int64_t)4 : ((i == 0) ? (int64_t)4 : (int64_t)0));
  }
  for (int32_t i1 = 0; i1 < 2; ++i1) {
    ((int64_t*)cumsum_let)[i1] = ((i1 == 1) ? (int64_t)4 : (int64_t)0);
  }

  int64_t cumsum_let[2] = {0, 4};

  for (int64_t j = 0; j < ((int64_t*)const_vector_let)[0]; ++j) {
    concatenate_ext[(((int64_t*)cumsum_let)[0] + j)] = placeholder[j];
  }
  for (int64_t j1 = 0; j1 < ((int64_t*)const_vector_let)[1]; ++j1) {
    concatenate_ext[(((int64_t*)cumsum_let)[1] + j1)] = placeholder1[j1];
  }
  return 0;

@UlrikHjort-Bosch @vdkhoi @MichaelJKlaiber

@shtinsa (Contributor Author) replied on Jun 21, 2022:

I see. That C code looks better, but I tested the "llvm" target, so there may be a difference in the output. At the same time, I should note that the select operator is only used for filling up the index tables, and this code can be excluded from the execution pipeline in the case of static shapes, i.e. these tensors can be implemented as const buffers pre-allocated in the data section. For dynamic shapes, however, this improvement may have an effect, especially for small data blocks.

Comment (Contributor):

How about I implement the other version and we discuss what is best for all purposes then?

@shtinsa (Contributor Author) replied:

I added a comment to #11800 (comment).

@DzAvril (Contributor) commented on Jun 23, 2022:

Hi @shtinsa, the idea of pre-computing the coordinate mapping between the output and the inputs is great. But it is a pity that the extern_op cannot be inlined into the fused subgraph, which brings extra load and store effort. My question is: why did you implement this with IRBuilder instead of TE?

@shtinsa (Contributor Author) commented on Jun 23, 2022:

Hello @DzAvril. The problem is related to splitting the TE expression into sub-areas: with f(x1, ..., xk) as the output, it is a bit problematic to determine the particular input tensor from the x1, ..., xk coordinates. Using a hashing function over the axis coordinates may require more intermediate memory.
Frankly speaking, the cost of concatenation should be zero in many cases, but it would take a lot of effort to implement that solution because it would be necessary to update the whole SW stack within TVM.
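A hedged illustration of that difficulty: expressed purely in TE, the concat compute has to pick its source with an expression over the output coordinates, which is exactly the nested if_then_else chain shown at the top of this thread; the IRBuilder version sidesteps this by emitting one copy loop per input while the IR is built.

import tvm
from tvm import te

a = te.placeholder((8, 64), name="a")
b = te.placeholder((8, 64), name="b")

# pure-TE concat along axis 1: the source choice is a per-element expression
out = te.compute(
    (8, 128),
    lambda i, j: tvm.tir.if_then_else(j >= 64, b[i, j - 64], a[i, j]),
    name="concat_te",
)
# With N inputs this nests N-1 if_then_else branches per output element,
# whereas the extern_op version walks the inputs at IR-build time instead.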
