[TE] Optimized version of concatenation layer #11341
Conversation
Hello @masahi, could you please review this PR?
Can you describe the motivation for the new implementation (I'm fully aware of it, but others may not be), and share some perf data on real-world models?
Thanks for this PR! Better concatenate performance would be great. And being able to fuse extern ops is even better!
I'm not sure I understand why/how the new inlining works. Could you explain (and put that explanation in a comment where it happens)?
The reason for this implementation is based on DLRM model performance measurements, where the concatenation layer takes about 20-25% of the time budget for single-threaded execution. The problem is connected with the 27 objects being concatenated. The bottleneck is the source tensor search: the current TE implementation requires O(N) operations to determine the source tensor for each copied element, because the lowered TIR re-derives the source inside the copy loop through a chain of conditionals.
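A minimal sketch (illustrative, not the PR's actual TIR) of the conditional chain that the default compute-based concatenate produces, here for three inputs; the real model concatenates 27 tensors, so the chain is correspondingly longer:

import tvm
from tvm import te

a = te.placeholder((4,), name="A")
b = te.placeholder((4,), name="B")
c = te.placeholder((4,), name="C")

# Every output element re-discovers its source tensor through a chain of
# conditionals, i.e. O(N) tests per element for N inputs.
concat = te.compute(
    (12,),
    lambda i: tvm.tir.if_then_else(
        i < 4,
        a[i],
        tvm.tir.if_then_else(i < 8, b[i - 4], c[i - 8]),
    ),
    name="concatenate",
)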
@shtinsa, you need to fix the conflict.
strategy = _op.OpStrategy()
strategy.add_implementation(
    wrap_compute_concat(topi.concatenate),
    wrap_topi_schedule(topi.generic.schedule_extern),
I think this shouldn't be schedule_extern.
Should I use schedule_injective? But it works only with the llvm codegen.
Yes, it should be schedule_injective. Things in generic.py are only used by CPU targets (cuda has its own strategies for concat etc.).
I tried, the result is here: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-11341/26/pipeline/456
The 25th build was OK.
Yeah, I think you also need to define concatenate_strategy in strategy/cuda.py, and use topi.cuda.schedule_injective there. That should allow using schedule_injective here.
In general, if you add a new strategy in generic.py, you should also update the gpu strategy in strategy/cuda.py as well.
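A sketch of what that registration might look like in python/tvm/relay/op/strategy/cuda.py, mirroring the generic strategy quoted above; concatenate_strategy and wrap_compute_concat are the helpers introduced by this PR, and _op, topi, and wrap_topi_schedule are already imported in that module:

@concatenate_strategy.register(["cuda", "gpu"])
def concatenate_strategy_cuda(attrs, inputs, out_type, target):
    """concatenate cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_concat(topi.concatenate),
        # use the cuda injective schedule instead of the generic extern one
        wrap_topi_schedule(topi.cuda.schedule_injective),
        name="concatenate.cuda",
    )
    return strategy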
Done
python/tvm/topi/x86/injective.py (outdated)
from ..utils import is_empty_shape

-def schedule_injective_from_existing(sch, out):
+def schedule_injective_from_existing_ref(sch, out):
This change and the from tvm.topi.generic.injective import schedule_injective_from_existing above don't seem necessary.
Unfortunately, reverting this code affects the test_pass_partition_graph.py::test_multi_node_compiler test.
Why is that? The schedule_injective_from_existing that you import at L22 doesn't seem to be used, and the other change is just renaming.
I tried to remove it; the test fails with a crash:
  File "/home/sshtin/Dev/tvm/python/tvm/topi/x86/__init__.py", line 30, in <module>
    from .reduction import *
  File "/home/sshtin/Dev/tvm/python/tvm/topi/x86/reduction.py", line 21, in <module>
    from .injective import schedule_injective_from_existing
ImportError: Error importing plugin "tvm.testing.plugin": cannot import name 'schedule_injective_from_existing' from 'tvm.topi.x86.injective' (/home/sshtin/Dev/tvm/python/tvm/topi/x86/in
I don't know what that error means, but you shouldn't be making this change anyway. If the old schedule_injective_from_existing is used by other places, this change will make a different schedule_injective_from_existing (in tvm.topi.generic.injective) be called instead.
Why can't you use the old schedule_injective_from_existing at L150? And since tvm.topi.generic.injective.schedule_injective_from_existing is just a one-line schedule (tvm/python/tvm/topi/generic/injective.py, line 40 in 8131364: sch[out].fuse(*sch[out].op.axis)), you can directly copy that into your concat schedule, to avoid importing and the name-conflict issue.
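For illustration, a sketch of a concat schedule with that one-liner copied in rather than imported (names are illustrative, not the exact PR code):

from tvm import te

def schedule_concatenate(outs):
    """Schedule for the extern concatenate output(s)."""
    outs = [outs] if isinstance(outs, te.tensor.Tensor) else outs
    s = te.create_schedule([x.op for x in outs])
    for out in outs:
        # Extern stages have no axes to fuse; only compute stages do.
        if isinstance(out.op, te.tensor.ComputeOp):
            # Body of tvm.topi.generic.injective.schedule_injective_from_existing:
            s[out].fuse(*s[out].op.axis)
    return s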
Resolved
Resolved
@shtinsa If the opaque one is faster, can we make your new concat always opaque and avoid the hack in schedule_dataflow_rewrite.cc?
1. Concat implemented using extern_op. 2. New tests added. 3. Workaround to allow inlining extern ops with other layers.
Formally yes, but it would be necessary to disable some tests in python/relay/test_pass_fuse_ops.py which check the ability to fuse the concatenate layer.
Please improve the writing of the comments in schedule_dataflow_rewrite.cc.
python/tvm/topi/x86/injective.py (outdated)
@@ -60,14 +62,12 @@ def schedule_injective_from_existing(sch, out):
 def schedule_injective(outs):
-    """X86 schedule for injective op.
+    """X86 reference schedule for injective op.
Remove this diff
python/tvm/topi/x86/injective.py (outdated)
Parameters
----------
outs: Array of Tensor
    The computation graph description of injective in the format
not injective
"device": 1, | ||
"io_size_bytes": 4800, | ||
"workspace_size_bytes": 1248, | ||
} |
Why does this change affect workspace_size_bytes?
The concat code has two index arrays, and the entries of these arrays require 8 bytes each on x64 and 4 bytes on 32-bit systems. That should be the reason for the difference. Should I recheck it?
@@ -525,8 +548,13 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
   for (auto iv : compute->axis) {
     args.push_back(iv->var);
   }
+  if (ext_ops.find(stage->op) != ext_ops.end()) {
+    // sshtin: The extern op can try to get access to the input tensors as a row data,
+    // that can lead to error in TE scripts.
What do you mean by TE scripts?
Maybe here you mean "IR builder" specifically?
@@ -511,6 +511,29 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
   std::vector<bool> changed(sch->stages.size(), false);
   std::vector<Stmt> new_hybrid_body(sch->stages.size());
   std::vector<bool> hybrid_changed(sch->stages.size(), false);
+  // (sshtin): this workaround allows to inline extern ops.
+  // All inputs for extern op should not be inlined because inlining happens
+  // before generation of TE script for particular extern op. That may lead to
What do you mean by TE scripts?
These are the gen_ir() or gen_ir_1d() scripts. The invocation of these scripts may happen after the inlining pass.
How about simply dropping script? I don't think "TE script" is a common term, and "generation of TE" sounds OK.
+  // before generation of TE script for particular extern op. That may lead to
+  // crash during lowering or building stages.
+  // The problem description:
+  // In case of operations fuzing arguments inlining
fusing
+  // The problem description:
+  // In case of operations fuzing arguments inlining
+  // prevents creation of ProducerNode for extern operation.
+  // Instead of the creation it supposed to use operation argument as inlined buffer
it is supposed to
@@ -511,6 +511,29 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
   std::vector<bool> changed(sch->stages.size(), false);
   std::vector<Stmt> new_hybrid_body(sch->stages.size());
   std::vector<bool> hybrid_changed(sch->stages.size(), false);
+  // (sshtin): this workaround allows to inline extern ops.
Is this referring to "inline into extern ops", or "inline extern ops into their consumer"?
@@ -525,8 +548,13 @@ void InjectInline(ScheduleNode* sch, bool feature_extraction_mode) {
   for (auto iv : compute->axis) {
     args.push_back(iv->var);
   }
+  if (ext_ops.find(stage->op) != ext_ops.end()) {
+    // sshtin: The extern op can try to get access to the input tensors as a row data,
Sorry, one more comment: do you mean "raw data" here, or what is "row data" otherwise?
Oh yes :) it looks like a CV phantom. Fixed.
@shtinsa Need to run another job.
I extracted some concat workloads from a few vision models we test and ran them locally on a 5900X, and I got these results. Blue ("A") is the commit before this PR, and orange ("B") is this one. Perhaps the sizes are too small to benefit from the handwritten kernel? Workloads: https://gist.github.com/altanh/bccac6bf04393bbdae0f8440c59567e6
There can be several reasons for the issue (for example, the copied data block is not aligned to the SIMD size), so I'll check the performance of the kernels from the json.
dtype = data[0].dtype
out_shape = data[0].shape[:axis] + [join_size] + data[0].shape[axis + 1 :]
in_outers_tensor = const_vector(in_outers)
in_cumsum_tensor = const_vector(in_outers_cumsum, name="cumsum")
Hi @shtinsa, why make in_outers_tensor and in_cumsum_tensor te.tensor.Tensor here? The const_vector function brings select into the lowered TIR. In my test I kept them as lists of int and passed them to the callback function; the select was gone and it was faster than with te.tensor.Tensor.
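For reference, a sketch of the suggested int-list variant: a simplified 1-D extern concat where the sizes and offsets are plain Python ints captured by the IR-builder callback, so they are baked into the TIR as constants and no select chain is generated (names and structure are illustrative, not the exact PR code, which is axis-aware):

import tvm
from tvm import te

def concat_1d(inputs):
    sizes = [int(t.shape[0]) for t in inputs]
    offsets = [sum(sizes[:i]) for i in range(len(sizes))]

    def gen_ir(ins, outs):
        ib = tvm.tir.ir_builder.create()
        out = ib.buffer_ptr(outs[0])
        for inp, size, offset in zip(ins, sizes, offsets):
            src = ib.buffer_ptr(inp)
            # size and offset are compile-time constants here, not tensor loads.
            with ib.for_range(0, size, name="j") as j:
                out[offset + j] = src[j]
        return ib.get()

    return te.extern(
        [(sum(sizes),)],
        list(inputs),
        lambda ins, outs: gen_ir(ins, outs),
        dtype=inputs[0].dtype,
        name="concatenate_ext",
    )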
Hello @DzAvril, I analyzed the compiled .so files and the disassembled code, and the code block for one concatenation looks like this:
v384 = v311 & 0xFFFFFFFFFFFFFFF0LL;
_RSI = v304 + 4 * (v276 + v305);
_RDX = 0LL;
do
{
__asm
{
vmovups ymm0, ymmword ptr [rax+rdx*4-20h]
vmovups ymm1, ymmword ptr [rax+rdx*4]
vmovups ymmword ptr [rsi+rdx*4-20h], ymm0
vmovups ymmword ptr [rsi+rdx*4], ymm1
}
_RDX += 16LL;
}
while ( v384 != _RDX );
_RSI = v311 & 0xFFFFFFFFFFFFFFF0LL;
if ( v311 != v384 )
goto LABEL_209;
LABEL_211:
if ( v310 <= 0 )
goto LABEL_219;
v285 = v306[25];
if ( (unsigned __int64)v310 < 0x10 )
{
_RSI = 0LL;
LABEL_217:
_RDX = v302 + 4 * v285;
do
{
_RCX = v356;
__asm
{
vmovss xmm0, dword ptr [rcx+rsi*4]
vmovss dword ptr [rdx+rsi*4], xmm0
}
++_RSI;
}
while ( v310 != _RSI );
goto LABEL_219;
}
So formally I would add some unrolling to the copy loop and remove the tile evaluation for data blocks proportional to the SIMD width. But that is a very small improvement which should be implemented on the codegen side. Anyway, I'm going to check the performance of your proposals.
I can confirm this. We are currently working on a PR to change the behavior here.
Just as a reference, a comparison of the resulting C code with a plain list of ints:
for (int32_t j = 0; j < 4; ++j) {
  concatenate_ext[j] = placeholder[j];
}
for (int32_t j1 = 0; j1 < 4; ++j1) {
  concatenate_ext[(j1 + 4)] = placeholder1[j1];
}
return 0;
and with te.tensor.Tensor:
void* const_vector_let = (&(global_workspace_6_var[64]));
void* cumsum_let = (&(global_workspace_6_var[48]));
for (int32_t i = 0; i < 2; ++i) {
  ((int64_t*)const_vector_let)[i] = ((i == 1) ? (int64_t)4 : ((i == 0) ? (int64_t)4 : (int64_t)0));
}
for (int32_t i1 = 0; i1 < 2; ++i1) {
  ((int64_t*)cumsum_let)[i1] = ((i1 == 1) ? (int64_t)4 : (int64_t)0);
}
int64_t cumsum_let[2] = {0, 4};
for (int64_t j = 0; j < ((int64_t*)const_vector_let)[0]; ++j) {
  concatenate_ext[(((int64_t*)cumsum_let)[0] + j)] = placeholder[j];
}
for (int64_t j1 = 0; j1 < ((int64_t*)const_vector_let)[1]; ++j1) {
  concatenate_ext[(((int64_t*)cumsum_let)[1] + j1)] = placeholder1[j1];
}
return 0;
@UlrikHjort-Bosch @vdkhoi @MichaelJKlaiber
I see, that C code looks better, but I tested the "llvm" target, so that may be the difference in output.
At the same time I should note that the select operator is used for filling up the indices table, and this code can be excluded from the execution pipeline in the case of static shaping, i.e. these tensors can be implemented as const buffers pre-allocated within the data section. For dynamic shaping, though, this improvement may have an effect, especially for small data blocks.
How about I implement the other version and we discuss what is best for all purposes then?
I added a comment to #11800 (comment).
Hi @shtinsa, the idea of pre-computing the coordinate mapping between output and inputs is great. But it is a pity that the
Hello @DzAvril, the problem is related to the splitting of the TE expression into sub-areas, so with f(x1, ..., xk) as the output it is a bit problematic to determine the particular input tensor from the x1, ..., xk coordinates. Using a hashing function of the axis coordinates may require more intermediate memory.