Clang is not aware of a false dependency of LZCNT, TZCNT, POPCNT on destination register on some Intel CPUs #33216
Comments
Are tzcnt and lzcnt subject to the same problem? The gcc patch here seems to indicate they are: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=214279 |
I will test TZCNT and LZCNT and update here by tonight (Indian Standard Time) |
Adding Dave Kreitzer and Asaf Badouh. There had been a previous patch (https://reviews.llvm.org/D17289) to work around this issue, but that review stalled and I don't know what happened after that. Also moving this to the X86 library. |
I just tested LZCNT and TZCNT on Clang 4.0. These instructions do not appear to have a false dependency as stated in the gcc patch. Probably the bug was fixed in newer chips (the linked patch is almost 3 years old). However, GCC 6.3.1 seems to be generating suboptimal code for LZCNT and TZCNT (a ~40% performance hit compared to Clang 4.0). I will submit a bug report to GCC once I am done with all the tests and a PoC for GCC. |
Can you elaborate on the suboptimal code gcc is generating for lzcnt/tzcnt? |
Test code (TZCNT):
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>
int main(int argc, char* argv[]) {
using namespace std;
if (argc != 2) {
cerr << "usage: array_size in MB" << endl;
return -1;
}
uint64_t size = atol(argv[1])<<20;
uint64_t* buffer = new uint64_t[size/8];
char* charbuffer = reinterpret_cast<char*>(buffer);
for (unsigned i=0; i<size; ++i)
charbuffer[i] = rand()%256;
uint64_t count,duration;
chrono::time_point<chrono::system_clock> startP,endP;
{
startP = chrono::system_clock::now();
count = 0;
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with unsigned
for (unsigned i=0; i<size/8; i+=4) {
count += _tzcnt_u64(buffer[i]);
count += _tzcnt_u64(buffer[i+1]);
count += _tzcnt_u64(buffer[i+2]);
count += _tzcnt_u64(buffer[i+3]);
}
}
endP = chrono::system_clock::now();
duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
startP = chrono::system_clock::now();
count=0;
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with uint64_t
for (uint64_t i=0;i<size/8;i+=4) {
count += _tzcnt_u64(buffer[i]);
count += _tzcnt_u64(buffer[i+1]);
count += _tzcnt_u64(buffer[i+2]);
count += _tzcnt_u64(buffer[i+3]);
}
}
endP = chrono::system_clock::now();
duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
delete[] buffer;
}
Compile command: g++ tzcnt.cpp -o tzcnt_gcc -O3 -march=native -std=c++14
Output:
./tzcnt_clang 4
unsigned 5244900000 2.19225 sec 19.1324 GB/s
./tzcnt_gcc 4
unsigned 5244900000 2.50378 sec 16.7519 GB/s
Clang is generating code with consistent performance, which falls somewhere in the middle of the performance counts produced by the g++ code. I guess Clang too can be improved here.
Disassembly (Clang)
[code stripped]
.LBB0_13: # Parent Loop BB0_12 Depth=1
# => This Inner Loop Header: Depth=2
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rbx
lea esi, [rcx - 3]
tzcnt rsi, qword ptr [r15 + 8*rsi]
add rsi, rdx
lea edx, [rcx - 2]
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rsi
lea esi, [rcx - 1]
tzcnt rbx, qword ptr [r15 + 8*rsi]
add rbx, rdx
mov edx, ecx
add ecx, 4
cmp rdx, r13
jb .LBB0_13
# BB#14: # in Loop: Header=BB0_12 Depth=1
mov ecx, 4
xor edx, edx
.p2align 4, 0x90
.LBB0_15: # Parent Loop BB0_12 Depth=1
# => This Inner Loop Header: Depth=2
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rbx
lea esi, [rcx - 3]
tzcnt rsi, qword ptr [r15 + 8*rsi]
add rsi, rdx
lea edx, [rcx - 2]
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rsi
lea esi, [rcx - 1]
tzcnt rbx, qword ptr [r15 + 8*rsi]
add rbx, rdx
mov edx, ecx
add ecx, 4
cmp rdx, r13
jb .LBB0_15
[code stripped]
.LBB0_22: # Parent Loop BB0_21 Depth=1
# => This Inner Loop Header: Depth=2
tzcnt rdx, qword ptr [r15 + 8*rcx]
add rdx, rbx
tzcnt rsi, qword ptr [r15 + 8*rcx + 8]
add rsi, rdx
tzcnt rdx, qword ptr [r15 + 8*rcx + 16]
add rdx, rsi
tzcnt rsi, qword ptr [r15 + 8*rcx + 24]
add rsi, rdx
tzcnt rdx, qword ptr [r15 + 8*rcx + 32]
add rdx, rsi
tzcnt rsi, qword ptr [r15 + 8*rcx + 40]
add rsi, rdx
tzcnt rdx, qword ptr [r15 + 8*rcx + 48]
add rdx, rsi
tzcnt rbx, qword ptr [r15 + 8*rcx + 56]
add rbx, rdx
add rcx, 8
cmp rcx, r13
jb .LBB0_22
[code stripped]
Disassembly (GCC)
[code stripped]
.L8:
xor edi, edi
tzcnt rdi, QWORD PTR 0[rbp+rax*8]
lea eax, 1[rdx]
xor esi, esi
lea ecx, 2[rdx]
tzcnt rax, QWORD PTR 0[rbp+rax*8]
tzcnt rsi, QWORD PTR 0[rbp+rcx*8]
lea ecx, 3[rdx]
tzcnt rcx, QWORD PTR 0[rbp+rcx*8]
add rax, rdi
add rax, rsi
add rax, rcx
add r14, rax
lea eax, 4[rdx]
mov rdx, rax
cmp rax, r12
jb .L8
[code stripped]
.L13:
xor edi, edi
xor eax, eax
tzcnt rdi, QWORD PTR [rdx]
xor esi, esi
tzcnt rax, QWORD PTR 8[rdx]
xor ecx, ecx
add rdx, 32
tzcnt rsi, QWORD PTR -16[rdx]
tzcnt rcx, QWORD PTR -8[rdx]
add rax, rdi
add rax, rsi
add rax, rcx
add r13, rax
cmp rdx, r8
jne .L13
[code stripped]
In case of Clang, both loops produce similar performance, while with GCC, the second loop is much faster than the first one. I guess both compilers can be improved a bit in this case. |
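Not a compiler-side fix, but a source-level observation: both loops accumulate into a single count variable, so every tzcnt result feeds one serial add chain. A minimal sketch (hypothetical sum_tzcnt helper; assumes the word count is a multiple of 4, as in the test above) that splits the reduction into independent accumulators, letting the four tzcnt/add pairs overlap regardless of how the compiler assigns destination registers:
#include <cstdint>
#include <x86intrin.h>

// Sketch only: same reduction as the inner loops above, but with four
// independent accumulators so each tzcnt/add pair forms its own chain.
uint64_t sum_tzcnt(const uint64_t* buffer, uint64_t words) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i + 3 < words; i += 4) {
        c0 += _tzcnt_u64(buffer[i]);
        c1 += _tzcnt_u64(buffer[i + 1]);
        c2 += _tzcnt_u64(buffer[i + 2]);
        c3 += _tzcnt_u64(buffer[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}
Whether this helps in practice would need measuring with the same harness; it does not address the false dependency itself, only the reduction's serial chain.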
Test code (LZCNT): replace _tzcnt_u64 with _lzcnt_u64
Output:
./lzcnt_clang 1
unsigned 1306770000 0.502319 sec 20.8747 GB/s
./lzcnt_gcc 1
unsigned 1306770000 0.543904 sec 19.2787 GB/s
Again, Clang is generating code with consistent performance, which falls somewhere in the middle of the performance counts produced by the g++ code.
Disassembly (Clang)
[code stripped]
# BB#14: # in Loop: Header=BB0_12 Depth=1
.LBB0_15: # Parent Loop BB0_12 Depth=1
Disassembly (GCC)
[code stripped] |
AFAIK, my comments in https://reviews.llvm.org/D17289 reflect the latest status/plans: The best/cheapest way to fix the popcnt false dependence is via biased register selection in the RA such that the false dependence is "masked" by a true dependence. That won't work in this case, though, since all the input registers to the popcnt instructions interfere with the outputs. Failing that, we should conditionally generate an xor to break the dependence. The logic for deciding when it is profitable to generate the xor already exists in the ExecutionDepsFix pass for instructions like X86::SQRTSDr. So the plan was to enhance that pass to catch this popcnt case. At one point, this was on Marina's to-do, so I am adding her. |
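For readers following along, the xor approach works because xor-zeroing a register is recognized by the register renamer as a dependency-breaking idiom, so popcnt no longer has to wait for the stale value of its destination. A minimal sketch with GCC/Clang-style inline asm (hypothetical helpers illustrating the idiom, not the compiler's actual codegen):
#include <cstdint>

// Mirrors the problematic pattern: popcnt only writes %0, but on affected
// CPUs it stalls until the previous value of that register is available.
static inline uint64_t popcnt_plain(uint64_t x) {
    uint64_t r;
    asm("popcnt %1, %0" : "=r"(r) : "r"(x));
    return r;
}

// The workaround being described: zero the destination first. The xor of
// the 32-bit alias (%k0) clears the whole register and breaks the false
// dependency at rename time; "=&r" keeps the destination out of the input.
static inline uint64_t popcnt_dep_broken(uint64_t x) {
    uint64_t r;
    asm("xor %k0, %k0\n\tpopcnt %1, %0" : "=&r"(r) : "r"(x));
    return r;
}
The second helper emits the extra xor unconditionally, which is exactly why the pass needs a profitability check rather than inserting it everywhere.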
Looks like the execution dependency fix pass currently only wants to deal with VR128 registers and their superclasses. Not sure if we should add another copy for GR32 or teach its constructor to take a list of register classes. Or maybe the partial/undef register parts should be separated from the execution domain? |
Separating the false dependence avoidance parts from the execution domain parts seems like a good idea to me. But I admit I haven't studied the details of the pass enough to know what we are saving by combining the two. It is also worth pointing out that the X86FixupBWInsts pass plays a role here. If we were to separate out the false dependence avoidance functionality, it would be good to fold in the X86FixupBWInsts functionality, since it is trying to accomplish the same thing. |
*** Bug llvm/llvm-bugzilla-archive#34936 has been marked as a duplicate of this bug. *** |
FYI, we may be hitting this issue with LZCNT these days. Any progress on resolving this? |
Stand-alone test case showing regression for -march=haswell |
I've started working on a solution. Basically, the infrastructure is there - ExecutionDepsFix has both the ability to add an XOR if there is not enough clearance (e.g. not enough instructions since the last write to this reg) and the ability to hide a false dependency behind a true dependency if possible. Another issue that was suggested is breaking ExecutionDepsFix into 2 passes, one that handles execution domains and one that handles dependencies. In the long run it is probably better to separate the 2 passes, even if we cannot combine it with X86FixupBWInsts. Do you agree? |
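To make the "clearance" notion concrete for readers of this thread, here is a conceptual sketch (plain C++ illustration only, not the LLVM pass or its API; the register count and threshold are made-up values): track how recently each register was written, and plan a dependency-breaking xor when an instruction with a false output dependency picks a register written too recently.
#include <array>
#include <cstdint>
#include <vector>

struct Inst {
    int def;             // register this instruction writes
    bool has_false_dep;  // e.g. popcnt/lzcnt/tzcnt reading a stale destination
};

constexpr int kNumRegs = 16;       // assumed GPR count for the illustration
constexpr int kMinClearance = 16;  // assumed threshold; the real pass tunes this

// Returns, per instruction, whether a dependency-breaking xor should precede it.
std::vector<bool> plan_xor_insertion(const std::vector<Inst>& block) {
    std::array<int, kNumRegs> last_write;
    last_write.fill(-kMinClearance);  // treat every register as written long ago
    std::vector<bool> needs_xor(block.size(), false);
    for (int i = 0; i < static_cast<int>(block.size()); ++i) {
        const Inst& inst = block[i];
        if (inst.has_false_dep && i - last_write[inst.def] < kMinClearance)
            needs_xor[i] = true;      // not enough clearance: break the dependency
        last_write[inst.def] = i;
    }
    return needs_xor;
}
The real pass works on MachineInstrs with per-subtarget clearance values and can also try to hide the false dependency behind a true one; the sketch only shows the bookkeeping shape.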
I think splitting it into a separate pass makes sense. Do any other targets implement getPartialRegUpdateClearance/getUndefRegUpdateClearance? If so, those targets will also need to run the new pass, which would mean we couldn't merge it into FixupBWInsts anyway. |
As far as I can see, ARM implements it in ARMBaseInstrInfo::getPartialRegUpdateClearance(). Regarding getUndefRegClearance(), it seems only the x86 target implements this. |
Running iaca over the uploaded lzcnt_gcc.s and lzcnt_clang.s gives the following results:
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Throughput Analysis Report
Block Throughput: 4.71 Cycles    Throughput Bottleneck: FrontEnd
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Throughput Analysis Report
Block Throughput: 4.24 Cycles    Throughput Bottleneck: FrontEnd
Is IACA not considering the false dependency in its computations? Uploading the detailed results. |
My patch is almost done and it handles popcnt, lzcnt, tzcnt. |
I've uploaded the following reviews:
https://reviews.llvm.org/D40330 - Separate ExecutionDepsFix into 3 parts
These reviews are built one on top of the other and try to make the logic separation and file renaming readable. |
Thank you for your work on this issue, Marina! I look forward to seeing it land. |
Committed a fix in rL323096. |
As far as I can tell, clang still does not break the dependency in the reproduction case I attached in comment 14. Minimally:
clang version 7.0.0 (trunk 327823), -O2 -march=haswell:
g++ 8.0.1 20180319, -O2 -march=haswell:
The failure to break the dependency chain causes a measurable degradation in performance when the function is called in a loop. I tested on one Haswell machine and one Broadwell machine. Worse, clang is shooting itself in the foot: if you compile the same code but target an older microarchitecture with no lzcnt (-march=corei7, for example), clang emits a bsr instruction instead, which doesn't appear to suffer from this false dependency issue. |
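For context, a hypothetical reduction in the spirit of the one being discussed (not the actual attachment; the helper is kept out-of-line purely to make the repeated reuse of the lzcnt destination register visible, as the next comments explain):
#include <cstddef>
#include <cstdint>

// With -O2 -march=haswell, clang lowers __builtin_clzll to lzcnt, whose
// destination register is reused on every call; on targets without lzcnt it
// falls back to bsr (which, per the comment above, did not show the slowdown).
__attribute__((noinline)) uint64_t leading_zeros(uint64_t x) {
    return x ? static_cast<uint64_t>(__builtin_clzll(x)) : 64;
}

uint64_t sum_leading_zeros(const uint64_t* data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += leading_zeros(data[i]);  // lzcnt result feeds the add each iteration
    return total;
}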
I'll look into this, but why are you preventing the inlining of the function? |
We have an issue with breaking false dependencies on function entry/calls (well, basically we don't do that while other compilers do). What you described maps to this, and not to an lzcnt-specific issue. |
Only for the sake of having a minimal reproduction case. We've run into this issue in normal code that is not forcing or preventing inlining. |
Thanks for the explanation. Would you like me to file a separate bug for that? |
I just tested this with clang 7.0.1, and the compiler is still generating suboptimal code for TZCNT and LZCNT. A separate bug report has been filed for POPCNT, where the compiler generates suboptimal vectorized code for an Intel(R) Core(TM) i7-6700HQ CPU. Please refer to #40346. |
mentioned in issue llvm/llvm-bugzilla-archive#34936 |
Extended Description
The POPCNT instruction on Intel Skylake CPUs seems to have a false dependency on its destination register, resulting in a performance loss if the destination register is used immediately after POPCNT. The same bug has been present in Sandy Bridge, Ivy Bridge, and Haswell processors as well.
While G++ seems to be aware of the dependency, it largely avoids generating code where the false dependency is triggered. However, clang-generated code gets hit immediately, with the false dependency coming up again and again.
Platform Details
CPU: Intel(R) Core(TM) i7-6700HQ CPU
OS: Arch Linux x86_64 Kernel Version 4.11.9-1
Compilers: g++ (GCC) 7.1.1 20170630
clang version 4.0.1 (tags/RELEASE_401/final)
Test Code
Code Generated by Clang
Compilation: clang++ poc.cpp -o poc_clang -O3 -march=native -std=c++14
[code stripped]
[code stripped]
In the above generated code, the destination register of POPCNT is used in the next instruction (write only). Due to the false dependency, the next instruction does not execute until the destination register is ready for reading, even though we are only writing to it.
Code Generated by GCC
Compilation: g++ poc.cpp -o poc_gcc -O3 -march=native -std=c++14
[code stripped]
[code stripped]
In the code generated by GCC, the false dependency is triggered in only 2 cases (in clang it is 7), resulting in faster performance.
The test code, the assembly code dumped from the compiler, and the LLVM IR are attached herewith (in a ZIP).