Clang is not aware of a false dependency of LZCNT, TZCNT, POPCNT on destination register on some Intel CPUs #33216
Comments
Are tzcnt and lzcnt subject to the same problem? The gcc patch here seems to indicate they are: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=214279 |
I will test TZCNT and LZCNT and update here by tonight (Indian Standard Time) |
Adding Dave Kreitzer and Asaf Badouh. There had been a previous patch (https://reviews.llvm.org/D17289) to work around this issue, but that review stalled and I don't know what happened after that. Also moving this to the X86 library. |
I just tested LZCNT and TZCNT on Clang 4.0. These instructions do not appear to have a false dependency as stated in the gcc patch. Probably the bug was fixed in newer chips (the linked patch is almost 3 years old). However, GCC 6.3.1 seems to be generating suboptimal code for LZCNT and TZCNT (a ~40% performance hit compared to Clang 4.0). I will submit a bug report to GCC once I am done with all the tests and a PoC for GCC. |
Can you elaborate on the suboptimal code gcc is generating for lzcnt/tzcnt? |
Test code (TZCNT):
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>
int main(int argc, char* argv[]) {
using namespace std;
if (argc != 2) {
cerr << "usage: array_size in MB" << endl;
return -1;
}
uint64_t size = atol(argv[1])<<20;
uint64_t* buffer = new uint64_t[size/8];
char* charbuffer = reinterpret_cast<char*>(buffer);
for (unsigned i=0; i<size; ++i)
charbuffer[i] = rand()%256;
uint64_t count,duration;
chrono::time_point<chrono::system_clock> startP,endP;
{
startP = chrono::system_clock::now();
count = 0;
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with unsigned
for (unsigned i=0; i<size/8; i+=4) {
count += _tzcnt_u64(buffer[i]);
count += _tzcnt_u64(buffer[i+1]);
count += _tzcnt_u64(buffer[i+2]);
count += _tzcnt_u64(buffer[i+3]);
}
}
endP = chrono::system_clock::now();
duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
{
startP = chrono::system_clock::now();
count=0;
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with uint64_t
for (uint64_t i=0;i<size/8;i+=4) {
count += _tzcnt_u64(buffer[i]);
count += _tzcnt_u64(buffer[i+1]);
count += _tzcnt_u64(buffer[i+2]);
count += _tzcnt_u64(buffer[i+3]);
}
}
endP = chrono::system_clock::now();
duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
<< (10000.0*size)/(duration) << " GB/s" << endl;
}
delete[] buffer;
}
Compile command: g++ tzcnt.cpp -o tzcnt_gcc -O3 -march=native -std=c++14
Output:
./tzcnt_clang 4
unsigned 5244900000 2.19225 sec 19.1324 GB/s
./tzcnt_gcc 4
unsigned 5244900000 2.50378 sec 16.7519 GB/s
Clang is generating code with consistent performance, which falls somewhere in the middle of the performance counts produced by the g++ code. I guess Clang too can be improved here.
Disassembly (Clang)
[code stripped]
.LBB0_13: # Parent Loop BB0_12 Depth=1
# => This Inner Loop Header: Depth=2
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rbx
lea esi, [rcx - 3]
tzcnt rsi, qword ptr [r15 + 8*rsi]
add rsi, rdx
lea edx, [rcx - 2]
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rsi
lea esi, [rcx - 1]
tzcnt rbx, qword ptr [r15 + 8*rsi]
add rbx, rdx
mov edx, ecx
add ecx, 4
cmp rdx, r13
jb .LBB0_13
# BB#14: # in Loop: Header=BB0_12 Depth=1
mov ecx, 4
xor edx, edx
.p2align 4, 0x90
.LBB0_15: # Parent Loop BB0_12 Depth=1
# => This Inner Loop Header: Depth=2
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rbx
lea esi, [rcx - 3]
tzcnt rsi, qword ptr [r15 + 8*rsi]
add rsi, rdx
lea edx, [rcx - 2]
tzcnt rdx, qword ptr [r15 + 8*rdx]
add rdx, rsi
lea esi, [rcx - 1]
tzcnt rbx, qword ptr [r15 + 8*rsi]
add rbx, rdx
mov edx, ecx
add ecx, 4
cmp rdx, r13
jb .LBB0_15
[code stripped]
.LBB0_22: # Parent Loop BB0_21 Depth=1
# => This Inner Loop Header: Depth=2
tzcnt rdx, qword ptr [r15 + 8*rcx]
add rdx, rbx
tzcnt rsi, qword ptr [r15 + 8*rcx + 8]
add rsi, rdx
tzcnt rdx, qword ptr [r15 + 8*rcx + 16]
add rdx, rsi
tzcnt rsi, qword ptr [r15 + 8*rcx + 24]
add rsi, rdx
tzcnt rdx, qword ptr [r15 + 8*rcx + 32]
add rdx, rsi
tzcnt rsi, qword ptr [r15 + 8*rcx + 40]
add rsi, rdx
tzcnt rdx, qword ptr [r15 + 8*rcx + 48]
add rdx, rsi
tzcnt rbx, qword ptr [r15 + 8*rcx + 56]
add rbx, rdx
add rcx, 8
cmp rcx, r13
jb .LBB0_22
[code stripped]
Disassembly (GCC)
[code stripped]
.L8:
xor edi, edi
tzcnt rdi, QWORD PTR 0[rbp+rax*8]
lea eax, 1[rdx]
xor esi, esi
lea ecx, 2[rdx]
tzcnt rax, QWORD PTR 0[rbp+rax*8]
tzcnt rsi, QWORD PTR 0[rbp+rcx*8]
lea ecx, 3[rdx]
tzcnt rcx, QWORD PTR 0[rbp+rcx*8]
add rax, rdi
add rax, rsi
add rax, rcx
add r14, rax
lea eax, 4[rdx]
mov rdx, rax
cmp rax, r12
jb .L8
[code stripped]
.L13:
xor edi, edi
xor eax, eax
tzcnt rdi, QWORD PTR [rdx]
xor esi, esi
tzcnt rax, QWORD PTR 8[rdx]
xor ecx, ecx
add rdx, 32
tzcnt rsi, QWORD PTR -16[rdx]
tzcnt rcx, QWORD PTR -8[rdx]
add rax, rdi
add rax, rsi
add rax, rcx
add r13, rax
cmp rdx, r8
jne .L13
[code stripped]
In case of Clang, both loops produce similar performance, while with GCC, the second loop is much faster than the first one. I guess both compilers can be improved a bit in this case. |
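Not a compiler-side fix, but a source-level observation: both loops accumulate into a single count variable, so every tzcnt result feeds one serial add chain. A minimal sketch (hypothetical sum_tzcnt helper; assumes the word count is a multiple of 4, as in the test above) that splits the reduction into independent accumulators, letting the four tzcnt/add pairs overlap regardless of how the compiler assigns destination registers:
#include <cstdint>
#include <x86intrin.h>

// Sketch only: same reduction as the inner loops above, but with four
// independent accumulators so each tzcnt/add pair forms its own chain.
uint64_t sum_tzcnt(const uint64_t* buffer, uint64_t words) {
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i + 3 < words; i += 4) {
        c0 += _tzcnt_u64(buffer[i]);
        c1 += _tzcnt_u64(buffer[i + 1]);
        c2 += _tzcnt_u64(buffer[i + 2]);
        c3 += _tzcnt_u64(buffer[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}
Whether this helps in practice would need measuring with the same harness; it does not address the false dependency itself, only the reduction's serial chain.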
Test code (LZCNT): replace _tzcnt_u64 with _lzcnt_u64
Output:
./lzcnt_clang 1
unsigned 1306770000 0.502319 sec 20.8747 GB/s
./lzcnt_gcc 1
unsigned 1306770000 0.543904 sec 19.2787 GB/s
Again, Clang is generating code with consistent performance, which falls somewhere in the middle of the performance counts produced by the g++ code.
Disassembly (Clang)
[code stripped]
# BB#14: # in Loop: Header=BB0_12 Depth=1
.LBB0_15: # Parent Loop BB0_12 Depth=1
Disassembly (GCC)
[code stripped] |
AFAIK, my comments in https://reviews.llvm.org/D17289 reflect the latest status/plans: The best/cheapest way to fix the popcnt false dependence is via biased register selection in the RA such that the false dependence is "masked" by a true dependence. That won't work in this case, though, since all the input registers to the popcnt instructions interfere with the outputs. Failing that, we should conditionally generate an xor to break the dependence. The logic for deciding when it is profitable to generate the xor already exists in the ExecutionDepsFix pass for instructions like X86::SQRTSDr. So the plan was to enhance that pass to catch this popcnt case. At one point, this was on Marina's to-do, so I am adding her. |
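For readers following along, the xor approach works because xor-zeroing a register is recognized by the register renamer as a dependency-breaking idiom, so popcnt no longer has to wait for the stale value of its destination. A minimal sketch with GCC/Clang-style inline asm (hypothetical helpers illustrating the idiom, not the compiler's actual codegen):
#include <cstdint>

// Mirrors the problematic pattern: popcnt only writes %0, but on affected
// CPUs it stalls until the previous value of that register is available.
static inline uint64_t popcnt_plain(uint64_t x) {
    uint64_t r;
    asm("popcnt %1, %0" : "=r"(r) : "r"(x));
    return r;
}

// The workaround being described: zero the destination first. The xor of
// the 32-bit alias (%k0) clears the whole register and breaks the false
// dependency at rename time; "=&r" keeps the destination out of the input.
static inline uint64_t popcnt_dep_broken(uint64_t x) {
    uint64_t r;
    asm("xor %k0, %k0\n\tpopcnt %1, %0" : "=&r"(r) : "r"(x));
    return r;
}
The second helper emits the extra xor unconditionally, which is exactly why the pass needs a profitability check rather than inserting it everywhere.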
Looks like the execution dependency fix pass currently only wants to deal with VR128 registers and their superclasses. Not sure if we should add another copy for GR32 or teach its constructor to take a list of register classes. Or maybe the partial/undef register parts should be separated from the execution domain? |
Separating the false dependence avoidance parts from the execution domain parts seems like a good idea to me. But I admit I haven't studied the details of the pass enough to know what we are saving by combining the two. It is also worth pointing out that the X86FixupBWInsts pass plays a role here. If we were to separate out the false dependence avoidance functionality, it would be good to fold in the X86FixupBWInsts functionality, since it is trying to accomplish the same thing. |
*** Bug llvm/llvm-bugzilla-archive#34936 has been marked as a duplicate of this bug. *** |
FYI, we may be hitting this issue with LZCNT these days. Any progress on resolving this? |
Stand-alone test case showing regression for -march=haswell |
I've started working on a solution. Basically, the infrastructure is there - ExecutionDepsFix has both the ability to add an XOR if there is not enough clearance (e.g. not enough instructions since the last write to this reg) and the ability to hide a false dependency behind a true dependency if possible. Another issue that was suggested is breaking ExecutionDepsFix into 2 passes, one that handles execution domains and one that handles dependencies. In the long run it is probably better to separate the 2 passes, even if we cannot combine it with X86FixupBWInsts. Do you agree? |
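To make the "clearance" notion concrete for readers of this thread, here is a conceptual sketch (plain C++ illustration only, not the LLVM pass or its API; the register count and threshold are made-up values): track how recently each register was written, and plan a dependency-breaking xor when an instruction with a false output dependency picks a register written too recently.
#include <array>
#include <cstdint>
#include <vector>

struct Inst {
    int def;             // register this instruction writes
    bool has_false_dep;  // e.g. popcnt/lzcnt/tzcnt reading a stale destination
};

constexpr int kNumRegs = 16;       // assumed GPR count for the illustration
constexpr int kMinClearance = 16;  // assumed threshold; the real pass tunes this

// Returns, per instruction, whether a dependency-breaking xor should precede it.
std::vector<bool> plan_xor_insertion(const std::vector<Inst>& block) {
    std::array<int, kNumRegs> last_write;
    last_write.fill(-kMinClearance);  // treat every register as written long ago
    std::vector<bool> needs_xor(block.size(), false);
    for (int i = 0; i < static_cast<int>(block.size()); ++i) {
        const Inst& inst = block[i];
        if (inst.has_false_dep && i - last_write[inst.def] < kMinClearance)
            needs_xor[i] = true;      // not enough clearance: break the dependency
        last_write[inst.def] = i;
    }
    return needs_xor;
}
The real pass works on MachineInstrs with per-subtarget clearance values and can also try to hide the false dependency behind a true one; the sketch only shows the bookkeeping shape.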
I think splitting it into a separate pass makes sense. Do any other targets implement getPartialRegUpdateClearance/getUndefRegUpdateClearance? If so, those targets will also need to run the new pass, which would mean we couldn't merge it into FixupBWInsts anyway. |
As far as I can see, ARM implements it in ARMBaseInstrInfo::getPartialRegUpdateClearance(). Regarding getUndefRegClearance(), it seems only the x86 target implements this. |
Running iaca over the uploaded lzcnt_gcc.s and lzcnt_clang.s gives the following results:
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Throughput Analysis Report
Block Throughput: 4.71 Cycles    Throughput Bottleneck: FrontEnd
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Throughput Analysis Report
Block Throughput: 4.24 Cycles    Throughput Bottleneck: FrontEnd
Is IACA not considering the false dependency in its computations? Uploading the detailed results. |
My patch is almost done and it handles popcnt, lzcnt, tzcnt. |
I've uploaded the following reviews:
https://reviews.llvm.org/D40330 - Separate ExecutionDepsFix into 3 parts
These reviews are built one on top of the other and try to make the logic separation and file renaming readable. |
Thank you for your work on this issue, Marina! I look forward to seeing it land. |
Committed a fix in rL323096. |
As far as I can tell, clang still does not break the dependency in the reproduction case I attached in comment 14. Minimally:
clang version 7.0.0 (trunk 327823), -O2 -march=haswell:
g++ 8.0.1 20180319, -O2 -march=haswell:
The failure to break the dependency chain causes a measurable degradation in performance when the function is called in a loop. I tested on one Haswell machine and one Broadwell machine. Worse, clang is shooting itself in the foot: if you compile the same code but target an older microarchitecture with no lzcnt (-march=corei7, for example), clang emits a bsr instruction instead, which doesn't appear to suffer from this false dependency issue. |
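For context, a hypothetical reduction in the spirit of the one being discussed (not the actual attachment; the helper is kept out-of-line purely to make the repeated reuse of the lzcnt destination register visible, as the next comments explain):
#include <cstddef>
#include <cstdint>

// With -O2 -march=haswell, clang lowers __builtin_clzll to lzcnt, whose
// destination register is reused on every call; on targets without lzcnt it
// falls back to bsr (which, per the comment above, did not show the slowdown).
__attribute__((noinline)) uint64_t leading_zeros(uint64_t x) {
    return x ? static_cast<uint64_t>(__builtin_clzll(x)) : 64;
}

uint64_t sum_leading_zeros(const uint64_t* data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += leading_zeros(data[i]);  // lzcnt result feeds the add each iteration
    return total;
}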
I'll look into this, but why are you preventing the inlining of the function? |
We have an issue with breaking false dependencies on function entry/calls (well, basically we don't do that while other compilers do). What you described maps to this, and not to an lzcnt-specific issue. |
Only for the sake of having a minimal reproduction case. We've run into this issue in normal code that is not forcing or preventing inlining. |
Thanks for the explanation. Would you like me to file a separate bug for that? |
I just tested this with clang 7.0.1, and the compiler is still generating suboptimal code for TZCNT and LZCNT. A separate bug report has been filed for POPCNT, where the compiler generates suboptimal vectorized code for an Intel(R) Core(TM) i7-6700HQ CPU. Please refer to #40346. |
mentioned in issue llvm/llvm-bugzilla-archive#34936 |
Extended Description
The POPCNT instruction on Intel Skylake CPUs seems to have a false dependency on its destination register, resulting in a performance loss if the destination register is used immediately after POPCNT. The same bug has been present in Sandy Bridge, Ivy Bridge, and Haswell processors as well.
While G++ seems to be aware of the dependency, it largely avoids generating code where the false dependency is triggered. However, clang-generated code gets hit immediately, with the false dependency coming up again and again.
Platform Details
CPU: Intel(R) Core(TM) i7-6700HQ CPU
OS: Arch Linux x86_64 Kernel Version 4.11.9-1
Compilers: g++ (GCC) 7.1.1 20170630
clang version 4.0.1 (tags/RELEASE_401/final)
Test Code
Code Generated by Clang
Compilation: clang++ poc.cpp -o poc_clang -O3 -march=native -std=c++14
[code stripped]
[code stripped]
In the above generated code, the destination register of POPCNT is used in the next instruction (write only). Due to the false dependency, the next instruction does not execute until the destination register is ready for reading, even though we are only writing to it.
Code Generated by GCC
Compilation: g++ poc.cpp -o poc_gcc -O3 -march=native -std=c++14
[code stripped]
[code stripped]
In the code generated by GCC, the false dependency is triggered in only 2 cases (in clang it is 7), resulting in faster performance.
The test code, the assembly code dumped from the compiler, and the LLVM IR are attached herewith (in a ZIP).