
Clang is not aware of a false dependency of LZCNT, TZCNT, POPCNT on destination register on some Intel CPUs #33216

Open
adhokshajmishra mannequin opened this issue Jul 21, 2017 · 31 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@adhokshajmishra
Mannequin

adhokshajmishra mannequin commented Jul 21, 2017

Bugzilla Link 33869
Version 7.0
OS Linux
Attachments Test source code, dumped assembler source code, and LLVM IR code
CC @atdt,@chandlerc,@topperc,@DougGregor,@echristo,@hfinkel,@jlebar,@RKSimon,@rotateright,@ZviRackover

Extended Description

The POPCNT instruction on Intel Skylake CPUs seems to have a false dependency on its destination register, resulting in a performance loss when the destination register is used immediately after POPCNT. The same bug has been present in Sandy Bridge, Ivy Bridge, and Haswell processors as well.

While G++ seems to be aware of the dependency and does not generate code where the false dependency is triggered, clang-generated code gets hit by the false dependency again and again.

Platform Details

CPU: Intel(R) Core(TM) i7-6700HQ CPU
OS: Arch Linux x86_64 Kernel Version 4.11.9-1
Compilers: g++ (GCC) 7.1.1 20170630
clang version 4.0.1 (tags/RELEASE_401/final)

Test Code

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;

    uint64_t size = 10<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "Counter\t"  << count << "\nSpeed\t" << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer; // memory from new[] must be released with delete[], not free()
}

Code Generated by Clang

Compilation: clang++ poc.cpp -o poc_clang -O3 -march=native -std=c++14

[code stripped]

        popcnt	rcx, qword ptr [r15 + 8*rax]
	add	rcx, rbx
	popcnt	rdx, qword ptr [r15 + 8*rax + 8]
	add	rdx, rcx
	popcnt	rcx, qword ptr [r15 + 8*rax + 16]
	add	rcx, rdx
	popcnt	rdx, qword ptr [r15 + 8*rax + 24]
	add	rdx, rcx
	popcnt	rcx, qword ptr [r15 + 8*rax + 32]
	add	rcx, rdx
	popcnt	rdx, qword ptr [r15 + 8*rax + 40]
	add	rdx, rcx
	popcnt	rcx, qword ptr [r15 + 8*rax + 48]
	add	rcx, rdx
	popcnt	rbx, qword ptr [r15 + 8*rax + 56]

[code stripped]

In the generated code above, the destination register of each POPCNT was written by the immediately preceding instructions. Although POPCNT only writes its destination, the CPU treats the destination as an input as well, so each POPCNT stalls until the previous value of that register is ready, even though it is only being overwritten.

Code Generated by GCC

Compilation: g++ poc.cpp -o poc_gcc -O3 -march=native -std=c++14

[code stripped]

        xor	eax, eax
	xor	ecx, ecx
	popcnt	rax, QWORD PTR [rdx]
	popcnt	rcx, QWORD PTR 8[rdx]
	add	rax, rcx
	xor	ecx, ecx
	popcnt	rcx, QWORD PTR 16[rdx]
	add	rdx, 32
	add	rax, rcx
	xor	ecx, ecx
	popcnt	rcx, QWORD PTR -8[rdx]
	add	rax, rcx
	add	r12, rax
	cmp	rdx, r13

[code stripped]

In the code generated by GCC, the false dependency is triggered in only 2 cases (versus 7 in the clang code), resulting in faster performance.

The test code, the assembly dumped from each compiler, and the LLVM IR code are attached herewith (in a ZIP).

@topperc
Collaborator

topperc commented Jul 21, 2017

Are tzcnt and lzcnt subject to the same problem? The GCC patch for this seems to indicate they are: https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=214279

@adhokshajmishra
Mannequin Author

adhokshajmishra mannequin commented Jul 21, 2017

I will test TZCNT and LZCNT and update here by tonight (Indian Standard Time)

@topperc
Collaborator

topperc commented Jul 21, 2017

Adding Dave Kreitzer and Asaf Badouh. There had been a previous patch (https://reviews.llvm.org/D17289) to work around this issue, but that review stalled and I don't know what happened after that.

Also moving this to the X86 library.

@adhokshajmishra
Mannequin Author

adhokshajmishra mannequin commented Jul 21, 2017

I just tested LZCNT and TZCNT with Clang 4.0. These instructions do not appear to have the false dependency described in the GCC patch. Probably the bug was fixed in newer chips (the linked patch is almost 3 years old).

However, GCC 6.3.1 seems to be generating suboptimal code for LZCNT and TZCNT (a performance hit of ~40% compared to Clang 4.0). I will submit a bug report to GCC once I am done with all the tests and a PoC for GCC.

@topperc
Collaborator

topperc commented Jul 21, 2017

Can you elaborate on the suboptimal code gcc is generating for lzcnt/tzcnt?

@adhokshajmishra
Mannequin Author

adhokshajmishra mannequin commented Jul 21, 2017

Test code (TZCNT)

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _tzcnt_u64(buffer[i]);
                count += _tzcnt_u64(buffer[i+1]);
                count += _tzcnt_u64(buffer[i+2]);
                count += _tzcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _tzcnt_u64(buffer[i]);
                count += _tzcnt_u64(buffer[i+1]);
                count += _tzcnt_u64(buffer[i+2]);
                count += _tzcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer; // memory from new[] must be released with delete[], not free()
}

Compile commands:
g++ tzcnt.cpp -o tzcnt_gcc -O3 -march=native -std=c++14
clang++ tzcnt.cpp -o tzcnt_clang -O3 -march=native -std=c++14

Output:

./tzcnt_clang 4

unsigned 5244900000 2.19225 sec 19.1324 GB/s
uint64_t 5244900000 2.15938 sec 19.4236 GB/s

./tzcnt_gcc 4

unsigned 5244900000 2.50378 sec 16.7519 GB/s
uint64_t 5244900000 1.88483 sec 22.2529 GB/s

Clang is generating code with consistent performance which falls somewhere in the middle of the numbers produced by the g++ code. I guess Clang too can be improved here.

Disassembly (Clang)

[code stripped]

.LBB0_13:                               #   Parent Loop BB0_12 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	tzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rbx
	lea	esi, [rcx - 3]
	tzcnt	rsi, qword ptr [r15 + 8*rsi]
	add	rsi, rdx
	lea	edx, [rcx - 2]
	tzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rsi
	lea	esi, [rcx - 1]
	tzcnt	rbx, qword ptr [r15 + 8*rsi]
	add	rbx, rdx
	mov	edx, ecx
	add	ecx, 4
	cmp	rdx, r13
	jb	.LBB0_13
# BB#14:                                #   in Loop: Header=BB0_12 Depth=1
	mov	ecx, 4
	xor	edx, edx
	.p2align	4, 0x90
.LBB0_15:                               #   Parent Loop BB0_12 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	tzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rbx
	lea	esi, [rcx - 3]
	tzcnt	rsi, qword ptr [r15 + 8*rsi]
	add	rsi, rdx
	lea	edx, [rcx - 2]
	tzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rsi
	lea	esi, [rcx - 1]
	tzcnt	rbx, qword ptr [r15 + 8*rsi]
	add	rbx, rdx
	mov	edx, ecx
	add	ecx, 4
	cmp	rdx, r13
	jb	.LBB0_15

[code stripped]

.LBB0_22:                               #   Parent Loop BB0_21 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	tzcnt	rdx, qword ptr [r15 + 8*rcx]
	add	rdx, rbx
	tzcnt	rsi, qword ptr [r15 + 8*rcx + 8]
	add	rsi, rdx
	tzcnt	rdx, qword ptr [r15 + 8*rcx + 16]
	add	rdx, rsi
	tzcnt	rsi, qword ptr [r15 + 8*rcx + 24]
	add	rsi, rdx
	tzcnt	rdx, qword ptr [r15 + 8*rcx + 32]
	add	rdx, rsi
	tzcnt	rsi, qword ptr [r15 + 8*rcx + 40]
	add	rsi, rdx
	tzcnt	rdx, qword ptr [r15 + 8*rcx + 48]
	add	rdx, rsi
	tzcnt	rbx, qword ptr [r15 + 8*rcx + 56]
	add	rbx, rdx
	add	rcx, 8
	cmp	rcx, r13
	jb	.LBB0_22

[code stripped]

Disassembly (GCC)

[code stripped]

.L8:
	xor	edi, edi
	tzcnt	rdi, QWORD PTR 0[rbp+rax*8]
	lea	eax, 1[rdx]
	xor	esi, esi
	lea	ecx, 2[rdx]
	tzcnt	rax, QWORD PTR 0[rbp+rax*8]
	tzcnt	rsi, QWORD PTR 0[rbp+rcx*8]
	lea	ecx, 3[rdx]
	tzcnt	rcx, QWORD PTR 0[rbp+rcx*8]
	add	rax, rdi
	add	rax, rsi
	add	rax, rcx
	add	r14, rax
	lea	eax, 4[rdx]
	mov	rdx, rax
	cmp	rax, r12
	jb	.L8

[code stripped]

.L13:
	xor	edi, edi
	xor	eax, eax
	tzcnt	rdi, QWORD PTR [rdx]
	xor	esi, esi
	tzcnt	rax, QWORD PTR 8[rdx]
	xor	ecx, ecx
	add	rdx, 32
	tzcnt	rsi, QWORD PTR -16[rdx]
	tzcnt	rcx, QWORD PTR -8[rdx]
	add	rax, rdi
	add	rax, rsi
	add	rax, rcx
	add	r13, rax
	cmp	rdx, r8
	jne	.L13

[code stripped]

In the case of Clang, both loops produce similar performance, while with GCC the second loop is much faster than the first one. I guess both compilers can be improved a bit in this case.

@adhokshajmishra
Mannequin Author

adhokshajmishra mannequin commented Jul 21, 2017

Test code (LZCNT): replace _tzcnt_u64 with _lzcnt_u64

Output

./lzcnt_clang 1

unsigned 1306770000 0.502319 sec 20.8747 GB/s
uint64_t 1306770000 0.461493 sec 22.7214 GB/s

./lzcnt_gcc 1

unsigned 1306770000 0.543904 sec 19.2787 GB/s
uint64_t 1306770000 0.426979 sec 24.558 GB/s

Again, Clang is generating code with consistent performance which falls somewhere in the middle of the numbers produced by the g++ code.

Disassembly (Clang)

[code stripped]
.LBB0_13:                               #   Parent Loop BB0_12 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	lzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rbx
	lea	esi, [rcx - 3]
	lzcnt	rsi, qword ptr [r15 + 8*rsi]
	add	rsi, rdx
	lea	edx, [rcx - 2]
	lzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rsi
	lea	esi, [rcx - 1]
	lzcnt	rbx, qword ptr [r15 + 8*rsi]
	add	rbx, rdx
	mov	edx, ecx
	add	ecx, 4
	cmp	rdx, r13
	jb	.LBB0_13

# BB#14:                                #   in Loop: Header=BB0_12 Depth=1

mov	ecx, 4
xor	edx, edx
.p2align	4, 0x90

.LBB0_15:                               #   Parent Loop BB0_12 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	lzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rbx
	lea	esi, [rcx - 3]
	lzcnt	rsi, qword ptr [r15 + 8*rsi]
	add	rsi, rdx
	lea	edx, [rcx - 2]
	lzcnt	rdx, qword ptr [r15 + 8*rdx]
	add	rdx, rsi
	lea	esi, [rcx - 1]
	lzcnt	rbx, qword ptr [r15 + 8*rsi]
	add	rbx, rdx
	mov	edx, ecx
	add	ecx, 4
	cmp	rdx, r13
	jb	.LBB0_15
[code stripped]
.LBB0_22:                               #   Parent Loop BB0_21 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	lzcnt	rdx, qword ptr [r15 + 8*rcx]
	add	rdx, rbx
	lzcnt	rsi, qword ptr [r15 + 8*rcx + 8]
	add	rsi, rdx
	lzcnt	rdx, qword ptr [r15 + 8*rcx + 16]
	add	rdx, rsi
	lzcnt	rsi, qword ptr [r15 + 8*rcx + 24]
	add	rsi, rdx
	lzcnt	rdx, qword ptr [r15 + 8*rcx + 32]
	add	rdx, rsi
	lzcnt	rsi, qword ptr [r15 + 8*rcx + 40]
	add	rsi, rdx
	lzcnt	rdx, qword ptr [r15 + 8*rcx + 48]
	add	rdx, rsi
	lzcnt	rbx, qword ptr [r15 + 8*rcx + 56]
	add	rbx, rdx
	add	rcx, 8
	cmp	rcx, r13
	jb	.LBB0_22
[code stripped]

Disassembly (GCC)

[code stripped]
.L8:
	xor	edi, edi
	lzcnt	rdi, QWORD PTR 0[rbp+rax*8]
	lea	eax, 1[rdx]
	lea	ecx, 2[rdx]
	lzcnt	rax, QWORD PTR 0[rbp+rax*8]
	add	rax, rdi
	xor	esi, esi
	lzcnt	rsi, QWORD PTR 0[rbp+rcx*8]
	lea	ecx, 3[rdx]
	add	rax, rsi
	lzcnt	rcx, QWORD PTR 0[rbp+rcx*8]
	add	rax, rcx
	add	r14, rax
	lea	eax, 4[rdx]
	mov	rdx, rax
	cmp	rax, r12
	jb	.L8
[code stripped]
.L13:
	xor	edi, edi
	xor	eax, eax
	lzcnt	rdi, QWORD PTR [rdx]
	xor	esi, esi
	lzcnt	rax, QWORD PTR 8[rdx]
	add	rax, rdi
	lzcnt	rsi, QWORD PTR 16[rdx]
	xor	ecx, ecx
	add	rax, rsi
	lzcnt	rcx, QWORD PTR 24[rdx]
	add	rdx, 32
	add	rax, rcx
	add	r13, rax
	cmp	rdx, r8
	jne	.L13
[code stripped]

@llvmbot
Member

llvmbot commented Jul 21, 2017

> Adding Dave Kreitzer and Asaf Badouh. There had been a previous patch
> (https://reviews.llvm.org/D17289) to work around this issue, but that review
> stalled and I don't know what happened after that.
>
> Also moving this to the X86 library.

AFAIK, my comments in https://reviews.llvm.org/D17289 reflect the latest status/plans:

The best/cheapest way to fix the popcnt false dependence is via biased register selection in the RA such that the false dependence is "masked" by a true dependence. That won't work in this case, though, since all the input registers to the popcnt instructions interfere with the outputs.

Failing that, we should conditionally generate an xor to break the dependence. The logic for deciding when it is profitable to generate the xor already exists in the ExecutionDepsFix pass for instructions like X86::SQRTSDr. So the plan was to enhance that pass to catch this popcnt case.
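The profitability test described above can be sketched roughly like this (an illustrative model only, not LLVM's actual ExecutionDepsFix code; the data layout, names, and the 16-instruction threshold are invented for the example):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One entry per instruction: the set of registers that instruction writes.
using RegSet = std::set<std::string>;

// Hypothetical clearance check: if destReg was written within the last
// minClearance instructions, the old value may still be in flight, so it
// would be profitable to insert "xor destReg, destReg" before the popcnt.
bool needsDependencyBreak(const std::vector<RegSet>& writes, size_t idx,
                          const std::string& destReg,
                          size_t minClearance = 16) {
    for (size_t back = 1; back <= minClearance && back <= idx; ++back) {
        if (writes[idx - back].count(destReg))
            return true;  // recent write: not enough clearance, break the dep
    }
    return false;  // enough clearance (or block entry): leave the code alone
}
```

In the popcnt loop from this report the destination is rewritten every couple of instructions, so a check like this would fire on each popcnt, which is consistent with GCC's xor-zeroing helping there.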

At one point, this was on Marina's to-do, so I am adding her.

@topperc
Collaborator

topperc commented Jul 21, 2017

Looks like the execution dependency fix pass currently only wants to deal with VR128 registers and their superclasses. Not sure if we should add another copy for GR32 or teach its constructor to take a list of register classes. Or maybe the partial/undef register parts should be separated from the execution domain?

@llvmbot
Member

llvmbot commented Jul 21, 2017

> Looks like the execution dependency fix pass currently only wants to deal
> with VR128 registers and their superclasses. Not sure if we should add
> another copy for GR32 or teach its constructor to take a list of register
> classes. Or maybe the partial/undef register parts should be separated from
> the execution domain?

Separating the false dependence avoidance parts from the execution domain parts seems like a good idea to me. But I admit I haven't studied the details of the pass enough to know what we are saving by combining the two.

It is also worth pointing out that the X86FixupBWInsts pass plays a role here. If we were to separate out the false-dependence avoidance functionality, it would be good to fold in the X86FixupBWInsts functionality, since it is trying to accomplish the same thing.

@jlebar
Member

jlebar commented Oct 13, 2017

*** Bug llvm/llvm-bugzilla-archive#34936 has been marked as a duplicate of this bug. ***

@chandlerc
Member

FYI, we may be hitting this issue with LZCNT these days. Any progress on resolving this?

@atdt

atdt commented Oct 23, 2017

Stand-alone test case showing regression for -march=haswell
Attached an additional, stand-alone test case showing a performance regression when the same code is compiled for Haswell vs. older microarchitectures (like corei7).

@llvmbot
Member

llvmbot commented Oct 25, 2017

I've started working on a solution.

Basically, the infrastructure is there: ExecutionDepsFix has both the ability to add an XOR when there is not enough clearance (i.e., not enough instructions since the last write to the register) and the ability to hide a false dependency behind a true dependency when possible.
As mentioned, it is currently only adjusted to VR128XRegClass, but after playing around with it, it can be adjusted to other register classes too without too much trouble.

Another suggestion was breaking ExecutionDepsFix into 2 passes: one that handles execution domains and one that handles false dependencies.
These 2 problems are different, and the only logic in this pass that combines them is the way we traverse the basic blocks (and even there, there is a bit of difference between the 2). However, even if we separate them, the "break false deps" part cannot be lowered into X86FixupBWInsts, because this logic is not target specific.

In the long run it is probably better to separate the 2 passes, even if we cannot combine it with X86FixupBWInsts. Do you agree?

@topperc
Collaborator

topperc commented Nov 1, 2017

I think splitting it into a separate pass makes sense.

Do any other targets implement getPartialRegUpdateClearance/getUndefRegUpdateClearance? If so, those targets will also need to run the new pass, which would mean we couldn't merge it into FixupBWInsts anyway.

@llvmbot
Member

llvmbot commented Nov 1, 2017

As far as I can see, ARM implements it in ARMBaseInstrInfo::getPartialRegUpdateClearance().

Regarding getUndefRegClearance(), it seems only the x86 target implements it.

@llvmbot
Member

llvmbot commented Nov 14, 2017

Running IACA over the uploaded lzcnt_gcc.s and lzcnt_clang.s gives the following results:

Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - ./lzcnt.gcc
Binary Format - 64Bit
Architecture - SKX
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 4.71 Cycles Throughput Bottleneck: FrontEnd

Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - lzcnt.clang
Binary Format - 64Bit
Architecture - SKX
Analysis Type - Throughput

Throughput Analysis Report

Block Throughput: 4.24 Cycles Throughput Bottleneck: FrontEnd

Is IACA not considering the false dependency in its computations?

Uploading the detailed results.

@llvmbot
Member

llvmbot commented Nov 14, 2017

IACA results

@llvmbot
Member

llvmbot commented Nov 14, 2017

My patch is almost done; it handles popcnt, lzcnt, and tzcnt.
It will be uploaded in the next few days.

@llvmbot
Member

llvmbot commented Nov 22, 2017

I've uploaded the following reviews:

https://reviews.llvm.org/D40330 - Separate ExecutionDepsFix into 3 parts
https://reviews.llvm.org/D40331 - clang-format ExecutionDepsFix
https://reviews.llvm.org/D40332 - Rename ExecutionDepsFix file to ExecutionDomainFix
https://reviews.llvm.org/D40333 - Separate LoopTraversal and BreakFalseDeps out of ExecutionDomainFix into their own files
https://reviews.llvm.org/D40334 - [X86] Break false dependencies for POPCNT, LZCNT, TZCNT

These reviews are built one on top of the other and try to keep the logic separation and file renaming readable.
Most of the changes are in the first and last reviews; the 3 middle reviews are mostly cosmetic.

@atdt

atdt commented Dec 6, 2017

Thank you for your work on this issue, Marina! I look forward to seeing it land.

@llvmbot
Member

llvmbot commented Jan 22, 2018

Committed a fix in rL323096.

@atdt

atdt commented Mar 20, 2018

As far as I can tell, clang still does not break the dependency in the reproduction case I attached in comment 14. Minimally:

#include <cstdint>
#include <x86intrin.h>

__attribute__((noinline))
int msb(uint64_t n) {
	return 63 ^ __builtin_clzll(n);
}

clang version 7.0.0 (trunk 327823), -O2 -march=haswell:

lzcnt rax, rdi
xor eax, 63
ret

g++ 8.0.1 20180319, -O2 -march=haswell:

xor eax, eax
lzcnt rax, rdi
xor eax, 63
ret

https://godbolt.org/g/JC57Ri

The failure to break the dependency chain causes a measurable degradation in performance when the function is called in a loop. I tested on one Haswell machine and one Broadwell machine.

Worse, clang is shooting itself in the foot: if you compile the same code but target an older microarchitecture with no lzcnt (-march=corei7, for example), clang emits a bsr instruction instead, which doesn't appear to suffer from this false-dependency issue.
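As a side note, the `63 ^ __builtin_clzll(n)` identity in the reproducer can be checked in isolation: since `__builtin_clzll(n) <= 63` for nonzero `n`, xor-ing with 63 (all six low bits set) equals subtracting from 63. A minimal self-contained sketch (`msb_reference` is an invented helper for comparison, not from the report):

```cpp
#include <cassert>
#include <cstdint>

// Index of the most significant set bit; n must be nonzero.
// 63 ^ clz == 63 - clz because clz fits in 6 bits and 63 is 0b111111.
int msb(uint64_t n) { return 63 ^ __builtin_clzll(n); }

int msb_reference(uint64_t n) {  // straightforward loop, for comparison only
    int i = 63;
    while (!(n >> i)) --i;
    return i;
}
```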

@topperc
Collaborator

topperc commented Mar 20, 2018

I'll look into this, but why are you preventing the inlining of the function?

@llvmbot
Member

llvmbot commented Mar 21, 2018

We have an issue with breaking false dependencies on function entry/calls (well, basically we don't do that, while other compilers do). What you described maps to this, not to an lzcnt-specific issue.
There was a review by Keno a year ago that was supposed to solve this:
https://reviews.llvm.org/D28786

@atdt

atdt commented Apr 4, 2018

> I'll look into this, but why are you preventing the inlining of the function?

Only for the sake of having a minimal reproduction case. We've run into this issue in normal code that is not forcing or preventing inlining.

@atdt

atdt commented Apr 4, 2018

> We have an issue with breaking false dependencies on function entry/calls
> (well, basically we don't do that, while other compilers do). What you
> described maps to this, not to an lzcnt-specific issue.
> There was a review by Keno a year ago that was supposed to solve this:
> https://reviews.llvm.org/D28786

Thanks for the explanation. Would you like me to file a separate bug for that?

@adhokshajmishra
Mannequin Author

adhokshajmishra mannequin commented Jan 17, 2019

I just tested this with clang 7.0.1, and the compiler is still generating suboptimal code for TZCNT and LZCNT. A separate bug report has been filed for POPCNT, where it generates suboptimal vectorized code on an Intel(R) Core(TM) i7-6700HQ CPU. Please refer to #40346.

@jlebar
Member

jlebar commented Nov 26, 2021

mentioned in issue llvm/llvm-bugzilla-archive#34936
