Suboptimal code generation for __builtin_ctz(ll) #34191

gcp · 2017-10-05T09:55:50Z


Bugzilla Link	34843
Version	5.0
OS	Linux
CC	@topperc,@RKSimon,@rotateright

Extended Description

Right now, when no specific arch target is set, the builtin

__builtin_ctz (and long, long long variants)

will generate a bsf instruction.

This is suboptimal for AMD machines, which can do a TZCNT much faster than they can do a BSF. Due to the way TZCNT is encoded, it is equal to a REP BSF, so it is in fact "backwards compatible" as long as the different behavior for a 0 is fine. And it is, because __builtin_ctz has undefined behavior for 0 (which is why it can use BSF in the first place).
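To make the compatibility argument concrete, here is a minimal software sketch of the two instructions' results on a 32-bit operand. The function names `model_tzcnt32` and `model_bsf32` are hypothetical, and this models only the destination value, not flags or timing. It is not what any compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of TZCNT on a 32-bit operand: number of trailing zero bits,
   with the operand width (32) returned for a zero input. */
unsigned model_tzcnt32(uint32_t x) {
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

/* Model of BSF for nonzero input only: index of the lowest set bit.
   The real instruction leaves its destination undefined for zero,
   so this model rejects that case instead of inventing a value. */
unsigned model_bsf32(uint32_t x) {
    unsigned n = 0;
    assert(x != 0);
    while (((x >> n) & 1u) == 0)
        ++n;
    return n;
}
```

For every nonzero input the two models return the same value, which is the sense in which REP BSF can stand in for BSF here. They differ only at zero, the one input whose result is undefined for `__builtin_ctz`.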

On Intel hardware, either way is equally fast, so for a generic target it makes sense to deal with the AMD case and encode the intrinsic as REP BSF/TZCNT.

At least GCC 4.8 and later are able to do this optimization and generate a REP BSF for their generic target. Clang fails to do so. (It does generate TZCNT with -march=znver1)

Example snippet:
https://godbolt.org/g/eXU6xf

Also of note in this snippet: newer GCC adds an XOR ESI, ESI before the REP BSF, so there may be a false dependency issue on some CPUs.

gcp · 2017-10-05T10:04:13Z

The false dependency was fixed in GCC here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Despite the bug talking about popcount, the patch from an Intel engineer also addressed CTZ, and the patch applies for Sandy Bridge and Haswell micro-architectures (and the generic target).

I checked and with -march=haswell Clang seems to generate a tzcnt without the false dependency elimination. So there's a 2nd optimization opportunity here.

rotateright · 2017-10-05T14:34:12Z

There's discussion about the popcnt hardware bug and possible work-arounds in #33216.

topperc · 2021-11-27T00:29:29Z

mentioned in issue llvm/llvm-bugzilla-archive#36881

RKSimon · 2022-08-02T11:58:10Z

Candidate Patch: https://reviews.llvm.org/D130956

nickdesaulniers · 2022-08-03T16:57:18Z

/cherry-pick c2066d1

llvmbot · 2022-08-03T17:04:35Z

/branch llvm/llvm-project-release-prs/issue34191

llvmbot · 2022-08-03T17:11:40Z

/pull-request llvm/llvm-project-release-prs#55

nickdesaulniers · 2022-08-03T22:00:20Z

sounds like this was reverted in 84e9194

aaronpuchert · 2022-08-04T00:04:42Z

> [...] __builtin_ctz has undefined behavior for 0 (which is why it can use BSF in the first place).

Turns out that's not quite the case. GCC docs simply state that "the result is undefined" instead of "the behavior/program is undefined". As long as you're not using the result in the problematic case, you're fine. Similarly, llvm.cttz.* is specified to return poison for zero input.

The bsf instruction similarly leaves the destination operand undefined, but still sets ZF to indicate zero input. One might want to use this flag to overwrite the undefined result in the zero input case. But here it gets tricky, because tzcnt sets CF instead, so if we always emit tzcnt without knowing the machine that will run the code, we don't know which flag to look at.
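One way to stay clear of the flag question at the source level is to test the input before calling the builtin, so the undefined result is never used. A minimal sketch follows; the helper names `ctz_or_width` and `ctz_nonzero` are hypothetical, and it assumes a compiler that provides `__builtin_ctz` (GCC or Clang).

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: trailing-zero count with a defined result for 0.
   The zero check happens on the input, not on any CPU flag, so it does
   not matter whether the compiler picks BSF or TZCNT. */
unsigned ctz_or_width(unsigned x) {
    if (x == 0)
        return (unsigned)(sizeof x * CHAR_BIT);
    return (unsigned)__builtin_ctz(x);
}

/* Hypothetical variant for callers that guarantee a nonzero input;
   the assert documents that contract in debug builds. */
unsigned ctz_nonzero(unsigned x) {
    assert(x != 0);
    return (unsigned)__builtin_ctz(x);
}
```

Whether the compiler folds the explicit zero check into a flag test or a conditional move is up to the backend and the selected target. The sketch only fixes the source-level semantics.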

nickdesaulniers · 2022-08-08T21:02:27Z

Given the risk of the first landing, I'm content to let this ride the release trains out to clang-16. Thanks again for the patch and fix, @phoebewang !

tstellar · 2022-08-08T22:56:57Z

I've dropped this from the 15.0.0 milestone. It can always be re-added if we change our minds.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

RKSimon mentioned this issue Jul 29, 2022

Odd inline asm code generation with pointless memory operand #56789

Closed

nickdesaulniers assigned phoebewang Aug 2, 2022

phoebewang closed this as completed in c2066d1 Aug 3, 2022

EugeneZelenko added the mc Machine (object) code label Aug 3, 2022

nickdesaulniers reopened this Aug 3, 2022

EugeneZelenko added this to the LLVM 15.0.0 Release milestone Aug 3, 2022

EugeneZelenko added the release:backport label Aug 3, 2022

llvmbot mentioned this issue Aug 3, 2022

PR for llvm/llvm-project#34191 llvm/llvm-project-release-prs#55

Closed

nikic added this to LLVM Release Status Aug 3, 2022

nikic moved this to Needs Triage in LLVM Release Status Aug 3, 2022

nikic moved this from Needs Triage to Needs Review in LLVM Release Status Aug 3, 2022

phoebewang closed this as completed in 7f648d2 Aug 5, 2022

tstellar removed this from LLVM Release Status Aug 8, 2022

tstellar removed this from the LLVM 15.0.0 Release milestone Aug 8, 2022

EugeneZelenko removed the release:backport label Aug 8, 2022
