Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal code generation for __builtin_ctz(ll) #34191

Closed
gcp mannequin opened this issue Oct 5, 2017 · 11 comments
Closed

Suboptimal code generation for __builtin_ctz(ll) #34191

gcp mannequin opened this issue Oct 5, 2017 · 11 comments
Assignees
Labels
backend:X86 bugzilla Issues migrated from bugzilla mc Machine (object) code

Comments

@gcp
Copy link
Mannequin

gcp mannequin commented Oct 5, 2017

Bugzilla Link 34843
Version 5.0
OS Linux
CC @topperc,@RKSimon,@rotateright

Extended Description

Right now, when no specific arch target is set, the builtin

__builtin_ctz (and long, long long variants)

will generate a bsf instruction.

This is suboptimal for AMD machines, which can do a TZCNT much faster than they can do a BSF. Due to the way TZCNT is encoded, it is equal to a REP BSF, so it is in fact "backwards compatible" as long as the different behavior for a 0 is fine. And it is, because __builtin_ctz has undefined behavior for 0 (which is why it can use BSF in the first place).

On Intel hardware, either way is equally fast, so for a generic target it makes sense to deal with the AMD case and encode the intrinsic as REP BSF/TZNCT.

At least GCC 4.8 and later are able to do this optimization and generate a REP BSF for their generic target. Clang fails to do so. (It does generate TZCNT with -march=znver1)

Example snippet:
https://godbolt.org/g/eXU6xf

Of note in this snippet is also that newer GCC adds a XOR ESI, ESI before the REP BSF. So there may be a false dependency issue in some CPUs.

@gcp
Copy link
Mannequin Author

gcp mannequin commented Oct 5, 2017

The false dependency was fixed in GCC here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Despite the bug talking about popcount, the patch from an Intel engineer also addressed CTZ, and the patch applies for Sandy Bridge and Haswell micro-architectures (and the generic target).

I checked and with -march=haswell Clang seems to generate a tzcnt without the false dependency elimination. So there's a 2nd optimization opportunity here.

@rotateright
Copy link
Contributor

rotateright commented Oct 5, 2017

There's discussion about the popcnt hardware bug and possible work-arounds in #33216.

@topperc
Copy link
Collaborator

topperc commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#36881

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@RKSimon
Copy link
Collaborator

RKSimon commented Aug 2, 2022

Candidate Patch: https://reviews.llvm.org/D130956

@EugeneZelenko EugeneZelenko added the mc Machine (object) code label Aug 3, 2022
@nickdesaulniers
Copy link
Member

/cherry-pick c2066d1

@llvmbot
Copy link
Member

llvmbot commented Aug 3, 2022

/branch llvm/llvm-project-release-prs/issue34191

@llvmbot
Copy link
Member

llvmbot commented Aug 3, 2022

/pull-request llvm/llvm-project-release-prs#55

@nikic nikic moved this to Needs Triage in LLVM Release Status Aug 3, 2022
@nikic nikic moved this from Needs Triage to Needs Review in LLVM Release Status Aug 3, 2022
@nickdesaulniers
Copy link
Member

sounds like this was reverted in 84e9194

@aaronpuchert
Copy link
Member

aaronpuchert commented Aug 4, 2022

[...] __builtin_ctz has undefined behavior for 0 (which is why it can use BSF in the first place).

Turns out that's not quite the case. GCC docs simply state that "the result is undefined" instead of the "the behavior/program is undefined". As long as you're not using the result in the problematic case you're fine. Similarly llvm.cttz.* is specified to return poison for zero input.

The bsf instruction similarly leaves the destination operand undefined, but still sets ZF to indicate zero input. One might want to use this flag to overwrite the undefined result in the zero input case. But here it gets tricky, because tzcnt sets CF instead, so if we always emit tzcnt without knowing the machine that will run the code, we don't know which flag to look at.

@nickdesaulniers
Copy link
Member

Given the risk of the first landing, I'm content to let this ride the release trains out to clang-16. Thanks again for the patch and fix, @phoebewang !

@tstellar tstellar removed this from the LLVM 15.0.0 Release milestone Aug 8, 2022
@tstellar
Copy link
Collaborator

tstellar commented Aug 8, 2022

I've dropped this from the 15.0.0 milestone. It can always be re-added we change our minds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla mc Machine (object) code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants