-
Notifications
You must be signed in to change notification settings - Fork 12.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suboptimal code generation for __builtin_ctz(ll) #34191
Comments
The false dependency was fixed in GCC here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011 Despite the bug talking about popcount, the patch from an Intel engineer also addressed CTZ, and the patch applies for Sandy Bridge and Haswell micro-architectures (and the generic target). I checked and with -march=haswell Clang seems to generate a tzcnt without the false dependency elimination. So there's a 2nd optimization opportunity here. |
There's discussion about the popcnt hardware bug and possible work-arounds in #33216. |
mentioned in issue llvm/llvm-bugzilla-archive#36881 |
Candidate Patch: https://reviews.llvm.org/D130956 |
/cherry-pick c2066d1 |
/branch llvm/llvm-project-release-prs/issue34191 |
/pull-request llvm/llvm-project-release-prs#55 |
sounds like this was reverted in 84e9194 |
Turns out that's not quite the case. GCC docs simply state that "the result is undefined" instead of the "the behavior/program is undefined". As long as you're not using the result in the problematic case you're fine. Similarly The |
Given the risk of the first landing, I'm content to let this ride the release trains out to clang-16. Thanks again for the patch and fix, @phoebewang ! |
I've dropped this from the 15.0.0 milestone. It can always be re-added we change our minds. |
Extended Description
Right now, when no specific arch target is set, the builtin
__builtin_ctz (and long, long long variants)
will generate a bsf instruction.
This is suboptimal for AMD machines, which can do a TZCNT much faster than they can do a BSF. Due to the way TZCNT is encoded, it is equal to a REP BSF, so it is in fact "backwards compatible" as long as the different behavior for a 0 is fine. And it is, because __builtin_ctz has undefined behavior for 0 (which is why it can use BSF in the first place).
On Intel hardware, either way is equally fast, so for a generic target it makes sense to deal with the AMD case and encode the intrinsic as REP BSF/TZNCT.
At least GCC 4.8 and later are able to do this optimization and generate a REP BSF for their generic target. Clang fails to do so. (It does generate TZCNT with -march=znver1)
Example snippet:
https://godbolt.org/g/eXU6xf
Of note in this snippet is also that newer GCC adds a XOR ESI, ESI before the REP BSF. So there may be a false dependency issue in some CPUs.
The text was updated successfully, but these errors were encountered: