[X86] movb is extremely expensive on modern processors #62948
Comments
The problem is mischaracterized (and not specific to Zen 3). The issue is that You'll see the same speedup if you insert a dependency-breaking
@amonakov You are right. Inserting Would the solution be for LLVM to emit an xor before emitting movb whenever the previous value of the upper register is not needed? Also, would it be expected that llvm-mca does not report a dependency issue here?
@llvm/issue-subscribers-backend-x86
Out of curiosity, I tested movw+subw. Performance was identical to movb+subb.
That’s not expected. On x86 we model partial register updates, and there are tests for it. If you think there is a problem with it, then I suggest filing a separate mca bug.
So does it happen on Intel?
Is that a bug in silicon? Because the popcnt false dependency was fixed by Intel on newer CPUs. Not fixed in clang yet, see #33216
Lately, I have been trying to micro-optimize a binary search function that operates on 4KB sized arrays. Doing some experiments with loop unrolling yielded three alternative implementations.
https://gcc.godbolt.org/z/79rq7sqcf
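For readers who have not opened the link, a binary search over a 4KB array might look roughly like the sketch below. It is not one of the three variants from the report: the element type (uint32_t, so 1024 elements), the name lower_bound_4k, and the lower-bound semantics are all assumptions made for illustration, and the loop is left un-unrolled.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the 4KB array holds 32-bit keys, i.e. 4096 / 4 = 1024 elements. */
#define N_ELEMS 1024

/* Hypothetical helper: returns the index of the first element >= key,
 * or N_ELEMS if every element is smaller. arr must be sorted ascending. */
static size_t lower_bound_4k(const uint32_t *arr, uint32_t key)
{
    const uint32_t *base = arr;
    size_t len = N_ELEMS;

    /* Halve the candidate range until a single element remains. */
    while (len > 1) {
        size_t half = len / 2;
        /* Skip the lower half when its last element is still below key. */
        if (base[half - 1] < key)
            base += half;
        len -= half;
    }

    /* One final comparison decides between base and the slot after it. */
    return (size_t)(base - arr) + (base[0] < key);
}
```

The fixed trip count (ten iterations for 1024 elements) is what makes a loop like this a candidate for full unrolling. The three implementations at the link are the ones measured in the rest of this report.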
They all should perform similarly, yet on my Ryzen 7 5800X, they do not:
I apply the following patch to the assembly code. Then assemble and link the result:
Now, when I rerun the micro-benchmark, all 3 perform similarly:
If I change the patch to only change subb to subl, the performance remains unchanged.
If I compile with GCC 12.2, I see an even better execution time on the last one:
That is unsurprising, considering that GCC emits fewer instructions for the switch statement body in v3, while it emits many more in v1 and v2. Here is the first case statement from GCC for v3:
And the first case statement from LLVM for v3:
In any case, passing -mtune=znver3 does not stop LLVM from using movb and subb. Whatever optimization pass is opportunistically lowering operations to byte operations should be made to stop doing that on AMD64.
Slightly smaller code size is not worth it when it kills performance on a popular AMD64 processor family. In this case, using movb saves 3 bytes, while subb and subl are the same number of bytes.