[X86] movb is extremely expensive on modern processors #62948

ryao · 2023-05-26T05:19:11Z

Lately, I have been trying to micro-optimize a binary search function that operates on 4KB sized arrays. Doing some experiments with loop unrolling yielded three alternative implementations.

https://gcc.godbolt.org/z/79rq7sqcf

They all should perform similarly, yet on my Ryzen 7 5800X, they do not:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077567, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|              custom_binary_search_unroll_switch_v1 |       1000 |        980 |       9020 |   0.000220 |
|              custom_binary_search_unroll_switch_v2 |       1000 |        980 |       9020 |   0.000564 |
|              custom_binary_search_unroll_switch_v3 |       1000 |        980 |       9020 |   0.000567 |

I apply the following patch to the assembly code. Then assemble and link the result:

diff --git a/out.s b/out.s
index f42f44c..24d6940 100644
--- a/out.s
+++ b/out.s
@@ -346,8 +346,8 @@ custom_binary_search_unroll_switch_v2:  # @custom_binary_search_unroll_switch_v2
        bsrl    %eax, %r8d
        movl    %r8d, %eax
        xorl    $31, %eax
-       movb    $30, %cl
-       subb    %al, %cl
+       movl    $30, %ecx
+       subl    %eax, %ecx
        movl    $1, %eax
        shll    %cl, %eax
        leal    -1(%rax), %ecx
@@ -481,8 +481,8 @@ custom_binary_search_unroll_switch_v3:  # @custom_binary_search_unroll_switch_v3
        bsrl    %eax, %eax
        movl    %eax, %r8d
        xorl    $31, %r8d
-       movb    $30, %cl
-       subb    %r8b, %cl
+       movl    $30, %ecx
+       subl    %r8d, %ecx
        movl    $1, %r8d
        shll    %cl, %r8d
        movl    %esi, %r9d

Now, when I rerun the micro-benchmark, all 3 perform similarly:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077784, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|              custom_binary_search_unroll_switch_v1 |       1000 |        945 |       9055 |   0.000224 |
|              custom_binary_search_unroll_switch_v2 |       1000 |        945 |       9055 |   0.000206 |
|              custom_binary_search_unroll_switch_v3 |       1000 |        945 |       9055 |   0.000202 |

If I change the patch to only change subb to subl, the performance remains unchanged.

If I compile with GCC 12.2, I see an even better execution time on the last one:

Benchmark: array size: 1024, runs: 1000, repetitions: 10000, seed: 1685077837, density: 10

Even distribution with 1024 32 bit integers, random access

|                                               Name |      Items |       Hits |     Misses |       Time |
|                                         ---------- | ---------- | ---------- | ---------- | ---------- |
|              custom_binary_search_unroll_switch_v1 |       1000 |        972 |       9028 |   0.000320 |
|              custom_binary_search_unroll_switch_v2 |       1000 |        972 |       9028 |   0.000340 |
|              custom_binary_search_unroll_switch_v3 |       1000 |        972 |       9028 |   0.000178 |

That is unsuprising, considering that GCC emits fewer instructions for the switch statement body in v3, while it emits many more in v1 and v2. Here is the first case statement from GCC for v3:

.L54:
        leal    512(%rdx), %ecx
        cmpl    %esi, (%rdi,%rcx,4)
        cmovle  %ecx, %edx

And the first case statement from LLVM for v3:

        leal    512(%rcx), %eax
        cmpl    %edx, (%rdi,%rax,4)
        cmovgl  %ecx, %eax
        movl    %eax, %ecx

In any case, passing -mtune=znver3 does not stop LLVM from using movb and subb. Whatever optimization pass is opportunistically lowering operations to byte operations should be made to stop doing that on AMD64.

Slightly smaller code size is not worth it when it kills performance on a popular AMD64 processor family. In this case, using movb saves 3 bytes, while subb and subl are the same number of bytes.

The text was updated successfully, but these errors were encountered:

amonakov · 2023-05-26T06:40:50Z

The problem is mischaracterized (and not specific to Zen 3). The issue is that movb $30, %cl is not a dependency-breaking instruction (AMD and recent Intels do not rename low 8-bit partial registers separately, and since the unstruction keeps bits 8-31 of %ecx unmodiifed, this creates a dependency on the previous value of %ecx). Since the movb is heading a critical dependency chain, the issue is very noticeable.

You'll see the same speedup if you insert a dependency-breaking xor %ecx, %ecx before the offending movb.

ryao · 2023-05-26T07:30:13Z

@amonakov You are right. Inserting xor %ecx, %ecx before movb also lets us save 1 byte of space versus using movl.

Would the solution be for LLVM to emit an xor before emitting movb whenever the previous value of the upper register is not needed?

Also, would it be expected that llvm-mca does not report a dependency issue here?

llvmbot · 2023-05-26T08:51:13Z

@llvm/issue-subscribers-backend-x86

ryao · 2023-05-26T16:14:13Z

Out of curiosity, I tested movw+subw. Performance was identical to movb+subb.

adibiagio · 2023-05-29T12:48:02Z

@amonakov You are right. Inserting xor %ecx, %ecx before movb also lets us save 1 byte of space versus using movl.

Also, would it be expected that llvm-mca does not report a dependency issue here?

That’s not expected. On x86 we model partial register updates, and there are tests for it. If you think that there is a problem with it, then I suggest to file a separate mca bug.

ValeZAA · 2023-06-11T17:02:36Z

smaller code size is not worth it when it kills performance on a popular AMD64 processor family.

So does it happen on Intel?

AMD and recent Intels do not rename low 8-bit partial registers separately

Is that a bug in silicon? Because popcnt false dependency was fixed by Intel on newer CPUs. Not fixed in clang yet, see #33216
Or did Intel make silicon closer to AMD?

github-actions bot added the new issue label May 26, 2023

ryao changed the title ~~Emitting byte operations is extremely expensive on Zen 3~~ [amd64] 8-bit operations are extremely expensive on Zen 3 May 26, 2023

ryao changed the title ~~[amd64] 8-bit operations are extremely expensive on Zen 3~~ [amd64] movb is extremely expensive on Zen 3 May 26, 2023

RKSimon added backend:X86 backend:X86 Scheduler Models Accuracy of X86 scheduler models and removed new issue labels May 26, 2023

RKSimon changed the title ~~[amd64] movb is extremely expensive on Zen 3~~ [X86] movb is extremely expensive on Zen 3 May 26, 2023

RKSimon self-assigned this May 26, 2023

ryao mentioned this issue May 26, 2023

btree: Reduce opportunities for branch mispredictions in binary search openzfs/zfs#14866

Merged

13 tasks

ryao changed the title ~~[X86] movb is extremely expensive on Zen 3~~ [X86] movb is extremely expensive on modern processors May 26, 2023

This was referenced May 27, 2023

Optimization opportunity: ((((a) > (b)) - ((a) < (b))) < 0) -> ((a) < (b)) #62790

Open

Add shar's algorithm scandum/binary_search#6

Open

RKSimon removed their assignment Sep 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[X86] movb is extremely expensive on modern processors #62948

[X86] movb is extremely expensive on modern processors #62948

ryao commented May 26, 2023 •

edited by VoltrexKeyva

Loading

amonakov commented May 26, 2023

ryao commented May 26, 2023 •

edited

Loading

llvmbot commented May 26, 2023

ryao commented May 26, 2023

adibiagio commented May 29, 2023

ValeZAA commented Jun 11, 2023 •

edited

Loading

[X86] movb is extremely expensive on modern processors #62948

[X86] movb is extremely expensive on modern processors #62948

Comments

ryao commented May 26, 2023 • edited by VoltrexKeyva Loading

amonakov commented May 26, 2023

ryao commented May 26, 2023 • edited Loading

llvmbot commented May 26, 2023

ryao commented May 26, 2023

adibiagio commented May 29, 2023

ValeZAA commented Jun 11, 2023 • edited Loading

ryao commented May 26, 2023 •

edited by VoltrexKeyva

Loading

ryao commented May 26, 2023 •

edited

Loading

ValeZAA commented Jun 11, 2023 •

edited

Loading