Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression of array::IntoIter vs slice::Iter #115339

Closed
DaniPopes opened this issue Aug 29, 2023 · 6 comments · Fixed by #115515
Closed

Performance regression of array::IntoIter vs slice::Iter #115339

DaniPopes opened this issue Aug 29, 2023 · 6 comments · Fixed by #115515
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@DaniPopes
Copy link
Contributor

DaniPopes commented Aug 29, 2023

Code

I tried this code:

use core::iter;

pub fn array(mut a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    iter::zip(&mut a, b).for_each(|(a, b)| *a |= b);
    a
}

pub fn slice(mut a: [u8; 32], b: [u8; 32]) -> [u8; 32] {
    iter::zip(&mut a, &b).for_each(|(a, b)| *a |= *b);
    a
}

I expected to see this happen: same codegen or similar in performance

Instead, this happened: array has way worse codegen

godbolt

Assembly

example::array:
        vmovdqu xmm1, xmmword ptr [rdx]
        vmovdqu xmm0, xmmword ptr [rdx + 16]
        mov     rax, rdi
        vmovd   ecx, xmm1
        or      byte ptr [rsi], cl
        vpextrb ecx, xmm1, 1
        or      byte ptr [rsi + 1], cl
        vpextrb ecx, xmm1, 2
        or      byte ptr [rsi + 2], cl
        vpextrb ecx, xmm1, 3
        or      byte ptr [rsi + 3], cl
        vpextrb ecx, xmm1, 4
        or      byte ptr [rsi + 4], cl
        vpextrb ecx, xmm1, 5
        or      byte ptr [rsi + 5], cl
        vpextrb ecx, xmm1, 6
        or      byte ptr [rsi + 6], cl
        vpextrb ecx, xmm1, 7
        or      byte ptr [rsi + 7], cl
        vpextrb ecx, xmm1, 8
        or      byte ptr [rsi + 8], cl
        vpextrb ecx, xmm1, 9
        or      byte ptr [rsi + 9], cl
        vpextrb ecx, xmm1, 10
        or      byte ptr [rsi + 10], cl
        vpextrb ecx, xmm1, 11
        or      byte ptr [rsi + 11], cl
        vpextrb ecx, xmm1, 12
        or      byte ptr [rsi + 12], cl
        vpextrb ecx, xmm1, 13
        or      byte ptr [rsi + 13], cl
        vpextrb ecx, xmm1, 14
        or      byte ptr [rsi + 14], cl
        vpextrb ecx, xmm1, 15
        or      byte ptr [rsi + 15], cl
        vmovd   ecx, xmm0
        or      byte ptr [rsi + 16], cl
        vpextrb ecx, xmm0, 1
        or      byte ptr [rsi + 17], cl
        vpextrb ecx, xmm0, 2
        or      byte ptr [rsi + 18], cl
        vpextrb ecx, xmm0, 3
        or      byte ptr [rsi + 19], cl
        vpextrb ecx, xmm0, 4
        or      byte ptr [rsi + 20], cl
        vpextrb ecx, xmm0, 5
        or      byte ptr [rsi + 21], cl
        vpextrb ecx, xmm0, 6
        or      byte ptr [rsi + 22], cl
        vpextrb ecx, xmm0, 7
        or      byte ptr [rsi + 23], cl
        vpextrb ecx, xmm0, 8
        or      byte ptr [rsi + 24], cl
        vpextrb ecx, xmm0, 9
        or      byte ptr [rsi + 25], cl
        vpextrb ecx, xmm0, 10
        or      byte ptr [rsi + 26], cl
        vpextrb ecx, xmm0, 11
        or      byte ptr [rsi + 27], cl
        vpextrb ecx, xmm0, 12
        or      byte ptr [rsi + 28], cl
        vpextrb ecx, xmm0, 13
        or      byte ptr [rsi + 29], cl
        vpextrb ecx, xmm0, 14
        or      byte ptr [rsi + 30], cl
        vpextrb ecx, xmm0, 15
        or      byte ptr [rsi + 31], cl
        vmovups ymm0, ymmword ptr [rsi]
        vmovups ymmword ptr [rdi], ymm0
        vzeroupper
        ret

example::slice:
        vmovups ymm0, ymmword ptr [rsi]
        mov     rax, rdi
        vorps   ymm0, ymm0, ymmword ptr [rdx]
        vmovups ymmword ptr [rsi], ymm0
        vmovups ymmword ptr [rdi], ymm0
        vzeroupper
        ret

Version it worked on

It most recently worked on: 1.64

Version with regression

1.65 till now.

Regressed in nightly-2022-08-13, maybe LLVM 15 #99464?:

fetching (via local git) commits from 20ffea6938b5839c390252e07940b99e3b6a889a to f22819bcce4abaff7d1246a56eec493418f9f4ee

found 8 bors merge commits in the specified range
  commit[0] 2022-08-11: Auto merge of #100416 - Dylan-DPC:rollup-m344lh1, r=Dylan-DPC
  commit[1] 2022-08-11: Auto merge of #100426 - matthiaskrgr:rollup-0ks4dou, r=matthiaskrgr
  commit[2] 2022-08-12: Auto merge of #100419 - flip1995:clippyup, r=Manishearth
  commit[3] 2022-08-12: Auto merge of #99464 - nikic:llvm-15, r=cuviper
  commit[4] 2022-08-12: Auto merge of #100435 - ehuss:update-cargo, r=ehuss
  commit[5] 2022-08-12: Auto merge of #99624 - vincenzopalazzo:macros/unix_error, r=Amanieu
  commit[6] 2022-08-12: Auto merge of #100328 - davidtwco:perf-implications, r=nnethercote
  commit[7] 2022-08-12: Auto merge of #100456 - Dylan-DPC:rollup-fn17z9f, r=Dylan-DPC

rustc --version --verbose:

rustc 1.74.0-nightly (8550f15e1 2023-08-27)
binary: rustc
commit-hash: 8550f15e148407159af401e02b1d9259762b3496
commit-date: 2023-08-27
host: x86_64-unknown-linux-gnu
release: 1.74.0-nightly
LLVM version: 17.0.0

Backtrace

Backtrace

N/A

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

@DaniPopes DaniPopes added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Aug 29, 2023
@rustbot rustbot added needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. I-prioritize Issue: Indicates that prioritization has been requested for this issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Aug 29, 2023
@apiraino
Copy link
Contributor

WG-prioritization assigning priority (Zulip discussion).

cc @nikic since this regressed in the LLVM 15 update

@rustbot label -I-prioritize +P-medium +T-compiler

@rustbot rustbot added P-medium Medium priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Aug 29, 2023
@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Aug 29, 2023
@the8472
Copy link
Member

the8472 commented Aug 29, 2023

I've seen llvm doing these strange 1-byte-at-a-time "vectorizations" in other places too. This might be a more general problem.

In this case we can probably paper over it by implementing TrustedRandomAccess for the array iter.

@saethlin saethlin removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Aug 29, 2023
@scottmcm
Copy link
Member

Unfortunately, the array iterator is fundamentally worse right now, because indexing into it -- a fundamental part of being a by-value array iterator -- keeps it from SRoAing and thus doesn't optimize well.

This is why when I was making array::map better I ended up making https://github.com/rust-lang/rust/pull/107634/files#diff-d391813c2568a5afd0555e6224d5168993b6ada448c4f1397799cb98b5b8a18b to not have to use array::IntoIter.

Sadly, for the foreseeable future you're better off using the "weaker" slice iterator when you can.

@DianQK
Copy link
Member

DianQK commented Sep 5, 2023

My guess is that the SROA is missing something in the opaque pointer mode.

@rustbot claim

@DianQK
Copy link
Member

DianQK commented Sep 8, 2023

It's not about vectorization. This optimization can be restored by using -Cllvm-args=-opaque-pointers=0. But highly unrecommended. https://godbolt.org/z/GqdPGarsn

Upstream issue: llvm/llvm-project#65763.

@DianQK DianQK removed their assignment Sep 11, 2023
@DianQK
Copy link
Member

DianQK commented Sep 20, 2023

The upstream issue is closed.
@rustbot claim

@Noratrieb Noratrieb added the llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes label Sep 20, 2023
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 6, 2023
optimize zipping over array iterators

Fixes rust-lang#115339 (somewhat)

the new assembly:

```asm
zip_arrays:
        .cfi_startproc
        vmovups (%rdx), %ymm0
        leaq    32(%rsi), %rcx
        vxorps  %xmm1, %xmm1, %xmm1
        vmovups %xmm1, -24(%rsp)
        movq    $0, -8(%rsp)
        movq    %rsi, -88(%rsp)
        movq    %rdi, %rax
        movq    %rcx, -80(%rsp)
        vmovups %ymm0, -72(%rsp)
        movq    $0, -40(%rsp)
        movq    $32, -32(%rsp)
        movq    -24(%rsp), %rcx
        vmovups (%rsi,%rcx), %ymm0
        vorps   -72(%rsp,%rcx), %ymm0, %ymm0
        vmovups %ymm0, (%rsi,%rcx)
        vmovups (%rsi), %ymm0
        vmovups %ymm0, (%rdi)
        vzeroupper
        retq
```

This is still longer than the slice version given in the issue but at least it eliminates the terrible  `vpextrb`/`orb` chain. I guess this is due to excessive memcpys again (haven't looked at the llvmir)?

The `TrustedLen` specialization is a drive-by change since I had to do something for the default impl anyway to be able to specialize the `TrustedRandomAccessNoCoerce` impl.
@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023
@bors bors closed this as completed in 0d410be Oct 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants