non-temporal load/store memory routines? #68

Closed
enh-google opened this issue Mar 18, 2024 · 5 comments
@enh-google

we just had this submission to aosp from qcom: https://android-review.googlesource.com/c/platform/bionic/+/3002758

it seems to be your existing string/aarch64/memcpy.S but with an extra "if > 48KiB, use non-temporal loads/stores" case.

i'm guessing that this kind of thing is likely to be problematic because userspace can't easily query the cache size on arm64[1], so we'd need a bunch of different copies of this for different cache sizes? or a global that libc writes and optimized-routines reads?


  1. if i'm wrong, let me know how: i'd love to fix bionic's arm64 sysconf() to match the other architectures!
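(For context, a minimal sketch of what userspace cache-size discovery looks like on Linux. The sysfs path is the standard cacheinfo ABI, but on many arm64 systems firmware does not populate it, so callers must handle failure -- which is exactly the gap described above. The `"48K"`-style parsing is an assumption for illustration, not production code.)

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns the cache size in bytes for the given cpu/index, or -1 if
 * the sysfs entry is missing or unparseable. Assumes the usual "48K"
 * format in the size file; a sketch, not robust parsing. */
long cache_size_bytes(int cpu, int index) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index%d/size", cpu, index);
    FILE *f = fopen(path, "r");
    if (!f) return -1;  /* common on arm64: firmware didn't describe caches */
    char buf[32] = {0};
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    if (!ok) return -1;
    long kib = strtol(buf, NULL, 10);  /* "48K" -> 48 */
    return kib > 0 ? kib * 1024 : -1;
}
```

Calling `cache_size_bytes(0, 0)` returns -1 on systems where the cacheinfo entries are absent, which is the case this thread is concerned with.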
@Wilco1
Contributor

Wilco1 commented Apr 9, 2024

Memcpy is highly tuned for small sizes because the vast majority of copies are small (over 95% are less than 256 bytes). Modern CPUs will automatically prefetch and use write streaming, so there is no benefit in trying to detect the cache size and tuning memcpy that way.

@enh-google
Author

yeah, i don't have the relevant hardware to test this. the equivalent intel routines all have the same hack, though.

i think the real problem is that popular benchmarks like to do large clears/copies. i have no reason to believe these optimizations are meant for anything relevant to actual users, but as long as reviewers use meaningless benchmarks, SoC vendors will chase those meaningless benchmarks.

@Wilco1
Contributor

Wilco1 commented Apr 10, 2024

Yes, those kinds of benchmarks are useless, but we still don't need cache-size hacks like x86. Standard load/store instructions already get the maximum bandwidth across L1, L2, L3 and DRAM.

Why would you make standard loads/stores slower and force people to use specialized instructions?

@enh-google
Author

i'm not trolling -- i genuinely don't know the answer to this -- but why does arm64 have those instructions then? i've personally never seen them used for anything except this (on intel either). and you can't express this except in assembler, right? so even if you had some more general algorithm that wanted to avoid churning the cache with rubbish, you couldn't get the effect without rewriting it in assembler?

or is it just meant for kernel page-clearing use or something?
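(One hedged aside on the "only in assembler" point: Clang does expose `__builtin_nontemporal_store` and `__builtin_nontemporal_load`, so C code can at least request the hint without dropping to assembly -- though whether it actually lowers to aarch64 STNP/LDNP, or to x86 MOVNT, depends on the target and the access width. The copy loop below is an illustrative sketch, not a tuned memcpy.)

```c
#include <stddef.h>
#include <stdint.h>

/* 16-byte vector type so the store width has a chance of mapping to a
 * non-temporal pair/streaming store on the targets that support one. */
typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* Copies n bytes; assumes n is a multiple of 16 and both pointers are
 * 16-byte aligned (an assumption of this sketch, not checked). */
void copy_nt(void *restrict dst, const void *restrict src, size_t n) {
    u64x2 *d = (u64x2 *)dst;
    const u64x2 *s = (const u64x2 *)src;
    for (size_t i = 0; i < n / 16; i++) {
#if defined(__clang__)
        /* Ask the compiler for a non-temporal store; it may or may not
         * honour the hint depending on target and type. */
        __builtin_nontemporal_store(s[i], &d[i]);
#else
        d[i] = s[i];  /* plain store fallback for other compilers */
#endif
    }
}
```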

@Wilco1
Contributor

Wilco1 commented Apr 11, 2024

It's like software prefetching - there are enough people who claim it is useful, so such instructions tend to get added to ISAs. However, user prefetches typically use an incorrect prefetch distance, are overused, take up precious load/store resources and may interfere with the hardware prefetcher. Hardware prefetchers do a better job, so many CPUs just drop user prefetches...
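(For concreteness, this is the software-prefetch pattern being criticised, using the real GCC/Clang `__builtin_prefetch` builtin. The prefetch distance of 64 elements is an arbitrary guess -- exactly the kind of hand-tuned, microarchitecture-specific parameter the comment above says tends to be wrong, and CPUs are free to ignore the hint entirely.)

```c
#include <stddef.h>

/* Sums an array while issuing explicit prefetches a fixed distance
 * ahead of the current element. */
long sum_with_prefetch(const long *a, size_t n) {
    const size_t dist = 64;  /* arbitrary distance: the tuning hazard */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            /* args: address, rw=0 (read), locality=0 (non-temporal) */
            __builtin_prefetch(&a[i + dist], 0, 0);
        sum += a[i];
    }
    return sum;
}
```

The result is identical with or without the prefetch; only the (possibly negative) performance effect differs, which is the thread's point.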

@Wilco1 Wilco1 closed this as completed Aug 14, 2024