non-temporal load/store memory routines? #68

Closed
enh-google opened this issue Mar 18, 2024 · 5 comments
@enh-google

we just had this submission to aosp from qcom: https://android-review.googlesource.com/c/platform/bionic/+/3002758

it seems to be your existing string/aarch64/memcpy.S but with an extra "if > 48KiB, use non-temporal loads/stores" case.

i'm guessing that this kind of thing is likely to be problematic because userspace can't easily query the cache size on arm64[1], so we'd need a bunch of different copies of this for different cache sizes? or a global that libc writes and optimized-routines reads?


  1. if i'm wrong, let me know how: i'd love to fix bionic's arm64 sysconf() to match the other architectures!
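(For context, a minimal sketch of what userspace cache-size discovery looks like on Linux. The sysfs path is the standard cacheinfo ABI, but on many arm64 systems firmware does not populate it, so callers must handle failure -- which is exactly the gap described above. The `"48K"`-style parsing is an assumption for illustration, not production code.)

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns the cache size in bytes for the given cpu/index, or -1 if
 * the sysfs entry is missing or unparseable. Assumes the usual "48K"
 * format in the size file; a sketch, not robust parsing. */
long cache_size_bytes(int cpu, int index) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cache/index%d/size", cpu, index);
    FILE *f = fopen(path, "r");
    if (!f) return -1;  /* common on arm64: firmware didn't describe caches */
    char buf[32] = {0};
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    if (!ok) return -1;
    long kib = strtol(buf, NULL, 10);  /* "48K" -> 48 */
    return kib > 0 ? kib * 1024 : -1;
}
```

Calling `cache_size_bytes(0, 0)` returns -1 on systems where the cacheinfo entries are absent, which is the case this thread is concerned with.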
@Wilco1
Contributor

Wilco1 commented Apr 9, 2024

Memcpy is highly tuned for small sizes because the vast majority of copies are small (over 95% are less than 256 bytes). Modern CPUs will automatically prefetch and use write streaming, so there is no benefit in trying to detect the cache size and tuning memcpy that way.

@enh-google
Author

yeah, i don't have the relevant hardware to test this. the equivalent intel routines all have the same hack, though.

i think the real problem is that popular benchmarks like to do large clears/copies. i have no reason to believe these optimizations are meant for anything relevant to actual users, but as long as reviewers use meaningless benchmarks, SoC vendors will chase those meaningless benchmarks.

@Wilco1
Contributor

Wilco1 commented Apr 10, 2024

Yes, those kinds of benchmarks are useless, but we still don't need cache-size hacks like x86. Standard load/store instructions already get the maximum bandwidth across L1, L2, L3 and DRAM.

Why would you make standard loads/stores slower and force people to use specialized instructions?

@enh-google
Author

i'm not trolling -- i genuinely don't know the answer to this -- but why does arm64 have those instructions then? i've personally never seen them used for anything except this (on intel either). and you can't express this except in assembler, right? so even if you had some more general algorithm that wanted to avoid churning the cache with rubbish, you couldn't get the effect without rewriting it in assembler?

or is it just meant for kernel page-clearing use or something?
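(One hedged aside on the "only in assembler" point: Clang does expose `__builtin_nontemporal_store` and `__builtin_nontemporal_load`, so C code can at least request the hint without dropping to assembly -- though whether it actually lowers to aarch64 STNP/LDNP, or to x86 MOVNT, depends on the target and the access width. The copy loop below is an illustrative sketch, not a tuned memcpy.)

```c
#include <stddef.h>
#include <stdint.h>

/* 16-byte vector type so the store width has a chance of mapping to a
 * non-temporal pair/streaming store on the targets that support one. */
typedef uint64_t u64x2 __attribute__((vector_size(16)));

/* Copies n bytes; assumes n is a multiple of 16 and both pointers are
 * 16-byte aligned (an assumption of this sketch, not checked). */
void copy_nt(void *restrict dst, const void *restrict src, size_t n) {
    u64x2 *d = (u64x2 *)dst;
    const u64x2 *s = (const u64x2 *)src;
    for (size_t i = 0; i < n / 16; i++) {
#if defined(__clang__)
        /* Ask the compiler for a non-temporal store; it may or may not
         * honour the hint depending on target and type. */
        __builtin_nontemporal_store(s[i], &d[i]);
#else
        d[i] = s[i];  /* plain store fallback for other compilers */
#endif
    }
}
```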

@Wilco1
Contributor

Wilco1 commented Apr 11, 2024

It's like software prefetching - there are enough people who claim it is useful, so such instructions tend to get added to ISAs. However, user prefetches typically use an incorrect prefetch distance, are overused, take up precious load/store resources and may interfere with the hardware prefetcher. Hardware prefetchers do a better job, so many CPUs just drop user prefetches...
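(For concreteness, this is the software-prefetch pattern being criticised, using the real GCC/Clang `__builtin_prefetch` builtin. The prefetch distance of 64 elements is an arbitrary guess -- exactly the kind of hand-tuned, microarchitecture-specific parameter the comment above says tends to be wrong, and CPUs are free to ignore the hint entirely.)

```c
#include <stddef.h>

/* Sums an array while issuing explicit prefetches a fixed distance
 * ahead of the current element. */
long sum_with_prefetch(const long *a, size_t n) {
    const size_t dist = 64;  /* arbitrary distance: the tuning hazard */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            /* args: address, rw=0 (read), locality=0 (non-temporal) */
            __builtin_prefetch(&a[i + dist], 0, 0);
        sum += a[i];
    }
    return sum;
}
```

The result is identical with or without the prefetch; only the (possibly negative) performance effect differs, which is the thread's point.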

@Wilco1 Wilco1 closed this as completed Aug 14, 2024