non-temporal load/store memory routines? #68
Memcpy is highly tuned for small sizes because the vast majority of copies are small (>95% are less than 256 bytes). Modern CPUs automatically prefetch and use write streaming, so there is no point in trying to detect the cache size and tuning memcpy that way.
yeah, i don't have the relevant hardware to test this. the equivalent intel routines all have the equivalent hack though. i think the real problem is that popular benchmarks like to do large clears/copies. i have no reason to believe these optimizations are meant for anything relevant to actual users, but as long as reviewers use meaningless benchmarks, SoC vendors will chase those meaningless benchmarks.
Yes, those kinds of benchmarks are useless, but we still don't need cache-size hacks like x86. Standard load/store instructions already get the maximum bandwidth across L1, L2, L3 and DRAM. Why would you make standard loads/stores slower and force people to use specialized instructions?
i'm not trolling -- i actually don't know the answer to this -- but "why does arm64 have those instructions then?". i've personally never seen them used for anything except this (on intel either). you can't express this except in assembler either, right? so even if you had some more general algorithm that wanted to avoid churning the cache with rubbish, you wouldn't be able to get the effect without rewriting it in assembler? or is it just meant for kernel page-clearing use or something?
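(for reference: the arm64 non-temporal instructions in question are the load/store pair variants LDNP/STNP. here's a rough sketch of what "expressing this from C" looks like via inline asm -- the function name is made up, `len` is assumed to be a nonzero multiple of 16, and on anything other than aarch64 it just falls back to plain memcpy so the sketch stays portable:)

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: a copy loop using the arm64 non-temporal pair instructions
 * (LDNP/STNP).  "nt_copy" is an invented name; len is assumed to be a
 * nonzero multiple of 16.  Off aarch64 this falls back to memcpy. */
static void nt_copy(void *dst, const void *src, size_t len)
{
#if defined(__aarch64__)
    unsigned char *d = dst;
    const unsigned char *s = src;
    __asm__ volatile(
        "1:\n\t"
        "ldnp x6, x7, [%[s]]\n\t"      /* non-temporal 16-byte load  */
        "stnp x6, x7, [%[d]]\n\t"      /* non-temporal 16-byte store */
        "add  %[s], %[s], #16\n\t"
        "add  %[d], %[d], #16\n\t"
        "subs %[len], %[len], #16\n\t" /* count down by 16 bytes     */
        "b.hi 1b"                      /* loop while len > 0         */
        : [d] "+r"(d), [s] "+r"(s), [len] "+r"(len)
        :
        : "x6", "x7", "memory", "cc");
#else
    memcpy(dst, src, len);             /* portable fallback */
#endif
}
```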
It's like software prefetching - there are enough people who claim they are useful, so they tend to get added to ISAs. However user prefetches typically use an incorrect prefetch distance, are overused, take up precious load/store resources and may interfere with hardware prefetch. Hardware prefetchers do a better job, so many CPUs just drop user prefetches... |
we just had this submission to aosp from qcom: https://android-review.googlesource.com/c/platform/bionic/+/3002758
it seems to be your existing string/aarch64/memcpy.S but with an extra "if > 48KiB, use non-temporal loads/stores" case. i'm guessing that this kind of thing is likely to be problematic because userspace can't easily query the cache size on arm64[1], so we'd need a bunch of different copies of this for different cache sizes? or a global that libc writes and optimized-routines reads?
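(to make the dispatch concrete, here's a hedged C sketch of the kind of size-threshold dispatch that patch adds -- the names are invented, the 48KiB constant comes from the patch above, the writable global is the "libc writes it, optimized-routines reads it" idea, and `memcpy_nontemporal` stands in for the real non-temporal variant but just falls through to ordinary memcpy here so the sketch stays portable:)

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a size-threshold dispatch.  48 KiB matches the
 * constant in the qcom patch; the global models libc setting a per-SoC
 * value at startup instead of hard-coding one cache size. */
static size_t __nt_threshold = 48 * 1024;  /* libc could overwrite at init */

/* Stand-in for the real non-temporal path (the extra case in memcpy.S);
 * here it is just plain memcpy so the sketch compiles everywhere. */
static void *memcpy_nontemporal(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    if (n > __nt_threshold)                    /* large copy: bypass caches */
        return memcpy_nontemporal(dst, src, n);
    return memcpy(dst, src, n);                /* common small-copy path    */
}
```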