optimizing division-by-constant on AArch32(ARM32) #63731
Comments
It looks like this form of umulh utilizing both umull and umlal might be shorter, but I haven't benchmarked or tested it:
clang 16 generates (https://c.godbolt.org/z/1WxqanGq4) (see 38ffa2b):
But we only do this transform for certain constants, and your suggested sequence appears to be shorter.
Neat stuff! I didn't realize it knew div3, so I'm afraid I over-simplified my example. I need to divide a uint64_t by 1e6 (seconds+nanoseconds to milliseconds) or 1e3 (seconds+nanoseconds to microseconds). I put together a fuller milliseconds example with both approaches (https://godbolt.org/z/qKnEr18f6):
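For readers without the Godbolt link open, here is a minimal sketch of the multiply-by-inverse approach for the 1e6 case. It is not the linked code: the names `mulhi64` and `ns_to_millis_mulhi` are hypothetical, and the input is assumed to be a total nanosecond count already combined into one uint64_t. The constant is ceil(2^82 / 10^6), so the quotient is the high 64 bits of the product shifted right by a further 18 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: high 64 bits of a * b, using only 32x32->64 multiplies. */
static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Middle column of the schoolbook product; this sum cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Hypothetical name: total nanoseconds -> whole milliseconds, i.e. ns / 1000000. */
static inline uint64_t ns_to_millis_mulhi(uint64_t ns)
{
    /* 0x431BDE82D7B634DB == ceil(2^82 / 1000000); total shift is 64 + 18. */
    return mulhi64(ns, 0x431BDE82D7B634DBull) >> 18;
}

/* Self-check against the plain C division. */
void ns_to_millis_self_check(void)
{
    assert(ns_to_millis_mulhi(999999u) == 0);
    assert(ns_to_millis_mulhi(1000000u) == 1);
    assert(ns_to_millis_mulhi(UINT64_MAX) == UINT64_MAX / 1000000u);
}
```

The 1e3 case would follow the same pattern with its own constant and shift, which the compiler's AArch64 output for `n / 1000` shows.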
Another random data point: on a Raspberry Pi 4 (GCC 10, -O3 -m32), a realtime thread on a low-interrupt core can run ts_to_millis_umulh() about 36936656 times per second. Using ts_to_millis_c() takes about 23x longer, because it's a 32-bit target that needs to maintain binary compatibility with the Pi 0/1, which use the ARM1176 without division instructions.
I found that both GCC 12 and Clang 11 call __aeabi_uldivmod when dividing a uint64_t by a constant on AArch32, and that it's possible to emulate the AArch64 umulh instruction, which enables the division-by-constant optimization even on AArch32. I wrote this up here and will put the relevant bits below; there's a GCC bugzilla suggestion as well. This optimization applies only to Arm cores with the umull instruction, which covers many of them, but umull is not present on the smallest microcontrollers and some older cores.
C code for dividing uint64_t by a constant
Clang 11 (-O3 -mcpu=cortex-m4):
However, Clang16 for AArch64 implements division by 3 using multiplication by the inverse as:
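As a C-level sketch of what that multiply-by-inverse sequence computes: the quotient is the high 64 bits of `n * 0xAAAAAAAAAAAAAAAB`, shifted right by one more bit. The function names here are hypothetical, and `unsigned __int128` is a GCC/Clang extension assumed to be available (it is on 64-bit targets, not on AArch32).

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product, which is what AArch64 umulh returns.
 * Relies on the unsigned __int128 extension (GCC/Clang, 64-bit targets). */
static inline uint64_t umulh_wide(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* n / 3 without a divide: 0xAAAAAAAAAAAAAAAB == (2^65 + 1) / 3. */
static inline uint64_t div3_by_inverse(uint64_t n)
{
    return umulh_wide(n, 0xAAAAAAAAAAAAAAABull) >> 1;
}

/* Self-check against the plain C division. */
void div3_self_check(void)
{
    assert(div3_by_inverse(2) == 0);
    assert(div3_by_inverse(9) == 3);
    assert(div3_by_inverse(UINT64_MAX) == UINT64_MAX / 3);
}
```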
While an AArch32 core like the Cortex-M4 does not have the umulh instruction, it does have umull (32-bit x 32-bit multiply with 64-bit result). Using that, we can emulate umulh with the following function:
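One way to write such an emulation in portable C, as a sketch rather than the exact function from the write-up: split each operand into 32-bit halves and combine four 32x32->64 partial products. The name `umulh_emulated` is hypothetical, and whether a given compiler lowers the multiplies to umull/umlal is an assumption to check in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b, built from four
 * 32x32->64 multiplies (the operation umull provides on AArch32). */
static inline uint64_t umulh_emulated(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Sum the middle column plus the carry out of the low column.
     * The maximum value is exactly 2^64 - 1, so this cannot overflow. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Self-check on values that can be verified by hand. */
void umulh_emulated_self_check(void)
{
    assert(umulh_emulated(0, UINT64_MAX) == 0);
    assert(umulh_emulated(1ull << 32, 1ull << 32) == 1);
    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1, whose high half is 2^64 - 2. */
    assert(umulh_emulated(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1);
}
```

Dividing by a constant then becomes `umulh_emulated(n, magic) >> shift`, with `magic` and `shift` taken from the AArch64 output for the same divisor.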
I will refer to this as umulh emulation. In testing on Cortex-M4 it gives a speedup of between 2x and 37x, depending on various factors.
Firstly, the execution time of umulh() is a constant on Cortex-M4, but __aeabi_uldivmod() execution time depends on the numerator.
If the core and compiler-support library implement __aeabi_uldivmod() using the udiv instruction, the speedup for emulating umulh is relatively small, only around 2-4x on Cortex-M4. But if not, the umulh approach can divide numbers at about 28-37x the rate of __aeabi_uldivmod(). Roughly we can group cores as follows:
If this is of interest, it might be worth putting umulh into the compiler support library and using it to divide uint64_t by constants.