Simplify structure of bignum inline assembly #5706
8b5cfc5 to be109c0
(Removed comment which was actually about parent PR)
ba71c38 to 85bb975
85bb975 to a64c167
(Changed the base branch twice in order to force-refresh GitHub's commit list, now that #5701 has been merged.)
LGTM, however the CI doesn't agree. Also, did you test the changes to the UMAAL-based assembly? (Including with an operand size that doesn't just exercise the X2 part but also X1.)
Signed-off-by: Hanno Becker <hanno.becker@arm.com>
Compilers are likely to generate shorter assembly for loops of the form `while( cnt-- ) { ... }` rather than `for( ; count >= X; count -= X ) { ... }`. (E.g. the latter needs a subtract+compare+branch after each iteration, while the former only needs decrement+branch.)
Signed-off-by: Hanno Becker <hanno.becker@arm.com>
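To make the two loop shapes from the commit message above concrete, here is a minimal sketch in portable C. The function names `sum_chunks_for` and `sum_chunks_while` and the summing workload are hypothetical and only serve as a stand-in for the real multiply-accumulate body. Whether a given compiler actually emits shorter code for the second form depends on the target and optimization level, so this is worth checking in the generated assembly rather than taking for granted.

```c
#include <stddef.h>
#include <stdint.h>

/* Form 1: the loop counter is compared against the chunk size on every
 * iteration and then reduced by it. */
uint32_t sum_chunks_for(const uint32_t *p, size_t count)
{
    uint32_t acc = 0;
    for (; count >= 8; count -= 8) {
        for (int k = 0; k < 8; k++)
            acc += *p++;
    }
    for (; count > 0; count--)
        acc += *p++;
    return acc;
}

/* Form 2: the number of 8-fold and 1-fold steps is computed once up front,
 * so each loop only has to count down to zero. */
uint32_t sum_chunks_while(const uint32_t *p, size_t count)
{
    size_t steps_x8 = count / 8;
    size_t steps_x1 = count & 7;
    uint32_t acc = 0;
    while (steps_x8--) {
        for (int k = 0; k < 8; k++)
            acc += *p++;
    }
    while (steps_x1--)
        acc += *p++;
    return acc;
}
```

Both functions visit the same elements in the same order; only the loop control differs.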
a64c167 to d46d96c
6dff41c to 606cb16
Have you manually tested the changes to the
@mpg Yes, I manually tested the UMAAL assembly, both the x2 and the x1 form.
LGTM
The x2-form is about 15% faster than x1 on Cortex-M4, which is in the expected range since we are saving 1/2 memory stall per UMAAL pair.
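For readers unfamiliar with the instruction: `UMAAL RdLo, RdHi, Rn, Rm` computes `RdHi:RdLo = Rn * Rm + RdLo + RdHi`, which is exactly one limb of a multiply-accumulate with carry. The following is a portable C model of that semantics, not the inline assembly from this PR. The names `umaal_model` and `muladd_x1_model` are hypothetical, and the sketch says nothing about the memory-stall behaviour that makes the x2 form faster.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of UMAAL: (hi:lo) = n * m + lo + hi.
 * The sum cannot overflow 64 bits, since
 * (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t n, uint32_t m)
{
    uint64_t t = (uint64_t)n * m + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* d[0..len) += s[0..len) * b, one modelled UMAAL per limb.
 * Returns the carry out of the top limb. */
uint32_t muladd_x1_model(uint32_t *d, const uint32_t *s, size_t len, uint32_t b)
{
    uint32_t c = 0;
    while (len--) {
        uint32_t lo = *d;                 /* accumulator limb */
        umaal_model(&lo, &c, *s++, b);    /* c is both carry-in and carry-out */
        *d++ = lo;
    }
    return c;
}
```

Because the carry lives in the `RdHi` operand, no separate add-with-carry instructions are needed between limbs.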
LGTM
for( ; s_len > 0; s_len-- )
while( steps_x8-- )
{
Does this resolve #1717, or are there further savings to be had?
Well, we could always save more code size by doing no unrolling at all, but that would come at a performance cost, at least for some platforms and some applications. I measured code size and performance for 3 options on 3 platforms, and my conclusion was that keeping 8-1 (an 8-fold loop followed by a 1-fold loop) was a good compromise between code size, performance, and simplicity: there are platforms with optimized asm for 8 steps at once, so we need 8-1 for them, and it's simpler to have the same steps on all platforms.
We can always fine-tune things later, but my personal opinion is that most of the savings were had by removing the extra 16x loop, and we should leave it at that for now.
Dependencies: Based on #5701 (merged).

This PR follows @mpg's patch in #5360 to simplify organization and use of our bignum inline assembly routines:

- `MULADDC_X1_[INIT|CORE|STOP]` must be defined.
- `MULADDC_X[2,4,8]_[INIT|CORE|STOP]` can be defined if it offers speedup potential. If it is not defined, it is auto-derived from the next lower `MULADDC` macros.
- `MULADDC_HUIT` are thus renamed to `MULADDC_X8_CORE`. `STOP`-macros previously, which made the code harder to read. This gets simpler now that every `MULADDC_Xi_xxx` family has their own `INIT` and `STOP`.
- `mpi_mul_hlp()` is simplified to always call the 8-fold code and then the 1-fold code. Since `MULADDC_X8_xxx` is always defined (even if it's just auto-derived from `MULADDC_X1_xxx`), this makes the core more readable.

As a concrete performance improvement based on those changes, `MULADDC_X2_xxx` is introduced for M-profile MCUs implementing `UMAAL`.
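The macro layering described above can be sketched in portable C as follows. This is an illustration under stated assumptions, not the code from this PR: the `MULADDC_X1_xxx` bodies here are plain C stand-ins for what would be inline assembly on a real port, the `#ifndef` guard on the `CORE` macro is one assumed way to do the auto-derivation, and `mul_hlp_sketch` is a hypothetical name for a reduced version of `mpi_mul_hlp()` that returns the final carry instead of propagating it.

```c
#include <stddef.h>
#include <stdint.h>

/* Mandatory 1-fold family (portable stand-in). The macros expect the
 * locals s (source), d (destination), c (carry) and b (multiplier). */
#define MULADDC_X1_INIT { uint64_t t_;
#define MULADDC_X1_CORE t_ = (uint64_t)(*s++) * b + *d + c; \
                        *d++ = (uint32_t)t_; c = (uint32_t)(t_ >> 32);
#define MULADDC_X1_STOP }

/* Each wider family is derived from the next lower one unless a port has
 * already provided an optimized version. */
#ifndef MULADDC_X2_CORE
#define MULADDC_X2_INIT MULADDC_X1_INIT
#define MULADDC_X2_STOP MULADDC_X1_STOP
#define MULADDC_X2_CORE MULADDC_X1_CORE MULADDC_X1_CORE
#endif

#ifndef MULADDC_X4_CORE
#define MULADDC_X4_INIT MULADDC_X2_INIT
#define MULADDC_X4_STOP MULADDC_X2_STOP
#define MULADDC_X4_CORE MULADDC_X2_CORE MULADDC_X2_CORE
#endif

#ifndef MULADDC_X8_CORE
#define MULADDC_X8_INIT MULADDC_X4_INIT
#define MULADDC_X8_STOP MULADDC_X4_STOP
#define MULADDC_X8_CORE MULADDC_X4_CORE MULADDC_X4_CORE
#endif

/* d[0..s_len) += s[0..s_len) * b: 8-fold steps first, then 1-fold steps. */
uint32_t mul_hlp_sketch(uint32_t *d, const uint32_t *s, size_t s_len, uint32_t b)
{
    uint32_t c = 0;
    size_t steps_x8 = s_len / 8;
    size_t steps_x1 = s_len & 7;

    while (steps_x8--) {
        MULADDC_X8_INIT
        MULADDC_X8_CORE
        MULADDC_X8_STOP
    }
    while (steps_x1--) {
        MULADDC_X1_INIT
        MULADDC_X1_CORE
        MULADDC_X1_STOP
    }
    return c;
}
```

Since `MULADDC_X8_xxx` always exists after the derivation, the helper needs no `#ifdef` around the 8-fold loop, which is the readability gain the description refers to.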