Simplify structure of bignum inline assembly #5706

hanno-becker · 2022-04-07T09:03:54Z

Dependencies: ~~Based on #5701~~ merged

This PR follows @mpg's patch in #5360 to simplify organization and use of our bignum inline assembly routines:

- `MULADDC_X1_[INIT|CORE|STOP]` must be defined.
- `MULADDC_X[2,4,8]_[INIT|CORE|STOP]` can be defined if it offers speedup potential. If a family is not defined, it is auto-derived from the next lower `MULADDC` family.
- Instances of `MULADDC_HUIT` are thus renamed to `MULADDC_X8_CORE`.
- The x8-variants did not have their own STOP-macros previously, which made the code harder to read. This gets simpler now that every `MULADDC_Xi_xxx` family has its own INIT and STOP.
- The core multiplication routine `mpi_mul_hlp()` is simplified to always call the 8-fold code and then the 1-fold code. Since `MULADDC_X8_xxx` is always defined (even if it's just auto-derived from `MULADDC_X1_xxx`), this makes the core more readable.
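The layering described above can be sketched roughly as follows. This is a minimal illustration in portable C, not the code from this PR: the 1-fold family is written in plain C instead of inline assembly, and `limb_t`, `acc_` and `muladd_sketch()` are hypothetical names chosen for the example. The sketch also returns the final carry instead of propagating it into the output, which is a simplification compared to `mpi_mul_hlp()`.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical stand-in for the limb type */

/* Mandatory 1-fold family. Portable C here; a real port would use
 * inline assembly. Expects s, d, b and c to be in scope. */
#define MULADDC_X1_INIT   { uint64_t acc_;
#define MULADDC_X1_CORE   acc_ = (uint64_t) *s++ * b + *d + c;  \
                          *d++ = (limb_t) acc_;                 \
                          c    = (limb_t) ( acc_ >> 32 );
#define MULADDC_X1_STOP   }

/* Optional wider families: derived from the next lower family
 * whenever the platform has not provided its own definition. */
#if !defined(MULADDC_X2_CORE)
#define MULADDC_X2_INIT   MULADDC_X1_INIT
#define MULADDC_X2_STOP   MULADDC_X1_STOP
#define MULADDC_X2_CORE   MULADDC_X1_CORE MULADDC_X1_CORE
#endif

#if !defined(MULADDC_X4_CORE)
#define MULADDC_X4_INIT   MULADDC_X2_INIT
#define MULADDC_X4_STOP   MULADDC_X2_STOP
#define MULADDC_X4_CORE   MULADDC_X2_CORE MULADDC_X2_CORE
#endif

#if !defined(MULADDC_X8_CORE)
#define MULADDC_X8_INIT   MULADDC_X4_INIT
#define MULADDC_X8_STOP   MULADDC_X4_STOP
#define MULADDC_X8_CORE   MULADDC_X4_CORE MULADDC_X4_CORE
#endif

/* Computes d[0..n-1] += s[0..n-1] * b and returns the final carry limb.
 * Always runs the 8-fold loop first, then the 1-fold loop. */
static limb_t muladd_sketch( limb_t *d, const limb_t *s, size_t n, limb_t b )
{
    limb_t c = 0;
    size_t steps_x8 = n / 8;
    size_t steps_x1 = n & 7;

    while( steps_x8-- )
    {
        MULADDC_X8_INIT
        MULADDC_X8_CORE
        MULADDC_X8_STOP
    }

    while( steps_x1-- )
    {
        MULADDC_X1_INIT
        MULADDC_X1_CORE
        MULADDC_X1_STOP
    }

    return( c );
}
```

Because the 8-fold family always exists after the derivation step, the driver needs no `#if` blocks of its own; a platform that wants a faster path only has to define the corresponding family before the fallback definitions are reached.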

As a concrete performance improvement based on those changes, MULADDC_X2_xxx is introduced for M-profile MCUs implementing UMAAL.
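For readers unfamiliar with UMAAL: the instruction computes a 32x32-bit product and adds two 32-bit accumulators to it, producing a 64-bit result that cannot overflow. The sketch below models that behaviour in portable C to show why a single UMAAL covers one multiply-accumulate-with-carry step. The function names `umaal_model()` and `muladdc_step()` are hypothetical, and how the PR's assembly actually schedules loads and stores around the instruction is not shown here.

```c
#include <stdint.h>

/* Portable model of UMAAL RdLo, RdHi, Rn, Rm:
 *   RdHi:RdLo = Rn * Rm + RdLo + RdHi
 * The maximum value is (2^32-1)^2 + 2*(2^32-1) = 2^64-1,
 * so the 64-bit result never overflows. */
static void umaal_model( uint32_t *rdlo, uint32_t *rdhi,
                         uint32_t rn, uint32_t rm )
{
    uint64_t t = (uint64_t) rn * rm + *rdlo + *rdhi;
    *rdlo = (uint32_t) t;
    *rdhi = (uint32_t) ( t >> 32 );
}

/* One multiply-accumulate step expressed with the model:
 * *d receives the low word of s * b + *d + *c,
 * *c receives the high word (the new carry). */
static void muladdc_step( uint32_t *d, uint32_t s, uint32_t b, uint32_t *c )
{
    umaal_model( d, c, s, b );
}
```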

hanno-becker · 2022-04-11T05:18:41Z

(Removed comment which was actually about parent PR)

mpg · 2022-04-15T08:34:53Z

(Changed the base branch twice in order to force-refresh github's commit list, now that #5701 has been merged.)

mpg

LGTM, however the CI doesn't agree. Also, did you test the changes to the UMAAL-based assembly? (Including with an operand size that doesn't just exercise the X2 part but also X1.)

Compilers are likely to generate shorter assembly for loops of the form `while( cnt-- ) { ... }` than for loops of the form `for( ; count >= X; count -= X ) { ... }`. (E.g. the latter needs a subtract+compare+branch after each iteration, while the former only needs decrement+branch.)

Signed-off-by: Hanno Becker <hanno.becker@arm.com>
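The two loop shapes mentioned in this commit message can be compared side by side. The following is a small sketch with hypothetical function names (`blocks_for()`, `blocks_while()`); it only demonstrates that both shapes run the same number of iterations when `x` is nonzero, and makes no claim about the code any particular compiler emits.

```c
#include <stddef.h>

/* Shape the commit moves away from: the remaining count is compared
 * against the block size x on every iteration. */
static size_t blocks_for( size_t count, size_t x )
{
    size_t blocks = 0;
    for( ; count >= x; count -= x )
        blocks++;
    return( blocks );
}

/* Shape the commit moves to: the number of iterations is computed
 * once up front, then the loop simply counts down to zero. */
static size_t blocks_while( size_t count, size_t x )
{
    size_t blocks = 0;
    size_t cnt = count / x;
    while( cnt-- )
        blocks++;
    return( blocks );
}
```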

hanno-becker · 2022-04-19T05:09:08Z

@mpg CI is happy now. I assume it failed because the branch was still based on an old version of the base PR #5701.

mpg · 2022-04-19T08:01:31Z

Have you manually tested the changes to the UMAAL-based assembly? Because the CI didn't. I'm happy with the code but holding my approval until this has been clarified.

hanno-becker · 2022-04-19T09:03:25Z

@mpg Yes, I manually tested the UMAAL assembly, both the x2 and the x1 form.

mpg

LGTM

hanno-becker · 2022-04-19T09:06:59Z

The x2-form is about 15% faster than x1 on Cortex-M4, which is in the expected range since we are saving 1/2 memory stall per UMAAL pair.

tom-cosgrove-arm

LGTM

gilles-peskine-arm · 2022-05-13T14:45:46Z

library/bignum.c


-    for( ; s_len > 0; s_len-- )
+    while( steps_x8-- )
    {


Does this resolve #1717, or are there further savings to be had?

mpg · 2022-05-16

Well, we could always save more code size by doing no unrolling at all, but that would come at a performance cost, at least for some platforms and some applications. I measured code size and performance for 3 options on 3 platforms, and my conclusion was that keeping 8-1 was a good compromise between code size, perf, and simplicity (there are platforms with optimized asm for 8 steps at once, so we need 8-1 for them, and it's simpler to have the same steps on all platforms).

We can always fine-tune things later, but my personal opinion is that most of the savings were had by removing the extra 16x loop, and we should leave it at that for now.

hanno-becker requested review from mpg and tom-cosgrove-arm April 7, 2022 09:03

hanno-becker force-pushed the bn_mul_cleanup branch 2 times, most recently from 8b5cfc5 to be109c0 Compare April 7, 2022 09:11

hanno-becker added the component-crypto (Crypto primitives and low-level interfaces) and needs-ci (Needs to pass CI tests) labels Apr 7, 2022

mpg mentioned this pull request Apr 7, 2022

Improve code/perf trade-off in bignum multiplication #5373

Closed

hanno-becker force-pushed the bn_mul_cleanup branch 2 times, most recently from ba71c38 to 85bb975 Compare April 11, 2022 12:47

hanno-becker mentioned this pull request Apr 12, 2022

Make size of output in mpi_mul_hlp() explicit #5701

Merged

hanno-becker force-pushed the bn_mul_cleanup branch from 85bb975 to a64c167 Compare April 12, 2022 20:47

tom-cosgrove-arm added the needs-preceding-pr (Requires another PR to be merged first) label Apr 12, 2022

mpg changed the base branch from development to coverity_scan April 15, 2022 08:27

mpg changed the base branch from coverity_scan to development April 15, 2022 08:27

mpg removed the needs-preceding-pr (Requires another PR to be merged first) label Apr 15, 2022

mpg previously approved these changes Apr 15, 2022


Hanno Becker added 3 commits April 17, 2022 06:16

Simplify organization of inline assembly for bignum

eacf3b9

Signed-off-by: Hanno Becker <hanno.becker@arm.com>

Add 2-fold unrolled assembly for umaal based multiplication

d46d96c

Signed-off-by: Hanno Becker <hanno.becker@arm.com>

hanno-becker dismissed mpg’s stale review via d46d96c April 17, 2022 05:20

hanno-becker force-pushed the bn_mul_cleanup branch from a64c167 to d46d96c Compare April 17, 2022 05:20

Add comment explaining structure of UMAAL assembly

606cb16

Signed-off-by: Hanno Becker <hanno.becker@arm.com>

hanno-becker force-pushed the bn_mul_cleanup branch from 6dff41c to 606cb16 Compare April 17, 2022 05:59

hanno-becker requested a review from mpg April 19, 2022 05:09

mpg added the needs-review (Every commit must be reviewed by at least two team members) label and removed the needs-ci (Needs to pass CI tests) label Apr 19, 2022

mpg approved these changes Apr 19, 2022


tom-cosgrove-arm approved these changes Apr 19, 2022


mpg added the approved (Design and code approved - may be waiting for CI or backports) label and removed the needs-review (Every commit must be reviewed by at least two team members) label Apr 19, 2022

mpg merged commit 46435f0 into Mbed-TLS:development Apr 19, 2022

tom-cosgrove-arm mentioned this pull request Apr 27, 2022

Improve inline assembly for Cortex-M + DSP #5360

Closed

gilles-peskine-arm mentioned this pull request May 13, 2022

mpi_mul_hlp code size is huge #1717

Open

gilles-peskine-arm reviewed May 13, 2022


tom-cosgrove-arm mentioned this pull request Jul 15, 2022

Building v3.2.1 on ARM fails #6089

Closed

tuxuser mentioned this pull request Jul 23, 2022

PR #5706 broke building for ARM Cortex A-9 VFPv3 #6124

Closed
