mpi_mul_hlp code size is huge #1717

jlaitine · 2018-06-11T09:36:21Z

Description

Type: Enhancement
Priority: Minor

mbed TLS build:
Version: 2.10.0 , git commit c041435

Enhancement\Feature Request

Justification - why does the library need this feature?
We are building mbedtls for signature checking for an embedded device (arm cortex core), and found out that the current code in library/bignum.c: mpi_mul_hlp() unrolls a loop big time like this:


    for( ; i >= 16; i -= 16 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }

    for( ; i >= 8; i -= 8 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }
    for( ; i > 0; i-- )
    {
        MULADDC_INIT
        MULADDC_CORE
        MULADDC_STOP
    }

This generates quite big code, no matter of the compiler optimizations. We'd like to have an option to define out all but the last loop, which would then leave the unrolling to the compiler. We can save around 4kB of .text with this fix, which is > 25% of all the code that we need for signature checking.

Suggested enhancement

Something like this:

#else /* MULADDC_HUIT */
+#if !defined(MBEDTLS_SIZE_OPTIMIZED_MULADDC)
    for( ; i >= 16; i -= 16 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }

    for( ; i >= 8; i -= 8 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }
+#endif
    for( ; i > 0; i-- )
    {
        MULADDC_INIT
        MULADDC_CORE
        MULADDC_STOP
    }
#endif /* MULADDC_HUIT */

The text was updated successfully, but these errors were encountered:

jlaitine · 2018-06-11T09:57:34Z

I created a pull request showing what I am using locally:
#1718

pkolbus · 2018-12-20T13:46:05Z

Before seeing this issue, I wrote a similar patch as I also have interest for this. Although the benefit is less with #1618, there's still value on Cortex-M4. Are there any plans to move PR #1718 forward? Would you be interested in a PR from me?

jlaitine · 2018-12-21T09:09:16Z

The CLA signing process, what would have been needed in order to integrate my patch, got stuck somewhere in the company I work for, so my patch probably won't get integrated.

As you have done the similar thing already before seeing my patch, please do make a pull request, I guess this would benefit many people with memory-constrained systems.

gilles-peskine-arm · 2022-05-13T14:45:00Z

#5706 has changed the structure of bignum multiplication to have less manual unrolling. For arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m0plus (7-2018-q3-update), bignum.o (including functions only used for RSA and not for ECC) goes down from 12425 bytes to 11421.

@hanno-arm Is there more to gain there? Or can we consider this issue to be resolved?

mpg · 2022-05-16T07:42:37Z

See also #5360 (comment) for some data.

I'm not super inclined to add yet another compile time option, so I'd like to find a compromise that's acceptable for most platforms, and I think 8-1 is a good compromise.

tom-cosgrove-arm · 2023-07-11T15:01:26Z

The code in this area has changed quite a bit, so #1718 was closed, and other changes in #5373 also closed, until eventually the changes in #5706 were merged.

With that, the best change that could be made would be

diff --git a/library/bignum_core.c b/library/bignum_core.c
index de57cfc04..f272a0c41 100644
--- a/library/bignum_core.c
+++ b/library/bignum_core.c
@@ -472,6 +472,8 @@ mbedtls_mpi_uint mbedtls_mpi_core_mla(mbedtls_mpi_uint *d, size_t d_len,
         s_len = d_len;
     }
     size_t excess_len = d_len - s_len;
+
+#if !defined(__OPTIMIZE_SIZE__)
     size_t steps_x8 = s_len / 8;
     size_t steps_x1 = s_len & 7;
 
@@ -480,6 +482,9 @@ mbedtls_mpi_uint mbedtls_mpi_core_mla(mbedtls_mpi_uint *d, size_t d_len,
         MULADDC_X8_CORE
             MULADDC_X8_STOP
     }
+#else
+    size_t steps_x1 = s_len;
+#endif
 
     while (steps_x1--) {
         MULADDC_X1_INIT

This would save 236 bytes when compiled for the TF-M configuration with ArmClang 6.19 and -Oz --target=arm-arm-none-eabi -mcpu=cortex-m33+nodsp

daverodgman · 2023-07-12T08:24:40Z

From a quick run of benchmark ecdsa ecdh, I see that this function is called with the following values for s_len:

l[1] 12793090
l[2] 102634772
l[3] 206657353
l[4] 492777032
l[5] 24844512
l[6] 144860257
l[7] 58882243
l[8] 24257867
l[9] 178687872
l[12] 34680
l[16] 4152
l[17] 27170

i.e., the most common value is 4 - so, if this is representative of real applications, unrolling X8 doesn't benefit the majority of cases and we would probably see better perf as well as smaller code size by reducing the unroll amount to 4 (or maybe 2 or 3).

gilles-peskine-arm · 2023-07-12T08:36:54Z

The distribution of s_len values in benchmark ecdh ecdsa is mostly an artifact of the selection of curves. With set of default curves, there are 2 curves with len=3, 5 curves with len=4, 2 curves with len=6, 1 curve with len=7, 1 curve with len=8, 1 curve with len=9. The distribution is not proportional to the number of curves because different types of curves require different numbers of multiplications, and the number is not linear in the size of the curve even for a given type. Many applications will only use one or two curves and so a more focused distribution of lengths. For example, a build with only secp256r1 will rarely need more than len=4, whereas a build with secp521r1 cares about len=9. Also, RSA lengths are much bigger.

tom-cosgrove-arm · 2023-07-12T08:49:34Z

But

$ programs/test/benchmark rsa

  RSA-2048                 :   25601  public/s
  RSA-2048                 :     540 private/s
  RSA-4096                 :    6577  public/s
  RSA-4096                 :      87 private/s

count[1] = 7726176
count[2] = 449763
count[16] = 87893176
count[17] = 87748530
count[18] = 52832
count[32] = 117512722
count[33] = 109931554
count[34] = 2530528
count[64] = 28220034
count[65] = 24381056
count[66] = 1279680

tom-cosgrove-arm · 2023-07-12T08:51:33Z

There is clearly some worthwhile size optimisation that can be done if we only have 256-bit curves and/or 384-bit curves (picking "most common" and "NIST-recommended" sizes there)

tom-cosgrove-arm · 2023-08-08T08:00:03Z

This really needs some performance tests before we can move forward with it

jlaitine mentioned this issue Jun 11, 2018

Add a configuration option to optimize mpi_mul_hlp for size #1718

Closed

RonEld added enhancement fix available labels Jun 11, 2018

simonbutcher added the component-crypto Crypto primitives and low-level interfaces label Jun 13, 2018

daverodgman removed the fix available label May 13, 2022

gilles-peskine-arm mentioned this issue May 13, 2022

Simplify structure of bignum inline assembly #5706

Merged

aditya-deshpande-arm added the historical-reviewing Currently reviewing (for legacy PR/issues) label Feb 9, 2023

tom-cosgrove-arm added the size-optimisation label Feb 10, 2023

aditya-deshpande-arm added historical-reviewed Reviewed & agreed to keep legacy PR/issue and removed historical-reviewing Currently reviewing (for legacy PR/issues) labels Feb 10, 2023

github-project-automation bot added this to Backlog for Mbed TLS Aug 30, 2024

github-project-automation bot moved this to Code Size Optimisation in Backlog for Mbed TLS Aug 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mpi_mul_hlp code size is huge #1717

mpi_mul_hlp code size is huge #1717

jlaitine commented Jun 11, 2018 •

edited

Loading

jlaitine commented Jun 11, 2018

pkolbus commented Dec 20, 2018

jlaitine commented Dec 21, 2018

gilles-peskine-arm commented May 13, 2022

mpg commented May 16, 2022

tom-cosgrove-arm commented Jul 11, 2023

daverodgman commented Jul 12, 2023

gilles-peskine-arm commented Jul 12, 2023

tom-cosgrove-arm commented Jul 12, 2023

tom-cosgrove-arm commented Jul 12, 2023

tom-cosgrove-arm commented Aug 8, 2023

mpi_mul_hlp code size is huge #1717

mpi_mul_hlp code size is huge #1717

Comments

jlaitine commented Jun 11, 2018 • edited Loading

Description

Enhancement\Feature Request

jlaitine commented Jun 11, 2018

pkolbus commented Dec 20, 2018

jlaitine commented Dec 21, 2018

gilles-peskine-arm commented May 13, 2022

mpg commented May 16, 2022

tom-cosgrove-arm commented Jul 11, 2023

daverodgman commented Jul 12, 2023

gilles-peskine-arm commented Jul 12, 2023

tom-cosgrove-arm commented Jul 12, 2023

tom-cosgrove-arm commented Jul 12, 2023

tom-cosgrove-arm commented Aug 8, 2023

jlaitine commented Jun 11, 2018 •

edited

Loading