Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpi_mul_hlp code size is huge #1717

Open
jlaitine opened this issue Jun 11, 2018 · 11 comments
Open

mpi_mul_hlp code size is huge #1717

jlaitine opened this issue Jun 11, 2018 · 11 comments
Labels
component-crypto Crypto primitives and low-level interfaces enhancement historical-reviewed Reviewed & agreed to keep legacy PR/issue size-optimisation

Comments

@jlaitine
Copy link

jlaitine commented Jun 11, 2018

Description

  • Type: Enhancement
  • Priority: Minor

mbed TLS build:
Version: 2.10.0 , git commit c041435

Enhancement\Feature Request

Justification - why does the library need this feature?
We are building mbedtls for signature checking for an embedded device (arm cortex core), and found out that the current code in library/bignum.c: mpi_mul_hlp() unrolls a loop big time like this:


    for( ; i >= 16; i -= 16 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }

    for( ; i >= 8; i -= 8 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }
    for( ; i > 0; i-- )
    {
        MULADDC_INIT
        MULADDC_CORE
        MULADDC_STOP
    }

This generates quite big code, no matter of the compiler optimizations. We'd like to have an option to define out all but the last loop, which would then leave the unrolling to the compiler. We can save around 4kB of .text with this fix, which is > 25% of all the code that we need for signature checking.

Suggested enhancement

Something like this:

#else /* MULADDC_HUIT */
+#if !defined(MBEDTLS_SIZE_OPTIMIZED_MULADDC)
    for( ; i >= 16; i -= 16 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }

    for( ; i >= 8; i -= 8 )
    {
        MULADDC_INIT
        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE

        MULADDC_CORE   MULADDC_CORE
        MULADDC_CORE   MULADDC_CORE
        MULADDC_STOP
    }
+#endif
    for( ; i > 0; i-- )
    {
        MULADDC_INIT
        MULADDC_CORE
        MULADDC_STOP
    }
#endif /* MULADDC_HUIT */
@jlaitine
Copy link
Author

I created a pull request showing what I am using locally:
#1718

@pkolbus
Copy link
Contributor

pkolbus commented Dec 20, 2018

Before seeing this issue, I wrote a similar patch as I also have interest for this. Although the benefit is less with #1618, there's still value on Cortex-M4. Are there any plans to move PR #1718 forward? Would you be interested in a PR from me?

@jlaitine
Copy link
Author

The CLA signing process, what would have been needed in order to integrate my patch, got stuck somewhere in the company I work for, so my patch probably won't get integrated.

As you have done the similar thing already before seeing my patch, please do make a pull request, I guess this would benefit many people with memory-constrained systems.

@gilles-peskine-arm
Copy link
Contributor

#5706 has changed the structure of bignum multiplication to have less manual unrolling. For arm-none-eabi-gcc -Os -mthumb -mcpu=cortex-m0plus (7-2018-q3-update), bignum.o (including functions only used for RSA and not for ECC) goes down from 12425 bytes to 11421.

@hanno-arm Is there more to gain there? Or can we consider this issue to be resolved?

@mpg
Copy link
Contributor

mpg commented May 16, 2022

See also #5360 (comment) for some data.

I'm not super inclined to add yet another compile time option, so I'd like to find a compromise that's acceptable for most platforms, and I think 8-1 is a good compromise.

@aditya-deshpande-arm aditya-deshpande-arm added the historical-reviewing Currently reviewing (for legacy PR/issues) label Feb 9, 2023
@aditya-deshpande-arm aditya-deshpande-arm added historical-reviewed Reviewed & agreed to keep legacy PR/issue and removed historical-reviewing Currently reviewing (for legacy PR/issues) labels Feb 10, 2023
@tom-cosgrove-arm
Copy link
Contributor

The code in this area has changed quite a bit, so #1718 was closed, and other changes in #5373 also closed, until eventually the changes in #5706 were merged.

With that, the best change that could be made would be

diff --git a/library/bignum_core.c b/library/bignum_core.c
index de57cfc04..f272a0c41 100644
--- a/library/bignum_core.c
+++ b/library/bignum_core.c
@@ -472,6 +472,8 @@ mbedtls_mpi_uint mbedtls_mpi_core_mla(mbedtls_mpi_uint *d, size_t d_len,
         s_len = d_len;
     }
     size_t excess_len = d_len - s_len;
+
+#if !defined(__OPTIMIZE_SIZE__)
     size_t steps_x8 = s_len / 8;
     size_t steps_x1 = s_len & 7;
 
@@ -480,6 +482,9 @@ mbedtls_mpi_uint mbedtls_mpi_core_mla(mbedtls_mpi_uint *d, size_t d_len,
         MULADDC_X8_CORE
             MULADDC_X8_STOP
     }
+#else
+    size_t steps_x1 = s_len;
+#endif
 
     while (steps_x1--) {
         MULADDC_X1_INIT

This would save 236 bytes when compiled for the TF-M configuration with ArmClang 6.19 and -Oz --target=arm-arm-none-eabi -mcpu=cortex-m33+nodsp

@daverodgman
Copy link
Contributor

From a quick run of benchmark ecdsa ecdh, I see that this function is called with the following values for s_len:

l[1] 12793090
l[2] 102634772
l[3] 206657353
l[4] 492777032
l[5] 24844512
l[6] 144860257
l[7] 58882243
l[8] 24257867
l[9] 178687872
l[12] 34680
l[16] 4152
l[17] 27170

i.e., the most common value is 4 - so, if this is representative of real applications, unrolling X8 doesn't benefit the majority of cases and we would probably see better perf as well as smaller code size by reducing the unroll amount to 4 (or maybe 2 or 3).

@gilles-peskine-arm
Copy link
Contributor

The distribution of s_len values in benchmark ecdh ecdsa is mostly an artifact of the selection of curves. With set of default curves, there are 2 curves with len=3, 5 curves with len=4, 2 curves with len=6, 1 curve with len=7, 1 curve with len=8, 1 curve with len=9. The distribution is not proportional to the number of curves because different types of curves require different numbers of multiplications, and the number is not linear in the size of the curve even for a given type. Many applications will only use one or two curves and so a more focused distribution of lengths. For example, a build with only secp256r1 will rarely need more than len=4, whereas a build with secp521r1 cares about len=9. Also, RSA lengths are much bigger.

@tom-cosgrove-arm
Copy link
Contributor

But

$ programs/test/benchmark rsa

  RSA-2048                 :   25601  public/s
  RSA-2048                 :     540 private/s
  RSA-4096                 :    6577  public/s
  RSA-4096                 :      87 private/s

count[1] = 7726176
count[2] = 449763
count[16] = 87893176
count[17] = 87748530
count[18] = 52832
count[32] = 117512722
count[33] = 109931554
count[34] = 2530528
count[64] = 28220034
count[65] = 24381056
count[66] = 1279680

@tom-cosgrove-arm
Copy link
Contributor

There is clearly some worthwhile size optimisation that can be done if we only have 256-bit curves and/or 384-bit curves (picking "most common" and "NIST-recommended" sizes there)

@tom-cosgrove-arm
Copy link
Contributor

This really needs some performance tests before we can move forward with it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component-crypto Crypto primitives and low-level interfaces enhancement historical-reviewed Reviewed & agreed to keep legacy PR/issue size-optimisation
Projects
Status: Code Size Optimisation
Development

No branches or pull requests

9 participants