Improve bignum speed on ARM with DSP instructions #1618

aurel32 · 2018-05-08T11:26:16Z

Description

These two patches improve the bignum code (used in particular for RSA computation) on ARM CPU which have the DSP extension, that is the ARMv7E-M like the Cortex M4 and Cortex M7, or the ARMv7-A, like the Cortex A series. It has been originally developed for a Cortex M4 application, but I have done all the tests on a Cortex A8.

Status

READY

Requires Backporting

NO (enhancement)

Migrations

If there is any API change, what's the incentive and logic for it.

NO (this is just a different code path on supported CPU)

RonEld · 2018-05-09T07:40:29Z

@aurel32 Thank you for your contribution!

Out of curiosity, do you have information on the performance improvements this PR adds?

n addition, unfortunately our policy is to not accept contributions, without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer.
If this is a personal contribution, the easiest way to do this is if you create an mbed account and accept this click through agreement. Alternatively, you can find a slightly different agreement to sign here, which can be signed and returned to us, and is applicable if you don't want to create an mbed account or alternatively if this is a corporate contribution.
Thanks for your understanding and again, thanks for the contribution!

aurel32 · 2018-05-09T21:29:36Z

Out of curiosity, do you have information on the performance improvements this PR adds?

The speed-up are given in the description of each individual commit, this gives a total of +27% for the whole PR, but on a very specific case (RSA signing operation). I have tried to use the testsuite, and most notably test_suite_rsa, but I have found there is too much variability from one run to another (up to a factor 3). Any suggestion to test the performances is welcome.

In addition, unfortunately our policy is to not accept contributions, without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer.

I have accepted the CLA on os.mbed.com just before sending this pull request.

RonEld · 2018-05-10T06:34:44Z

@aurel32 Thank you for your information, and for signing the CLA!
Have you tried ECC operations as well?

aurel32 · 2018-05-21T21:11:30Z

Sorry about the delay. I took time to rework the patches and to setup a test environment testing ECC signing, RSA signing, RSA encryption and RSA decryption on a Cortex M4 (my original target), a Cortex M7 and a Cortex A8. I have just pushed the new version.

I removed the second patch defining MULADDC_HUIT, as I realized it was giving a small performance penalty for ECC signing. Instead I reworked the ASM constraints and value passing to remove even more instructions.

@RonEld, the ECC signing test shows a small improvement compared to the RSA one. I have included the performance results in the commit message.

The improvements are much higher on the Cortex M than on the Cortex A, likely because there are less memory access penalty. This is especially true on the Cortex M7 as I put the mbedtls heap in TCM memory.

aurel32 · 2018-10-03T12:59:07Z

Any news on that? Do you need me to provide more details?

Patater

The assembly changes look good to me. I had one question about the pre-processor define used.

include/mbedtls/bn_mul.h

The Cortex M4, M7 MCUs and the Cortex A CPUs support the ARM DSP instructions, and especially the umaal instruction which greatly speed up MULADDC code. In addition the patch switched the ASM constraints to registers instead of memory, giving the opportunity for the compiler to load them the best way. The speed improvement is variable depending on the crypto operation and the CPU. Here are the results on a Cortex M4, a Cortex M7 and a Cortex A8. All tests have been done with GCC 6.3 using -O2. RSA uses a RSA-4096 key. ECDSA uses a secp256r1 curve EC key pair. +--------+--------+--------+ | M4 | M7 | A8 | +----------------+--------+--------+--------+ | ECDSA signing | +6.3% | +7.9% | +4.1% | +----------------+--------+--------+--------+ | RSA signing | +43.7% | +68.3% | +26.3% | +----------------+--------+--------+--------+ | RSA encryption | +3.4% | +9.7% | +3.6% | +----------------+--------+--------+--------+ | RSA decryption | +43.0% | +67.8% | +22.8% | +----------------+--------+--------+--------+ I ran the whole testsuite on the Cortex A8 Linux environment, and it all passes.

yanesca

Looks good to me!

aurel32 · 2018-10-04T21:11:20Z

@Patater, @yanesca, thanks for the review. I have just pushed a new version fixing the check for ARM DSP support as pointed by @Patater.

Patater · 2018-10-05T14:24:57Z

retest

aurel32 · 2018-10-11T06:54:45Z

I have seen that both the mbed TLS build & test and the mbed-ci-generic have failed. Unfortunately I am not able to get the details. Would it be possible to get them, so that I can try to provide a fix to the issues found?

RonEld · 2018-10-11T07:13:08Z

The mbed TLS build & test failed on a timing test, which is a known issue in the test.
I will run the tests again

RonEld · 2018-10-11T07:13:20Z

retest

simonbutcher · 2018-10-11T17:27:54Z

retest

simonbutcher · 2018-10-21T15:11:35Z

retest

simonbutcher · 2018-10-24T16:21:52Z

Jenkins passes, approved by @Patater, so ready for merge!

simonbutcher · 2018-10-28T17:31:16Z

Needs Changelog entry to be added at time of merge.

* development: (282 commits) Add ChangeLog entry for PR #1618 - ARM DSP instruction support Detect unsigned integer overflow in mbedtls_ecp_check_budget() Cast number of operations to `uint` in MBEDTLS_ECP_BUDGET Remove merge conflict marker in ssl-opt.sh Fix some documentation typos and improve a comment Fix some typos in documentation and comments Add Jenkinsfile for PR job Adapt ChangeLog Fail when encountering invalid CBC padding in EtM records Add missing return value check in ECDSA test suite Remove yotta support from check-files.py Add a comment to clarify code flow Fix missing dereference. Expand test to ensure no assumption on output Improve readability by moving counter decrement Fix alignment in a macro definition Fix function name to fit conventions Add comment on internal function API Remove unnecessary calls to init() from free() Fix misleading sub-state name and comments ...

zv-io · 2018-11-02T15:42:36Z

After these changes I am having difficulty building it with -mcpu=arm10e on the following targets:

arm-linux-musleabihf
armeb-linux-musleabihf
armel-linux-musleabihf
armv5l-linux-musleabihf
armv7l-linux-musleabihf
armv7r-linux-musleabihf

which are all -mcpu=arm10e -mfloat-abi=hard -mtls-dialect=gnu -marm -march=armv5te+fp

whereas these do work:

arm-linux-musleabi
armeb-linux-musleabi
armel-linux-musleabi
armv7m-linux-musleabi

which are all -mcpu=arm10tdmi -mtls-dialect=gnu -marm -march=armv5t.

In particular, -mcpu=arm10tdmi conflicts with -march=armv5te switch and this processor (-mcpu=arm10tdmi) lacks an FPU, so hard-float configurations are seemingly impossible to use now?

The specific error is:

Error: selected processor does not support `umaal r1,ip,r3,r0' in ARM mode

Some reference materials [PDF]: https://www.digchip.com/datasheets/download_datasheet.php?id=154499&part-number=ARM10200

aurel32 · 2018-11-02T19:25:32Z

Hi @zv-io, sorry for breaking the build on those targets. The intention of the new code was to optimize the muladd macro with DSP instructions when they are available, and leave the old code unchanged when they are not available. I guarded the code with __ARM_FEATURE_DSP, but it happens it's not enough. In your case the ARMv5TE has DSP instructions (that's the meaning of the E), but only in Thumb mode and not in ARM mode. So the issue arise when both -marm and -march=armv5te are used together. The DSP instructions are available in both ARM and Thumb mode in ARMv6 and ARMv7.

I believe it's enough to just disable the use of DSP instructions on ARMv5. I guess the following patch should be enough:

diff --git a/include/mbedtls/bn_mul.h b/include/mbedtls/bn_mul.h
index 0af694c7c..85d1e7172 100644
--- a/include/mbedtls/bn_mul.h
+++ b/include/mbedtls/bn_mul.h
@@ -636,7 +636,7 @@
            "r6", "r7", "r8", "r9", "cc"         \
          );
 
-#elif defined (__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)
+#elif (__ARM_ARCH >= 6) && defined (__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)
 
 #define MULADDC_INIT                            \
     asm(

Can you please give it a try? If it works, I'll open a pull request.

zv-io · 2018-11-02T21:11:35Z

That looks OK. These are from the arm-linux-musleabihf toolchain. I have not yet tested the remaining hard-float ones above but suspect they'll also work.

#define __ARM_ARCH_ISA_ARM 1
#define __ARM_ARCH_5TE__ 1
#define __ARM_ARCH_ISA_THUMB 1
#define __ARM_ARCH 5
...
#define __ARM_FEATURE_QBIT 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_FEATURE_COPROC 7
#define __ARM_FEATURE_DSP 1

With the patch it builds. I'm not too familiar with all of the intricacies/features of ARM processors (specifically w.r.t. what is compatible and not), so I'd definitely recommend making sure you (and ideally others) review whether that patch is sufficient to prevent any regression from targets that may work but haven't been tested.

RonEld added enhancement CLA requested labels May 9, 2018

RonEld added CLA valid and removed CLA requested labels May 10, 2018

simonbutcher changed the title ~~Improve bignum speed on ARM with DSP instrutctions~~ Improve bignum speed on ARM with DSP instructions May 14, 2018

simonbutcher added the component-crypto Crypto primitives and low-level interfaces label May 18, 2018

aurel32 force-pushed the bn_mul-arm-dsp branch from fc04b58 to a17cedf Compare May 21, 2018 21:01

aurel32 mentioned this pull request Jun 12, 2018

Add a configuration option to optimize mpi_mul_hlp for size #1718

Closed

RonEld requested review from Patater and yanesca July 15, 2018 11:29

Patater reviewed Oct 4, 2018

View reviewed changes

include/mbedtls/bn_mul.h Outdated Show resolved Hide resolved

yanesca previously approved these changes Oct 4, 2018

View reviewed changes

aurel32 dismissed yanesca’s stale review via 16b1bd8 October 4, 2018 21:08

aurel32 force-pushed the bn_mul-arm-dsp branch from a17cedf to 16b1bd8 Compare October 4, 2018 21:08

Patater approved these changes Oct 5, 2018

View reviewed changes

yanesca approved these changes Oct 5, 2018

View reviewed changes

Patater added the needs-ci Needs to pass CI tests label Oct 5, 2018

RonEld added needs-ci Needs to pass CI tests and removed needs-ci Needs to pass CI tests labels Oct 11, 2018

simonbutcher added approved Design and code approved - may be waiting for CI or backports and removed needs-ci Needs to pass CI tests labels Oct 24, 2018

simonbutcher merged commit 16b1bd8 into Mbed-TLS:development Oct 29, 2018

simonbutcher added a commit that referenced this pull request Oct 29, 2018

Add ChangeLog entry for PR #1618 - ARM DSP instruction support

4a865ef

aurel32 mentioned this pull request Nov 3, 2018

bn_mul.h: require at least ARMv6 to enable the ARM DSP code #2169

Closed

pkolbus mentioned this pull request Dec 20, 2018

mpi_mul_hlp code size is huge #1717

Open

liudongmiao mentioned this pull request Apr 2, 2019

support asm on armcc5 #2533

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve bignum speed on ARM with DSP instructions #1618

Improve bignum speed on ARM with DSP instructions #1618

aurel32 commented May 8, 2018

RonEld commented May 9, 2018

aurel32 commented May 9, 2018

RonEld commented May 10, 2018

aurel32 commented May 21, 2018

aurel32 commented Oct 3, 2018

Patater left a comment

yanesca left a comment

aurel32 commented Oct 4, 2018

Patater commented Oct 5, 2018

aurel32 commented Oct 11, 2018

RonEld commented Oct 11, 2018

RonEld commented Oct 11, 2018

simonbutcher commented Oct 11, 2018

simonbutcher commented Oct 21, 2018

simonbutcher commented Oct 24, 2018

simonbutcher commented Oct 28, 2018

zv-io commented Nov 2, 2018 •

edited

Loading

aurel32 commented Nov 2, 2018

zv-io commented Nov 2, 2018

Improve bignum speed on ARM with DSP instructions #1618

Improve bignum speed on ARM with DSP instructions #1618

Conversation

aurel32 commented May 8, 2018

Description

Status

Requires Backporting

Migrations

RonEld commented May 9, 2018

aurel32 commented May 9, 2018

RonEld commented May 10, 2018

aurel32 commented May 21, 2018

aurel32 commented Oct 3, 2018

Patater left a comment

Choose a reason for hiding this comment

yanesca left a comment

Choose a reason for hiding this comment

aurel32 commented Oct 4, 2018

Patater commented Oct 5, 2018

aurel32 commented Oct 11, 2018

RonEld commented Oct 11, 2018

RonEld commented Oct 11, 2018

simonbutcher commented Oct 11, 2018

simonbutcher commented Oct 21, 2018

simonbutcher commented Oct 24, 2018

simonbutcher commented Oct 28, 2018

zv-io commented Nov 2, 2018 • edited Loading

aurel32 commented Nov 2, 2018

zv-io commented Nov 2, 2018

zv-io commented Nov 2, 2018 •

edited

Loading