Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve bignum speed on ARM with DSP instructions #1618

Merged
merged 1 commit into from
Oct 29, 2018

Conversation

aurel32
Copy link
Contributor

@aurel32 aurel32 commented May 8, 2018

Description

These two patches improve the bignum code (used in particular for RSA computation) on ARM CPU which have the DSP extension, that is the ARMv7E-M like the Cortex M4 and Cortex M7, or the ARMv7-A, like the Cortex A series. It has been originally developed for a Cortex M4 application, but I have done all the tests on a Cortex A8.

Status

READY

Requires Backporting

NO (enhancement)

Migrations

If there is any API change, what's the incentive and logic for it.

NO (this is just a different code path on supported CPU)

@RonEld
Copy link
Contributor

RonEld commented May 9, 2018

@aurel32 Thank you for your contribution!

Out of curiosity, do you have information on the performance improvements this PR adds?

n addition, unfortunately our policy is to not accept contributions, without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer.
If this is a personal contribution, the easiest way to do this is if you create an mbed account and accept this click through agreement. Alternatively, you can find a slightly different agreement to sign here, which can be signed and returned to us, and is applicable if you don't want to create an mbed account or alternatively if this is a corporate contribution.
Thanks for your understanding and again, thanks for the contribution!

@aurel32
Copy link
Contributor Author

aurel32 commented May 9, 2018

Out of curiosity, do you have information on the performance improvements this PR adds?

The speed-up are given in the description of each individual commit, this gives a total of +27% for the whole PR, but on a very specific case (RSA signing operation). I have tried to use the testsuite, and most notably test_suite_rsa, but I have found there is too much variability from one run to another (up to a factor 3). Any suggestion to test the performances is welcome.

In addition, unfortunately our policy is to not accept contributions, without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer.

I have accepted the CLA on os.mbed.com just before sending this pull request.

@RonEld
Copy link
Contributor

RonEld commented May 10, 2018

@aurel32 Thank you for your information, and for signing the CLA!
Have you tried ECC operations as well?

@simonbutcher simonbutcher changed the title Improve bignum speed on ARM with DSP instrutctions Improve bignum speed on ARM with DSP instructions May 14, 2018
@simonbutcher simonbutcher added the component-crypto Crypto primitives and low-level interfaces label May 18, 2018
@aurel32
Copy link
Contributor Author

aurel32 commented May 21, 2018

Sorry about the delay. I took time to rework the patches and to setup a test environment testing ECC signing, RSA signing, RSA encryption and RSA decryption on a Cortex M4 (my original target), a Cortex M7 and a Cortex A8. I have just pushed the new version.

I removed the second patch defining MULADDC_HUIT, as I realized it was giving a small performance penalty for ECC signing. Instead I reworked the ASM constraints and value passing to remove even more instructions.

@RonEld, the ECC signing test shows a small improvement compared to the RSA one. I have included the performance results in the commit message.

The improvements are much higher on the Cortex M than on the Cortex A, likely because there are less memory access penalty. This is especially true on the Cortex M7 as I put the mbedtls heap in TCM memory.

@aurel32
Copy link
Contributor Author

aurel32 commented Oct 3, 2018

Any news on that? Do you need me to provide more details?

Copy link
Contributor

@Patater Patater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assembly changes look good to me. I had one question about the pre-processor define used.

include/mbedtls/bn_mul.h Outdated Show resolved Hide resolved
The Cortex M4, M7 MCUs and the Cortex A CPUs support the ARM DSP
instructions, and especially the umaal instruction which greatly
speed up MULADDC code. In addition the patch switched the ASM
constraints to registers instead of memory, giving the opportunity
for the compiler to load them the best way.

The speed improvement is variable depending on the crypto operation
and the CPU. Here are the results on a Cortex M4, a Cortex M7 and a
Cortex A8. All tests have been done with GCC 6.3 using -O2. RSA uses a
RSA-4096 key. ECDSA uses a secp256r1 curve EC key pair.

                 +--------+--------+--------+
                 |   M4   |   M7   |   A8   |
+----------------+--------+--------+--------+
| ECDSA signing  |  +6.3% |  +7.9% |  +4.1% |
+----------------+--------+--------+--------+
| RSA signing    | +43.7% | +68.3% | +26.3% |
+----------------+--------+--------+--------+
| RSA encryption |  +3.4% |  +9.7% |  +3.6% |
+----------------+--------+--------+--------+
| RSA decryption | +43.0% | +67.8% | +22.8% |
+----------------+--------+--------+--------+

I ran the whole testsuite on the Cortex A8 Linux environment, and it
all passes.
yanesca
yanesca previously approved these changes Oct 4, 2018
Copy link
Contributor

@yanesca yanesca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@aurel32
Copy link
Contributor Author

aurel32 commented Oct 4, 2018

@Patater, @yanesca, thanks for the review. I have just pushed a new version fixing the check for ARM DSP support as pointed by @Patater.

@Patater
Copy link
Contributor

Patater commented Oct 5, 2018

retest

@Patater Patater added the needs-ci Needs to pass CI tests label Oct 5, 2018
@aurel32
Copy link
Contributor Author

aurel32 commented Oct 11, 2018

I have seen that both the mbed TLS build & test and the mbed-ci-generic have failed. Unfortunately I am not able to get the details. Would it be possible to get them, so that I can try to provide a fix to the issues found?

@RonEld
Copy link
Contributor

RonEld commented Oct 11, 2018

The mbed TLS build & test failed on a timing test, which is a known issue in the test.
I will run the tests again

@RonEld
Copy link
Contributor

RonEld commented Oct 11, 2018

retest

@RonEld RonEld added needs-ci Needs to pass CI tests and removed needs-ci Needs to pass CI tests labels Oct 11, 2018
@simonbutcher
Copy link
Contributor

retest

1 similar comment
@simonbutcher
Copy link
Contributor

retest

@simonbutcher
Copy link
Contributor

Jenkins passes, approved by @Patater, so ready for merge!

@simonbutcher simonbutcher added approved Design and code approved - may be waiting for CI or backports and removed needs-ci Needs to pass CI tests labels Oct 24, 2018
@simonbutcher
Copy link
Contributor

Needs Changelog entry to be added at time of merge.

@simonbutcher simonbutcher merged commit 16b1bd8 into Mbed-TLS:development Oct 29, 2018
mpg added a commit that referenced this pull request Oct 30, 2018
* development: (282 commits)
  Add ChangeLog entry for PR #1618 - ARM DSP instruction support
  Detect unsigned integer overflow in mbedtls_ecp_check_budget()
  Cast number of operations to `uint` in MBEDTLS_ECP_BUDGET
  Remove merge conflict marker in ssl-opt.sh
  Fix some documentation typos and improve a comment
  Fix some typos in documentation and comments
  Add Jenkinsfile for PR job
  Adapt ChangeLog
  Fail when encountering invalid CBC padding in EtM records
  Add missing return value check in ECDSA test suite
  Remove yotta support from check-files.py
  Add a comment to clarify code flow
  Fix missing dereference.
  Expand test to ensure no assumption on output
  Improve readability by moving counter decrement
  Fix alignment in a macro definition
  Fix function name to fit conventions
  Add comment on internal function API
  Remove unnecessary calls to init() from free()
  Fix misleading sub-state name and comments
  ...
@zv-io
Copy link
Contributor

zv-io commented Nov 2, 2018

After these changes I am having difficulty building it with -mcpu=arm10e on the following targets:

  • arm-linux-musleabihf
  • armeb-linux-musleabihf
  • armel-linux-musleabihf
  • armv5l-linux-musleabihf
  • armv7l-linux-musleabihf
  • armv7r-linux-musleabihf

which are all -mcpu=arm10e -mfloat-abi=hard -mtls-dialect=gnu -marm -march=armv5te+fp

whereas these do work:

  • arm-linux-musleabi
  • armeb-linux-musleabi
  • armel-linux-musleabi
  • armv7m-linux-musleabi

which are all -mcpu=arm10tdmi -mtls-dialect=gnu -marm -march=armv5t.

In particular, -mcpu=arm10tdmi conflicts with -march=armv5te switch and this processor (-mcpu=arm10tdmi) lacks an FPU, so hard-float configurations are seemingly impossible to use now?

The specific error is:

Error: selected processor does not support `umaal r1,ip,r3,r0' in ARM mode

Some reference materials [PDF]: https://www.digchip.com/datasheets/download_datasheet.php?id=154499&part-number=ARM10200

@aurel32
Copy link
Contributor Author

aurel32 commented Nov 2, 2018

Hi @zv-io, sorry for breaking the build on those targets. The intention of the new code was to optimize the muladd macro with DSP instructions when they are available, and leave the old code unchanged when they are not available. I guarded the code with __ARM_FEATURE_DSP, but it happens it's not enough. In your case the ARMv5TE has DSP instructions (that's the meaning of the E), but only in Thumb mode and not in ARM mode. So the issue arise when both -marm and -march=armv5te are used together. The DSP instructions are available in both ARM and Thumb mode in ARMv6 and ARMv7.

I believe it's enough to just disable the use of DSP instructions on ARMv5. I guess the following patch should be enough:

diff --git a/include/mbedtls/bn_mul.h b/include/mbedtls/bn_mul.h
index 0af694c7c..85d1e7172 100644
--- a/include/mbedtls/bn_mul.h
+++ b/include/mbedtls/bn_mul.h
@@ -636,7 +636,7 @@
            "r6", "r7", "r8", "r9", "cc"         \
          );
 
-#elif defined (__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)
+#elif (__ARM_ARCH >= 6) && defined (__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)
 
 #define MULADDC_INIT                            \
     asm(

Can you please give it a try? If it works, I'll open a pull request.

@zv-io
Copy link
Contributor

zv-io commented Nov 2, 2018

That looks OK. These are from the arm-linux-musleabihf toolchain. I have not yet tested the remaining hard-float ones above but suspect they'll also work.

#define __ARM_ARCH_ISA_ARM 1
#define __ARM_ARCH_5TE__ 1
#define __ARM_ARCH_ISA_THUMB 1
#define __ARM_ARCH 5
...
#define __ARM_FEATURE_QBIT 1
#define __ARM_FEATURE_CLZ 1
#define __ARM_FEATURE_COPROC 7
#define __ARM_FEATURE_DSP 1

With the patch it builds. I'm not too familiar with all of the intricacies/features of ARM processors (specifically w.r.t. what is compatible and not), so I'd definitely recommend making sure you (and ideally others) review whether that patch is sufficient to prevent any regression from targets that may work but haven't been tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports component-crypto Crypto primitives and low-level interfaces enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants