Improve bignum speed on ARM with DSP instructions #1618
@aurel32 Thank you for your contribution! Out of curiosity, do you have information on the performance improvements this PR adds? In addition, unfortunately our policy is to not accept contributions without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer. |
The speed-ups are given in the description of each individual commit. This gives a total of +27% for the whole PR, but in a very specific case (RSA signing operation). I have tried to use the testsuite, most notably test_suite_rsa, but I have found there is too much variability from one run to another (up to a factor of 3). Any suggestion on how to test the performance is welcome.
I have accepted the CLA on os.mbed.com just before sending this pull request. |
@aurel32 Thank you for your information, and for signing the CLA! |
Sorry about the delay. I took time to rework the patches and to set up a test environment covering ECC signing, RSA signing, RSA encryption and RSA decryption on a Cortex M4 (my original target), a Cortex M7 and a Cortex A8. I have just pushed the new version. I removed the second patch defining @RonEld, the ECC signing test shows a small improvement compared to the RSA one. I have included the performance results in the commit message. The improvements are much higher on the Cortex M than on the Cortex A, likely because the memory access penalty is lower there. This is especially true on the Cortex M7 as I put the mbedtls heap in TCM memory. |
Any news on that? Do you need me to provide more details? |
The assembly changes look good to me. I had one question about the pre-processor define used.
The Cortex M4, M7 MCUs and the Cortex A CPUs support the ARM DSP instructions, and especially the umaal instruction which greatly speeds up MULADDC code. In addition the patch switched the ASM constraints to registers instead of memory, giving the opportunity for the compiler to load them the best way.

The speed improvement is variable depending on the crypto operation and the CPU. Here are the results on a Cortex M4, a Cortex M7 and a Cortex A8. All tests have been done with GCC 6.3 using -O2. RSA uses an RSA-4096 key. ECDSA uses a secp256r1 curve EC key pair.

                 +--------+--------+--------+
                 |   M4   |   M7   |   A8   |
+----------------+--------+--------+--------+
| ECDSA signing  |  +6.3% |  +7.9% |  +4.1% |
+----------------+--------+--------+--------+
| RSA signing    | +43.7% | +68.3% | +26.3% |
+----------------+--------+--------+--------+
| RSA encryption |  +3.4% |  +9.7% |  +3.6% |
+----------------+--------+--------+--------+
| RSA decryption | +43.0% | +67.8% | +22.8% |
+----------------+--------+--------+--------+

I ran the whole testsuite on the Cortex A8 Linux environment, and it all passes.
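For readers unfamiliar with umaal: the instruction multiplies two 32-bit registers and adds two further 32-bit values to the 64-bit product, which is one step of a multiply-accumulate loop with carry. The following is a portable C model of that arithmetic, not the mbedtls MULADDC macros themselves. The names `umaal_model`, `mul_add_words` and `umaal_model_check` are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of what UMAAL computes: a * b + acc_lo + acc_hi.
 * The sum always fits in 64 bits, because
 * (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1. */
uint64_t umaal_model(uint32_t acc_lo, uint32_t acc_hi, uint32_t a, uint32_t b)
{
    return (uint64_t)a * b + acc_lo + acc_hi;
}

/* d[0..n-1] += s[0..n-1] * b, returning the carry out of the top word.
 * (Illustrative helper; name and signature are assumptions of this sketch.) */
uint32_t mul_add_words(uint32_t *d, const uint32_t *s, size_t n, uint32_t b)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = umaal_model(d[i], carry, s[i], b);
        d[i] = (uint32_t)t;          /* low word stays in place */
        carry = (uint32_t)(t >> 32); /* high word feeds the next step */
    }
    return carry;
}

/* Worst case: every input at its maximum still does not overflow. */
void umaal_model_check(void)
{
    assert(umaal_model(UINT32_MAX, UINT32_MAX, UINT32_MAX, UINT32_MAX)
           == UINT64_MAX);
}
```

Because the accumulate and the carry both fit in the single 64-bit result, each loop step needs no separate add-with-carry sequence, which is presumably where much of the RSA speed-up reported above comes from.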
Looks good to me!
a17cedf to 16b1bd8
retest |
I have seen that both the |
The |
retest |
retest |
retest |
Jenkins passes, approved by @Patater, so ready for merge! |
Needs Changelog entry to be added at time of merge. |
* development: (282 commits)
  Add ChangeLog entry for PR #1618 - ARM DSP instruction support
  Detect unsigned integer overflow in mbedtls_ecp_check_budget()
  Cast number of operations to `uint` in MBEDTLS_ECP_BUDGET
  Remove merge conflict marker in ssl-opt.sh
  Fix some documentation typos and improve a comment
  Fix some typos in documentation and comments
  Add Jenkinsfile for PR job
  Adapt ChangeLog
  Fail when encountering invalid CBC padding in EtM records
  Add missing return value check in ECDSA test suite
  Remove yotta support from check-files.py
  Add a comment to clarify code flow
  Fix missing dereference.
  Expand test to ensure no assumption on output
  Improve readability by moving counter decrement
  Fix alignment in a macro definition
  Fix function name to fit conventions
  Add comment on internal function API
  Remove unnecessary calls to init() from free()
  Fix misleading sub-state name and comments
  ...
After these changes I am having difficulty building it with
which are all whereas these do work:
which are all In particular, The specific error is:
Some reference materials [PDF]: https://www.digchip.com/datasheets/download_datasheet.php?id=154499&part-number=ARM10200 |
Hi @zv-io, sorry for breaking the build on those targets. The intention of the new code was to optimize the muladd macro with DSP instructions when they are available, and leave the old code unchanged when they are not available. I guarded the code with I believe it's enough to just disable the use of DSP instructions on ARMv5. I guess the following patch should be enough:
Can you please give it a try? If it works, I'll open a pull request. |
That looks OK. These are from the
With the patch it builds. I'm not too familiar with all of the intricacies/features of ARM processors (specifically w.r.t. what is compatible and not), so I'd definitely recommend making sure you (and ideally others) review whether that patch is sufficient to prevent any regression from targets that may work but haven't been tested. |
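For reviewers weighing whether such a guard is sufficient, here is a minimal sketch of one way to gate a umaal-based step at compile time. It is not the patch discussed above, which is not reproduced in this thread. It assumes the problem is that ARMv5TE toolchains define the ACLE macro `__ARM_FEATURE_DSP` while umaal itself only exists from ARMv6 onwards, and that the compiler defines `__ARM_ARCH`. The names `EXAMPLE_HAVE_UMAAL`, `muladd_step` and `muladd_step_check` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature macro: use UMAAL only when the target is ARMv6 or
 * later AND advertises the DSP extension. ARMv5TE sets __ARM_FEATURE_DSP
 * but lacks UMAAL, so the architecture check excludes it. */
#if defined(__GNUC__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 6) && \
    defined(__ARM_FEATURE_DSP) && (__ARM_FEATURE_DSP == 1)
#define EXAMPLE_HAVE_UMAAL 1
#else
#define EXAMPLE_HAVE_UMAAL 0
#endif

/* One multiply-accumulate step: (*hi:*lo) = a * b + *lo + *hi. */
void muladd_step(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
#if EXAMPLE_HAVE_UMAAL
    uint32_t l = *lo, h = *hi;
    /* Register constraints let the compiler choose how to load operands. */
    __asm__("umaal %0, %1, %2, %3" : "+r"(l), "+r"(h) : "r"(a), "r"(b));
    *lo = l;
    *hi = h;
#else
    /* Portable fallback for targets without UMAAL (including non-ARM). */
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
#endif
}

/* Both code paths must agree on the all-ones worst case. */
void muladd_step_check(void)
{
    uint32_t lo = UINT32_MAX, hi = UINT32_MAX;
    muladd_step(&lo, &hi, UINT32_MAX, UINT32_MAX);
    assert(lo == UINT32_MAX && hi == UINT32_MAX);
}
```

Keeping a portable fallback behind the same function means untested targets degrade to the plain C path rather than failing to assemble, which addresses the regression concern raised above.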
Description
These two patches improve the bignum code (used in particular for RSA computation) on ARM CPUs that have the DSP extension, that is ARMv7E-M cores like the Cortex M4 and Cortex M7, or ARMv7-A cores like the Cortex A series. It was originally developed for a Cortex M4 application, but I have done all the tests on a Cortex A8.
Status
READY
Requires Backporting
NO (enhancement)
Migrations
If there is any API change, what's the incentive and logic for it.
NO (this is just a different code path on supported CPU)