Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify improvements in baremetal branch that should be upstreamed #3535

Open
mpg opened this issue Jul 31, 2020 · 11 comments
Open

Identify improvements in baremetal branch that should be upstreamed #3535

mpg opened this issue Jul 31, 2020 · 11 comments

Comments

@mpg
Copy link
Contributor

mpg commented Jul 31, 2020

The baremetal branch of Mbed TLS includes changes, features and options that allow to reduce the size of the code and/or the RAM usage of the library. Most of those would be of general interest to people who use Mbed TLS in highly constrained environments and would be suitable for upstreaming; while some of them are either too specific, or would conflict with or be made redundant by future development (such as the move the PSA Crypto).

This task is to:

  • go over the history of the baremetal branch to identify all changes that are good candidates for upstreaming
  • create two epics: "RAM improvements" and "Code size improvements" and start populating them with tasks based on the identified changes (each task should include a reference to the original PR(s) and any other relevant information)

This task does not include doing any of the actual side-porting/upstreaming - this work will need to be planned later from the above epics.

@mpg mpg added this to the August 2020 Sprint milestone Jul 31, 2020
@mpg mpg self-assigned this Jul 31, 2020
@mpg
Copy link
Contributor Author

mpg commented Aug 27, 2020

I used this to list all PRs that have been merged in the baremetal branch:

git log --reverse --merges --format='%s' mbedtls-2.16..baremetal |
    sed '/mbedtls-2.16. into/d
        /baremetal. into/d
        s/.* pull request #\([0-9]*\) from .*/\1/g
        s/.* remote-tracking branch [^ ]*\/\([0-9]*\). into .*/\1/
        s#^[0-9][0-9][0-9]$#https://github.com/ARMmbed/mbedtls-restricted/pull/&#
        s#^[0-9]*$#https://github.com/ARMmbed/mbedtls/pull/&#
        '

The output consists of 151 PRs. I did a first pass over that to annotate them with:

  • desirability for upstream: yes / no / maybe / done
  • type of change: RAM / code size / tests / measurements / bugfix / other
  • general area: crypto / x509 / ssl - to be refined later for PRs of interest

I'm attaching the resulting prs-baremetal.txt file for reference, but I don't suggest anyone reviews it, as this would be tedious and provide low value.

I'll report the results by category in follow-up comments, making use of the following script:

for want in NO DONE MAYBE YES; do
    printf "\n%s\n" $want
    sed -n "s/.*$want - //p" prs-baremetal.txt |
        sort | uniq -c | sort -rn |
        awk '{t += $1; print} END { print "total:", t }'
done

@mpg
Copy link
Contributor Author

mpg commented Aug 28, 2020

Rejected PRs

There are 67 PRs in the baremetal branch I'm suggesting to reject as out-of-scope or inapplicable, which fall in the following categories:

  • Fault Injection (25 PRs): out of scope, currently this is well outside our threat model. (Also, the protection in the baremetal branch is very partial, as it only covers the small subset of the code base that's of interest in that branch.)
  • Use tinycrypt (13 PRs): not applicable, if we did something similar upstream we'd probably do it very differently (based on PSA crypto API for example), and it's not clear we'd want to use tinycrypt anyway.
  • Side-channel attacks (13 PRs): out of scope: currently this is outside our threat model, especially for symmetric crypto. We might revisit this later, but even some counter-measures were IMO a bit hackish (dummy rounds in AES) so might not be suitable for the main branch.
  • Regular merges from 2.16 (12 PRs): not applicable
  • Follow-ups (3 PRs): not applicable, fixes for issues introduced by other rejected PRs
  • project-specific (1 PR): not applicable, adapting to project-specific infrastructure

@mpg
Copy link
Contributor Author

mpg commented Aug 28, 2020

PRs that were already upstreamed

I've identified 28 PRs that don't need upstreaming, either because it was already done, or because the PR is actually a side-port from something that was first done upstream. For information, they fall in the following categories:

  • SSL context serialisation (9 PRs), (D)TLS Connection ID extension (5 PRs) and the (D)TLS record checking (3PRs) features were upstreamed immediately as they're required server-side, and the the server uses the main branch, no the baremetal one.
  • Various bug fixes (5 PRs) were side-ported from development to the baremetal branch.
  • X.509 RAM improvements started before the baremetal branch was created, so the first 4 PRs (including 1 test PR) here are already upstream (only the last one isn't, see below)
  • 2 PRs with code size improvements first came to the development branch as community contributions and were then side-ported to the baremetal branch.

@mpg
Copy link
Contributor Author

mpg commented Aug 28, 2020

PRs that we may or may not want to upstream

There are 20 PRs here, that I'll group in 3 series.

1. Measurements: There are 16 PRs that relate directly or indirectly to measuring the size of Mbed TLS in the configuration of interest to the baremetal branch.

  • A new reference configuration configs/baremetal.h is introduced (and progressively augmented to include all the config options that reduce code size as they are introduced), with a variant that adds convenience things for runtime tests (configs/baremetal_test.h).
  • A new script scripts/baremetal.sh gives figures for code size (of the library, ie before linker garbage collection) with the based config and RAM usage (of ssl_client2, based on valgrind's massif tool) with the variant for runtime tests.
  • A new component is added to all.sh to run tests with this config (actually the slightly modified one), and changes are made to ssl-opt.sh so that it can be run entirely with that config and have decent coverage (which includes a new familiy of test certificates using only the P-256 curve).

Reason(s) against upstreaming this: the config was designed with a specific use case in mind.

Reasons for upsteaming this include:

  • while written with a specific use case in mind, the config is quite representative of most minimal modern configs
  • if you want to improve something, you need to start by measuring it - this script is a significant improvement over our existing measurements
  • the test improvements (especially in ssl-opt.sh) are good to have, and strongly interesct with things we've been meaning to do for a long time

Foreseeable difficulties in upstreaming (or indeed not upstreaming) this: it's probably going to be hard to upsteam all PRs in this series at once, as later PRs in this series will depend on other code size PRs (introducing new config.h) options. Conversely, not upstreaming this series is likely to create issue when upstreaming some code size PRs that update configs/baremetal.h or rely on improvements in this series for testing.

Recommendation: I'm quite in favour of upstreaming this, but I'd like to know what others think.

List of PRs in this series:
#2524
https://github.com/ARMmbed/mbedtls-restricted/pull/587
https://github.com/ARMmbed/mbedtls-restricted/pull/588
https://github.com/ARMmbed/mbedtls-restricted/pull/595
https://github.com/ARMmbed/mbedtls-restricted/pull/607
https://github.com/ARMmbed/mbedtls-restricted/pull/626
https://github.com/ARMmbed/mbedtls-restricted/pull/632
https://github.com/ARMmbed/mbedtls-restricted/pull/643
https://github.com/ARMmbed/mbedtls-restricted/pull/646
https://github.com/ARMmbed/mbedtls-restricted/pull/655
#2849
#2867
#2850
#2892
#2910
#2978

2. Code size: reduce cost of the MD and PK layers when a single alg is available

Overview: make the MD layer zero-cost (in terms of code-size and runtime indirection) when a config.h option signals that a single hash is used. Concretely, if only SHA-256 is used, a call to mbedtls_md_xxx() from say X509 or TLS will be replaced by the equivalent call to mbedtls_sha256_xxx() at compile-time. Start in the same direction in the PK layer, but not going as far (call to mbedtls_pk_xxx() replaced by the xxx_wrap() function). See PR descriptions for more details.

Reasons for upstreaming: who doesn't want to save hundreds of bytes? (probably more than 1K if we complete the work in the PK layer (see PR description) and do something similar in the Cipher layer as well)

Reasons against upstreaming:

  • The MD, PK, and Cipher layers are on there way out, with PSA Crypto API being the future.
  • Those are large changes, not easy to review
  • Indirection is increased, and pre-processor cleverness applied, which might reduce readability of the resulting code
  • Lots of things end up living in public header files (how else could the compiler resolve and inline everything?) which is kind of contrary to our general goal of reducing our API surface.

Recommendation: I'm inclined not to upstream this as-is, but perhaps study it to see if some of the ideas could be applied in the implementation of the PSA Crypto API without compromising other goals. (Note: the PK PR is probably easier to review, but the MD PR goes further, as explained in the PK PR's description.)

3. Code size: Make all of X.509 as single compilation unit

Overview: turn all of the X509 library into a single translation unit, by #includeing other .c files from it; make a lot more functions static.

Reasons for upstreaming: who doesn't want to save hundreds or bytes, especially by just letting the compiler to more work for us?

Reasons against upstreaming: the way it's done is not very clean and make cause issues for people importing Mbed TLS in another project (as demonstrated by the Mbed OS issue).

Recommendation: I don't think we should upstream as-is, but we should definitely looks for ways to achieve the same result in a cleaner way. The savings are quite significant (and making more functions static reduce our ABI surface as a bonus).

@mpg
Copy link
Contributor Author

mpg commented Aug 28, 2020

PRs suitable for upstreaming - outliers

Most remaining PRs that are suitable for upstreaming are in the "code size" category (some of them also save a bit of RAM), except two that I'll mention now to get them out of the way. I'll complete my review of code size PRs on Monday.

RAM: X.509 parsing on-demand: This was started by Hanno as a PR in the development branch, then side-ported to baremetal where it was finalized and merged, and the original development PR was never merged. It now needs rebasing for conflict resolution (which Hanno is prepared to do) and review (which needs resourcing as the PR is large en complex, but Manuel already reviewed it on the baremetal side, so could act as a first reviewer).

Misc: remove an unnecessary cast in X.509
#3509 - one-line change, why not clean it up in development as well?

@yanesca
Copy link
Contributor

yanesca commented Aug 28, 2020

For categories 2 and 3 I feel the same way as you do: I don't think it is worth upstreaming them, but we can learn from them, when we are implementing similar improvements.

For category 1 PRs I was against upstreaming at first, because the sheer number of PRs was intimidating. Taking a closer look showed that most of them are really tiny or consist mainly test data (certificates). There are maybe 2 or 3 medium sized PRs that would take longer to review.

I very much would like to see some code size measurements in our tests: from time to time we do code size optimisations, but without measuring and tracking it all that effort just falls in the category of premature optimisation.

I have some concern about taking that somewhat arbitrary configuration as the standard for our measurements, but as you say it is fairly reasonable configuration and I think it is a good starting point.

In summary, I am for upstreaming the category 1 PRs as well.

@mpg
Copy link
Contributor Author

mpg commented Aug 31, 2020

PRs suitable for upstreaming: code size

This is the bulk of upstreamable PRs. The numbers at the end are the impact on code size with ARM-GCC (as measured with scripts/baremetal.h, see the "measurements" series above.

Note: I've tried to group PRs than are simple fixes or follow-up of another PR with the original. I've also tried to document dependencies but I might have missed a few. Same for possible API/ABI break (I might also have flagged some that are actually not problematic, so the flag just means "needs attention/discussion").

New config.h options to make more code optional

We already have ways to configure many things out of the build, but it turned out that we could use more of those.

Hardcoding of SSL options

The second PR in this series introduces a new concept, used by all subsequent PRs: whenever we have something in SSL that can be configured at run-time (eg: list of allowed algorithms, use of extensions, protocol versions, etc.), if we know at compile-time what that option will be, we can say so in config.h and then remove the code that allowed to set this option and negotiate it with the peer. This is done via a set of new MBEDTLS_SSL_CONF_xxx options introduced by PRs in this series.

Note: all PRs in this series depend the first two: more precisely, they all depend on r592 which depends on r590.

  • r590 [preparatory] option to enforce EMS (session hash) extension +136
  • r592 groundwork: hardcode use of EMS (session hash) extension -126
  • r594 various options -108 -72 -76 -16 -136/120 -32 -104
  • r596 callbacks (RNG, IO, timer): -64 -92 -28
  • r599 [preparatory] save dynamic alloc of supported curves -76 (prep for: r606 r602)
  • r606 ciphersuite (dep: r599) -1246
  • r593 + r627 (fixup): version range -712 (fixed by r627)
  • r602 + r634 (fixup): elliptic curve (dep: r599) (fixed by: r602) -268
  • r604 signature hash -178
  • r628 add tests and fixes (dep: r627 r626) (note: r626 is part of measurements)

Misc (SSL)

  • r586 remember peer CRT only if renego enabled -308
  • r610 delay sending alerts -148
  • r612 rm compression field if disabled (API break? 3.0?) -24
  • r603 rm debug pointers from structure if debug compiled out (API break? 3.0?)
  • r633 rm minlen from transform structure (internal) -20
  • r641 rm ssl_calc_xxx function pointers (internal) -108
  • r635 adapt version encoding in DTLS-only builds (dep: r656) -68

Misc (X.509)

Note: the two PRs in this series both depend on r584, which is X.509 on-demand parsing (part of RAM optimisation work) (upstream: #2478) - mentioned in the previous comment

Note: since those two PRs are about removing unused fields of technically-public structures in some configurations, API/ABI compatibility questions need to be checked before upstreaming, or we might choose to include this in 3.0 work.

  • r589 x509 misc: rm time code if !TIME_DATE -88
  • r605 x509 misc: rm struct fields unused depending on config.h (dep: r589) -440

Misc (other)

  • r649 misc crypto/x509/ssl: optimize structures !ABI break! -344
  • r657 misc crypto/ssl: use shared functions for char/uint conversions -176 -118

@mpg
Copy link
Contributor Author

mpg commented Aug 31, 2020

Code-size saving PRs sorted by bytes saved

The table below lists PRs that have code-size savings mentioned in their description, sorted from larger to smallest savings. For each change, only the "base" PR is listed (not follow-ups for bugfixes and tests, nor preparatory work). Some PRs making multiple changes, the total code size savings of all changes in that PR is used.

Note: excluded from this table is the change with the largest impact (use TinyCrypt to replace our ECC implementation, including bignum), because I don't think it's suitable for upstreaming. Its exact impact is harder to measure (a large part of bignum would be removed by the linker anyway) but was likely of the order of 4 to 8 KiB.

Bytes saved main PR group short description
2384 #2891 config.h crypto: aes: option for encrypt only
1438 r601 config.h ssl: session resumption
1246 r606 ssl-conf ciphersuite
1192 r585 config.h ssl: DTLS-only builds
712 r593 ssl-conf version range
696 r652 MAYBE-x509 single translation unit
652 r642 config.h x509/ssl: hostname verification
528 r594 ssl-conf various options
485 r654 MAYBE-layers MD layer
476 #2835 MAYBE-layers PK layer
440 r605 x509 misc rm struct fields unused depending on config.h
344 r649 misc mixed optimize structures
308 r586 ssl misc remember peer CRT only if renego enabled
294 r657 misc mixed use shared functions for char/uint conversions
268 r602 ssl-conf elliptic curve
268 #2890 config.h crypto: aes: option for 128-bit keys only
220 r609 config.h x509/ssl: verification callbacks
188 r621 config.h crypto: SHA-256 without SHA-224 #6784
184 r596 ssl-conf callbacks (RNG, IO, timer)
178 r604 ssl-conf signature hash
148 r610 ssl misc delay sending alerts
126 r592 ssl-conf groundwork: hardcode use of EMS extension
108 r641 ssl misc rm ssl_calc_xxx function pointers
88 r589 x509 misc rm time code if !TIME_DATE
76 r599 ssl-conf prep: save dynamic alloc of supported curves
68 r635 ssl misc adapt version encoding in DTLS-only builds
24 r612 ssl misc rm compression field if disabled
20 r633 ssl misc rm minlen from transform structure

Total savings from items in the "YES" category: 11502 bytes.
Additional savings from items in the "MAYBE" category: 1657 bytes.

@gilles-peskine-arm
Copy link
Contributor

I learned about the impact of structure field ordering on code size from the work on the baremetal branch., yet I couldn't find something corresponding to #4747 in this list. Did I miss it?

@hanno-becker
Copy link

hanno-becker commented Jul 2, 2021

@gilles-peskine-arm IIRC it was r649. The logic applied there, however, was not to move commonly used fields to the top, but rather sort them in size from smallest to biggest. This helps because the maximum immediate offset is 127 * the size of the access, so 32-bit accesses have a large max. offset than byte accesses. E.g. if you have a struct with 128 byte fields, 64 halfword fields and 32 word fields, you can access all of them from the base of the structure through an immediate offset if and only if you order them in precisely this way.

It might be worth revisiting whether the logic you apply might be more suitable in places.

@mpg
Copy link
Contributor Author

mpg commented Feb 9, 2023

Note: since this study was done, one of the changes listed above has been made independently in development:

188 r621 config.h crypto: SHA-256 without SHA-224

This was done in #6784 as part of cleaning up before driver-only hashes work.

(I didn't check if other changes have happened, I just noticed this one.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants