Pocketcrypt
===========

Pocketcrypt is a tiny legacy-free cryptographic library providing duplex
constructions using Gimli or Xoodoo, together with X25519 for key exchange
and Schnorr signatures. Eschewing interoperability with the standard museum
of primitives and protocols, it offers concise, easily-understood code that
avoids the ugly boilerplate and obfuscation of larger libraries.

Beware! This early version of Pocketcrypt is neither formally audited nor
officially released. Safely composing these low-level primitives requires
cryptographic expertise. Gimli and Xoodoo are also relatively new and still
under active cryptanalytic scrutiny. Please review the relevant literature
and audit the implementation carefully before considering this library.


Duplex
======

duplex.h is a fast, architecture-independent implementation of the Gimli and
Xoodoo permutations using gcc/clang vector extensions, together with
associated duplex operations.

To use this header-only library, copy duplex.h into your tree. There are no
link-time dependencies.


Initialising and permuting state
--------------------------------

Define duplex_permute as duplex_gimli or duplex_xoodoo before including
duplex.h to explicitly select a permutation. The default is duplex_xoodoo.

Both permutations operate over a state of twelve 32-bit words, manipulated
here as a vector array uint32x4_t[3]. When individual bytes of the state are
required, the uint32_t words are accessed little-endian, ensuring results
are independent of platform byte order.

Internally, higher-level duplex operations absorb and squeeze bytes in
16-byte chunks corresponding to the permutation rate. By maintaining a byte
counter alongside the duplex state, operations can also seamlessly absorb,
squeeze, encrypt and decrypt partial chunks. The combined state and counter
are stored as duplex_t, a type alias for uint32x4_t[4].

For convenience, duplex.h defines duplex_rate as 16 (the permutation rate)
and duplex_size as sizeof(duplex_t) to help avoid magic numbers for nonce,
tag and state sizes in client code.

Initialise a duplex state including the byte counter with

  duplex_t state = { 0 };

Use duplex_counter(state) to dereference the uint64_t counter as an lvalue.

To directly permute the state, use

  duplex_permute(state);

Typically this is bundled into the higher-level operations.


Absorbing and squeezing data
----------------------------

Absorb a buffer of bytes into a duplex state with

  duplex_absorb(state, data, length);

Each rate-sized chunk is absorbed in turn, then the state is permuted before
continuing. Once any final partial chunk is absorbed, the counter will
advance by length ready for the next duplex operation.

Squeeze data from a duplex state into a buffer with

  duplex_squeeze(state, data, length);

Bytes are extracted from the duplex state into the buffer, permuting the
state after each rate-sized chunk. Once any final partial chunk is squeezed,
the counter will advance by length as with duplex_absorb().
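
For example, a 32-byte digest of a message might be sketched as below,
assuming message and length describe a caller-supplied buffer and using
duplex_pad() (described under Padding below) to close off the
variable-length input:

  duplex_t state = { 0 };
  uint8_t digest[32];

  duplex_absorb(state, message, length);    /* absorb the whole message */
  duplex_pad(state);                        /* variable-length input */
  duplex_squeeze(state, digest, sizeof digest);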


Encryption and decryption
-------------------------

To encrypt a plaintext using a duplex state, for example after absorbing a
shared key and nonce, call

  duplex_encrypt(state, data, length);

Similarly, decrypt a ciphertext using a duplex state with

  duplex_decrypt(state, data, length);

The rate portion of the state is mixed into rate-sized chunks of the buffer
in turn. Before permuting and continuing to the next chunk, the plaintext is
absorbed back into the rate. (For encryption this is the original chunk; for
decryption it is the updated chunk.) The counter will advance by length.

To implement authenticated encryption, squeeze and append a rate-sized tag
after encrypting a message and padding the state. This can then be checked
against duplex_rate bytes squeezed by the recipient after decryption.
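
As a minimal sketch of this pattern, assuming key is 32 bytes, nonce is
duplex_rate bytes and message/length describe a caller-supplied plaintext
buffer:

  duplex_t state = { 0 };
  uint8_t tag[duplex_rate];

  duplex_absorb(state, key, 32);              /* fixed-length key */
  duplex_absorb(state, nonce, duplex_rate);   /* fixed-length nonce */
  duplex_encrypt(state, message, length);     /* encrypt in place */
  duplex_pad(state);                          /* variable-length message */
  duplex_squeeze(state, tag, sizeof tag);     /* send ciphertext plus tag */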


Padding
-------

To prevent extension attacks following an operation on variable-length data,
pad the state with

  duplex_pad(state);

after the final duplex_absorb(), duplex_encrypt() or duplex_decrypt().

This pads the final partial/empty chunk and flips a domain-separation bit
in the capacity part of the state before permuting it, as described in the
Gimli NIST submission. The counter will advance to the next multiple of the
rate.

When designing protocols, compulsory fixed-length inputs that are naturally
multiples of duplex_rate such as keys or nonces can safely be absorbed
unpadded. However, to avoid extension attacks, variable-length messages must
be padded even if they happen to be an exact multiple of the rate.


Ratcheting
----------

To ensure forward secrecy during a session, call

  duplex_ratchet(state);

This irreversibly ratchets the duplex state by zeroing the rate portion of
the state and permuting, thus preventing rollback even from a completely
compromised state. The counter will advance by exactly duplex_rate.


Constant-time comparison
------------------------

Compare two equal-sized byte arrays in constant time with

  duplex_compare(a, b, length);

This returns 0 for equality, -1 otherwise. If a or b is null, the other
argument is compared with zero.

This is useful for validating authentication tags or checking other secret
data without inadvertently revealing the location of the first discrepancy
through the time taken to detect it.
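
For example, the recipient of the authenticated ciphertext sketched above
might decrypt in place and then check the received tag in constant time:

  duplex_t state = { 0 };
  uint8_t check[duplex_rate];

  duplex_absorb(state, key, 32);
  duplex_absorb(state, nonce, duplex_rate);
  duplex_decrypt(state, message, length);     /* decrypt in place */
  duplex_pad(state);
  duplex_squeeze(state, check, sizeof check);
  if (duplex_compare(check, tag, duplex_rate) != 0) {
    /* authentication failed: discard the decrypted buffer */
  }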


Clearing sensitive data
-----------------------

Clear sensitive data such as keys or cleartext using

  duplex_zero(data, length);

Unlike memset() or bzero(), the compiler is forbidden to optimise this away
as a 'dead store' even if the buffer is subsequently discarded or unused.


Serialising duplex state
------------------------

Use duplex_byte(state, index) to dereference individual bytes of a duplex
state as lvalues, where 0 <= index < duplex_size. This macro may evaluate
the index argument more than once.

The resulting representation is independent of host byte order: the twelve
32-bit state words are accessed in turn, followed by the 64-bit counter,
each in standard little-endian order.

The last eight bytes of duplex_t are serialised as a second 64-bit integer.
They are unused by duplex.h and can safely be discarded. Alternatively, use
duplex_extra(state) to address them as an auxiliary uint64_t lvalue.

For example, to serialise state to uint8_t packed[56], use

  for (size_t i = 0; i < sizeof(packed); i++)
    packed[i] = duplex_byte(state, i);

and it can later be restored from this packed form with

  for (size_t i = 0; i < sizeof(packed); i++)
    duplex_byte(state, i) = packed[i];

Similar loops can be used to store or stream serialised state directly.


Implementation notes
--------------------

This is a straightforward vector conversion, trivial to check against the
reference Gimli and Xoodoo implementations. There is no manual unrolling of
loops and architecture-specific intrinsics are not needed, but compiled with
-O3 -march=native using gcc or clang on x86-64, this code is as fast as the
optimised SSE3 Gimli submitted to NIST and the Keccak team's SSE2 Xoodoo in
XKCP. Modern gcc and clang compile vector extensions impressively well.

At the time of writing, duplex.h runs about 5% faster compiled with clang
13.0.1 than with gcc 11.2.0 on AMD Ryzen 4800U. Byte-by-byte duplex calls
achieve 66% and 62% of the throughput of bulk encryption and decryption on
gcc and clang respectively, rising to 99% and 96% for 16-byte operations.

Even with state-of-the-art compilers, vector types are worthwhile. On the
same AMD Ryzen 4800U, when rewritten as a loop over four uint32_t columns,
permutation takes 30% longer with clang 13.0.1 and more than double the
time with gcc 11.2.0. Similarly, a simpler byte-by-byte absorb/squeeze loop
sacrifices around 20% bulk throughput with both compilers.

Although duplex.h works on both little- and big-endian architectures, it
will refuse to build on a mixed-endian system even if you contrive one with
sufficient gcc support. Both duplex_gimli() and duplex_xoodoo() explicitly
shuffle uint32x4_t rows as uint8x16_t vectors to optimise byte-aligned
rotations and word-exchanges. This improves Xoodoo and Gimli throughput by
10% and 20% respectively with gcc 11.2.0, although the performance gap is
much smaller on clang.

For information on Gimli, see https://gimli.cr.yp.to/gimli-20170627.pdf and
the documentation at https://csrc.nist.gov/projects/lightweight-cryptography
where Gimli is a candidate. Sponge and duplex constructions are well-covered
by https://eprint.iacr.org/2011/499.pdf including generic security analysis.

For information on Xoodoo, see https://keccak.team/xoodoo.html and
https://eprint.iacr.org/2018/767.pdf - but note the Pocketcrypt duplex
construction is the simple one specified for Gimli rather than the more
complicated Xoodyak cyclist object.


X25519
======

This X25519 implementation is adapted and extended from the elegant code in
Mike Hamburg's STROBE protocol framework at https://strobe.sourceforge.io/.
It is very portable but can detect and take advantage of 128-bit integer
types where they are available.

Prototypes for available operations are in x25519.h. Code using them must be
linked against x25519.c. To use the library, copy both files into your tree.

Both curve points (represented by Montgomery x-coordinates) and scalars are
manipulated as 32-byte little-endian arrays. The correct-sized array type
and standard base point (generator) are defined by x25519.h as

  typedef uint8_t x25519_t[x25519_size];
  const x25519_t x25519_base = { 9 };

x25519.c assumes constant-time integer multiplication. This is valid for
modern x86-64 and arm64 processors, but variable-time multiplies on some
embedded platforms may introduce timing leaks.

The library currently runs faster compiled with clang -O3 than with gcc -O3.
For best performance on clang, build with aggressive function inlining using
-mllvm -inline-threshold=5000 to obtain 30-35% faster code. However, raising
the analogous -finline-limit value on gcc appears to hinder performance.


Scalar multiplication
---------------------

Multiply an x25519_t curve point by an x25519_t scalar with

  x25519(out, scalar, point);

x25519() returns -1 if the product vanishes, otherwise 0. This is used to
detect non-contributory behaviour as described below.

Generate a key pair (sk, pk) by randomising sk then calling

  x25519(pk, sk, x25519_base);

to calculate pk.

Similarly, calculate a shared secret corresponding to sk and pk by calling

  x25519(key, sk, pk);

x25519(key, sk1, pk2) and x25519(key, sk2, pk1) will generate the same key
for all pairs (sk1, pk1) and (sk2, pk2). The computational ECDH assumption
is that recovering this key with neither sk1 nor sk2 is infeasible.

Shared keys have high entropy but as curve points they are not free of
structure. They are safe to absorb into a duplex construction or otherwise
hash to obtain unbiased bits.
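
A complete exchange might therefore be sketched as below, where
getentropy() from <unistd.h> stands in for whatever random source the
platform provides, error checking is omitted, and the masking of the low
secret-key byte anticipates the small-subgroup discussion in the next
section:

  x25519_t sk1, pk1, sk2, pk2, key1, key2;
  duplex_t state = { 0 };

  getentropy(sk1, sizeof sk1);      /* random secret keys */
  getentropy(sk2, sizeof sk2);
  sk1[0] &= 0xf8;                   /* multiples of 8: see the next section */
  sk2[0] &= 0xf8;
  x25519(pk1, sk1, x25519_base);    /* public keys */
  x25519(pk2, sk2, x25519_base);

  x25519(key1, sk1, pk2);           /* each side derives the same secret */
  x25519(key2, sk2, pk1);
  duplex_absorb(state, key1, sizeof key1);   /* hash before use, as above */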

For an overview of X25519, see https://cr.yp.to/ecdh/curve25519-20060209.pdf
and https://tools.ietf.org/html/rfc7748 sections 4.1 and 6.1.


Small-subgroup confinement attacks
----------------------------------

Some protocols require that neither participant has sole control of a shared
secret. However, the curve has cofactor 8 and its twist has cofactor 4, so
there exist a handful of low-order torsion points which generate at most 8
distinct scalar products. An attacker might submit these as public keys.

The simplest way to detect this non-contributory behaviour is to generate
secret keys as multiples of eight by masking with sk[0] &= 0xf8 after
randomising sk[]. Such keys annihilate any torsion component, so if x25519()
returns 0, the scalar product is non-zero and the key exchange was safe.

RFC 7748 itself specifies a clamping operation on both scalars and points,
implemented as clamp() in test/known-x25519.c. Alas, rather than framing it
as part of key generation, the RFC bundles it into key exchange. This is
unfortunate as it is not well-defined modulo the base point order so doesn't
preserve group structure, causing problems if (x + y)P = xP + yP is needed,
such as with Schnorr signatures. It is not bundled into this x25519().

For cases where the scalar is derived rather than generated (perhaps from
hierarchical key assignment) and where the group structure needs to be
preserved, a cleaner option is to map it to a torsion-safe representative.

Given any 32-byte scalar, use

  x25519_scalar(out, scalar)

to cheaply calculate a representative congruent to the original modulo the
order of the base point, and so whose product with valid public keys in the
prime-order subgroup is unchanged, but which is also a multiple of eight to
annihilate any torsion component under scalar multiplication.

None of this complexity is needed for standard Diffie-Hellman exchanges, and
wherever possible, it is preferable to design protocols that do not rely on
contributory behaviour.


Scalar inversion
----------------

Invert a scalar modulo the order of the X25519 base point with

  x25519_invert(out, scalar);

This is typically used to remove a blinding factor from a point in
oblivious pseudorandom functions. For example, given a compound scalar
product rsG, further multiplying by the scalar inverse of r or s will
recover sG or rG respectively.
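
For example, a blinded evaluation against the standard base point might be
sketched as below, where s is the evaluator's x25519_t secret, transport
between the two parties is omitted, and getentropy() again stands in for
the platform's random source:

  x25519_t r, rinv, rG, rsG, sG;

  getentropy(r, sizeof r);        /* requester's blinding factor */
  x25519(rG, r, x25519_base);     /* blinded request sent to the evaluator */

  x25519(rsG, s, rG);             /* evaluator applies its secret s */

  x25519_invert(rinv, r);         /* requester removes the blinding */
  x25519(sG, rinv, rsG);          /* and recovers sG */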


Mapping field elements to curve points
--------------------------------------

For any field element, call

  x25519_point(out, element);

to map it onto a curve point using the Elligator 2 mapping with non-square
parameter u = 2. See https://elligator.cr.yp.to/elligator-20130828.pdf for
more details.

A point on the full curve is returned, not necessarily in the prime-order
subgroup. If an attacker has control of the input element in a protocol, the
earlier discussion of small-subgroup confinement might be relevant.

The map is efficiently invertible and its range is around half of the points
on the curve. Assuming the probability of efficiently calculating discrete
logarithms for random curve points is negligible, the same is therefore true
for the images of random field elements under this function.

This is a building block for hash-to-curve functions. For example, if two
parties already share a secret duplex state, an ephemeral key exchange can
be authenticated by substituting a secret point for the standard base point.
Squeeze a uniformly-distributed x25519_t field element from the shared
state, map it to a curve point using this function, then pick multiples of
eight as ephemeral key-exchange secrets to avoid small-subgroup attacks.
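
A sketch of that construction, assuming state is a duplex_t already shared
by both parties and that the peer performs the same steps with its own
ephemeral secret:

  x25519_t element, base, e1, E1;

  duplex_squeeze(state, element, sizeof element);  /* shared field element */
  x25519_point(base, element);                     /* secret base point */

  getentropy(e1, sizeof e1);
  e1[0] &= 0xf8;                                   /* multiple of eight */
  x25519(E1, e1, base);                            /* send E1 to the peer */
  /* on receiving the peer's E2, derive the key with x25519(key, e1, E2) */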


Signatures
----------

STROBE-compatible X25519 Schnorr signatures are also supported. These are
different from standard Ed25519 signatures, but they minimise additional
code expenditure in protocols based around X25519 key exchange.

To sign a 32-byte scalar challenge c with identity key pair (a, A), generate
an ephemeral key pair (e, E) then call

  x25519_sign(response, challenge, ephemeral, identity);

This calculates the scalar response s = e + ca (mod l), where l is the order
of the X25519 base point. Discard the ephemeral secret e. The signature is
the 64-byte pair (E, s).

Given a scalar response s, scalar challenge c, ephemeral public key E and
identity public key A, call

  x25519_verify(response, challenge, ephemeral, identity);

to verify the response. This checks sG = ± E ± cA and rules out torsion
points. It returns 0 for a valid signature, -1 otherwise.

Schnorr challenges must hash the ephemeral public key as well as the message
to be signed, because the prover must commit before the verifier challenges
in the corresponding sigma protocol. Absorb the ephemeral public key on top
of a duplex state before squeezing a challenge to sign that state.

The signer's public identity should also be absorbed before signing unless
the state is already bound to it. https://eprint.iacr.org/2015/996.pdf
shows that multi-user attacks against key-prefixed Schnorr signatures are
no easier than single-user attacks against unprefixed signatures.
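
Putting this together, signing an existing transcript might be sketched as
below, where (a, A) is the signer's x25519_t identity key pair, state is a
duplex_t already bound to the (padded) message, and getentropy() stands in
for the platform's random source:

  x25519_t e, E, challenge, response;

  getentropy(e, sizeof e);                    /* fresh ephemeral secret */
  x25519(E, e, x25519_base);                  /* ephemeral public key E */
  duplex_absorb(state, A, sizeof A);          /* key-prefix the transcript */
  duplex_absorb(state, E, sizeof E);          /* commit to E */
  duplex_squeeze(state, challenge, sizeof challenge);
  x25519_sign(response, challenge, e, a);     /* signature is (E, response) */

The verifier rebuilds the same transcript from the message, A and the
received E, squeezes an identical challenge, and accepts the signature only
if x25519_verify(response, challenge, E, A) returns 0.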

Signatures are malleable: if s' = ± s (mod l) where s is a valid response,
s' is also a valid response. Similarly, a valid signature for a challenge c
verifies for any c' = ± c (mod l). As with all Schnorr signatures, leaking
or reusing an ephemeral secret trivially compromises the identity secret,
and more generally, bias in random key generation across many signatures
will leak key bits.

For deterministic signatures, clone the duplex state after absorbing the
message and public identity, absorb the identity secret on top then squeeze
the ephemeral secret, in the style of Ed25519's hashing. Discard the cloned
state and continue as before, absorbing the ephemeral public key into the
original duplex and squeezing a challenge. This eliminates the risk of
reusing an ephemeral key and the need for unbiased entropy during signing.
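
A sketch of that derivation, where state is the duplex transcript already
bound to the message and public identity, a is the x25519_t identity
secret, and memcpy() from <string.h> copies the state:

  duplex_t clone;
  x25519_t e, E;

  memcpy(clone, state, sizeof clone);      /* throwaway copy of the state */
  duplex_absorb(clone, a, sizeof a);       /* mix in the identity secret */
  duplex_squeeze(clone, e, sizeof e);      /* deterministic ephemeral secret */
  duplex_zero((void *) clone, sizeof clone); /* discard the clone securely */
  x25519(E, e, x25519_base);               /* then continue as before */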

See https://eprint.iacr.org/2012/309.pdf for details on STROBE signatures,
and also https://eprint.iacr.org/2017/518.pdf for the qDSA scheme.


Secret sharing
==============

A simple constant-time implementation of Shamir's secret sharing scheme over
bitsliced GF(256) is included, allowing 32-byte keys to be divided between
up to 255 key-holders, then later reconstructed from a sufficiently large
subset of shares.

Prototypes for these operations are in shamir.h and code calling them must
be linked against shamir.c. Copy both files into your tree to use them.


Splitting secrets
-----------------

To generate n 33-byte shares of a 32-byte secret, call

  shamir_split(share, index, threshold, secret, entropy);

for each index = 0, 1, ..., n - 1, where threshold is the minimum number of
shares that will be required to reconstruct the secret. index and threshold
must not exceed 254 and 255 respectively.

The scheme's security depends heavily on the block of 32 * (threshold - 1)
bytes supplied as uint8_t entropy[threshold - 1][32], which is used to set
random coefficients of a polynomial over GF(256). Randomise this before
first calling shamir_split(), leave it unchanged for subsequent invocations,
then securely discard it once all the shares are generated.

Flaws in the randomness of the entropy (or leaks of it) will compromise the
secret.
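
For example, a 3-of-5 split of an existing 32-byte secret might be sketched
as below, with getentropy() standing in for the platform's random source
and duplex_zero() from duplex.h used to discard the coefficients
afterwards:

  uint8_t shares[5][33], entropy[2][32];

  getentropy(entropy, sizeof entropy);        /* threshold - 1 random rows */
  for (size_t i = 0; i < 5; i++)
    shamir_split(shares[i], i, 3, secret, entropy);
  duplex_zero((void *) entropy, sizeof entropy); /* coefficients not needed */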


Reconstructing secrets
----------------------

Given a quorum in uint8_t shares[][33], reconstruct the 32-byte secret with

  shamir_combine(secret, count, shares);

where count is the number of 33-byte shares provided. This will silently
generate an incorrect secret if too few shares are supplied or if they are
inconsistent/corrupt.
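
Continuing the 3-of-5 sketch above, any three distinct shares suffice, for
example:

  uint8_t quorum[3][33], recovered[32];

  memcpy(quorum[0], shares[0], 33);           /* any three distinct shares */
  memcpy(quorum[1], shares[2], 33);
  memcpy(quorum[2], shares[4], 33);
  shamir_combine(recovered, 3, quorum);       /* recovered matches secret */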


Testing and installing the library
==================================

To run the test suite and basic benchmarks, use

  make test

The easiest way to link against this library is to copy the .c and .h files
directly into your source tree. However, an object library can also be built
and installed along with the header files. Run

  make install-shared

to install libpocketcrypt.so into PREFIX/lib and the pocketcrypt .h files
into PREFIX/include/pocketcrypt. Similarly

  make install-static

installs libpocketcrypt.a and the header files. The usual DESTDIR and PREFIX
variables are respected, as well as CC, CFLAGS, BINDIR, INCDIR and LIBDIR
for more detailed control of compilation and install paths.


Copying
=======

x25519.c was originally written by Mike Hamburg as part of the STROBE
protocol framework, and is distributed as Free Software under the terms of
the MIT license by Cryptography Research, Inc.

The rest of the software (including duplex.h) and this documentation were
written by Chris Webb <chris@arachsys.com> and the combined library is
distributed as Free Software under the terms of the MIT license in COPYING.