So I’ve just managed to upstream some changes to OpenSSL implementing a new strategy I’ve developed for efficient arithmetic on secp384r1, a curve prescribed by NIST for digital signatures and key exchange. In spite of its prevalence, its implementation in OpenSSL has remained somewhat unoptimised, even as less frequently used curves (P224, P256, P521) each have their own optimisations.
The strategy I have used could be called a 56-bit redundant limb implementation with Solinas reduction. Without too much micro-optimisation, we get a ~5.5x speedup over the default (Montgomery multiplication) implementation for the creation of digital signatures.
How is this possible? Well, first let’s quickly define some terminology:
Elliptic Curves
When it comes to cryptography, those with a computer science background will likely be familiar with ideas such as key exchange and private-key signing. The stand-in asymmetric cipher in a typical computer science curriculum is RSA. However, the heyday of elliptic curve ciphers has well and truly arrived, and their operation seems no less mystical than when they were just a toy for academia.
The word ‘elliptic’ may seem to imply continuous mathematics, but as a useful cryptographic problem, we are fundamentally interested only in the algebraic properties of these curves, whose points are elements of a finite field. Irrespective of the underlying finite field, the group structure on an elliptic curve’s points can be shown to exist by an application of Bézout’s Theorem. The group operation is defined geometrically: a line intersects the curve either once, twice or thrice, granting notions of addition and doubling for the points of intersection, with the ‘point at infinity’ serving as the group identity. Closed forms exist for computing a point doubling/addition in arbitrary fields; which closed form applies is determined by the field’s characteristic, and the same closed form applies for all large prime fields.
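For the curious, these are the textbook affine-coordinate formulas for adding distinct points \(P = (x_1, y_1)\) and \(Q = (x_2, y_2)\) (and for doubling, when \(P = Q\)) on a short Weierstrass curve \(y^2 = x^3 + ax + b\) over a field of characteristic greater than 3; they are the standard group law, not anything specific to my change:

$$\lambda = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & P \neq Q \\[1.5ex] \dfrac{3x_1^2 + a}{2y_1} & P = Q \end{cases} \qquad x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda(x_1 - x_3) - y_1$$

The divisions here are field inversions, which is precisely why the speed of the underlying modular arithmetic dominates the cost of curve operations.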
Our algorithm uses a field of the form \(\mathbb{F}_p\), that is, the unique field with \(p\) elements, for a prime \(p\). The most straightforward construction of this field is arithmetic modulo \(p\). The other finite fields used in practice in ECC are of the form \(\mathbb{F}_{2^m}\) and are sometimes called ‘binary fields’ (their elements are representable as polynomials with binary coefficients). The same field structure appears in AES’s byte substitution, which is implemented via inversion in \(\mathbb{F}_{2^8}\).
From a performance perspective, great optimisations can be made by implementing efficient fixed-precision arithmetic specialised to reduction modulo a single prime constant, \(p\). From here on out, I’ll be speaking from this abstraction layer alone.
Limbs
We wish to multiply two \(m\)-bit numbers, each of which is represented with \(n\) 64-bit machine words in some way. Let’s suppose just for now that \(n\) divides \(m\) neatly; then the quotient \(d = m/n\) is the minimum number of bits in each machine word that will be required for representing our number. Suppose we use the straightforward representation whereby the least significant \(d\) bits of each word are used for storing parts of our number, which we had better call \(x\), because this is crypto and descriptive variable names are considered harmful (apparently).
$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$
If we then drop the requirement for each of our \(n\) machine words (also referred to as ‘limbs’ from here on out) to have no more than the least significant \(d\) bits populated, we say that such an implementation uses ‘redundant limbs’, meaning that the \(k\)-th limb has high bits which overlap with the place values represented in the \((k+1)\)-th limb.
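To make this concrete, here is a minimal sketch of the non-redundant representation for a 384-bit value in seven 56-bit limbs; the felem name, the little-endian byte order and the loader are my own choices for exposition, not necessarily what ships in OpenSSL:

#include <stdint.h>

/* A field element for P-384: 7 limbs of (nominally) 56 bits each,
 * giving 7 * 56 = 392 bits of capacity for a 384-bit value. In
 * redundant form, a limb may hold more than 56 significant bits, its
 * high bits overlapping the place value of the next limb. */
typedef uint64_t felem[7];

/* Load a non-redundant felem from a 384-bit little-endian byte string.
 * Each limb takes 56 bits (7 bytes) from the input; only 48 bits remain
 * for the top limb. */
static void felem_from_bytes(felem out, const uint8_t in[48])
{
    for (int i = 0; i < 7; i++) {
        uint64_t limb = 0;
        for (int j = 0; j < 7; j++) {        /* 7 bytes = 56 bits */
            unsigned bit = i * 56 + j * 8;
            if (bit < 384)
                limb |= (uint64_t)in[bit / 8] << (j * 8);
        }
        out[i] = limb;                       /* < 2^56: non-redundant */
    }
}

With \(d = 56\) and \(n = 7\), this is exactly the representation \(x = \sum_{k} 2^{56k} l_k\) from the formula above.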
Multiplication (mod p)
The fundamental difficulty with making modular arithmetic fast has to do with the following property of multiplication.
Let \(a\) and \(b\) be \(m\)-bit numbers; then \(0 \leq a < 2^m\) and \(0 \leq b < 2^m\), but critically we cannot say the same about \(ab\). Instead, the best we can say is that \(0 \leq ab < 2^{2m}\). Multiplication can in the worst case double the number of bits that must be stored (for instance, the product of two 384-bit numbers can occupy up to 768 bits), unless we can reduce modulo our prime.
If we begin with non-redundant 56-bit limbs, then for \(a\) and \(b\) that are ‘sufficiently reduced’, i.e. not too much larger than \(p_{384} < 2^{384}\), we can multiply our limbs in the following ladder, so long as we are capable of storing the resulting sums without overflow.
/* and so on ... */
out[5] = ((uint128_t) in1[0]) * in2[5]
       + ((uint128_t) in1[1]) * in2[4]
       + ((uint128_t) in1[2]) * in2[3]
       + ((uint128_t) in1[3]) * in2[2]
       + ((uint128_t) in1[4]) * in2[1]
       + ((uint128_t) in1[5]) * in2[0];
out[6] = ((uint128_t) in1[0]) * in2[6]
       + ((uint128_t) in1[1]) * in2[5]
       + ((uint128_t) in1[2]) * in2[4]
       + ((uint128_t) in1[3]) * in2[3]
       + ((uint128_t) in1[4]) * in2[2]
       + ((uint128_t) in1[5]) * in2[1]
       + ((uint128_t) in1[6]) * in2[0];
out[7] = ((uint128_t) in1[1]) * in2[6]
       + ((uint128_t) in1[2]) * in2[5]
       + ((uint128_t) in1[3]) * in2[4]
       + ((uint128_t) in1[4]) * in2[3]
       + ((uint128_t) in1[5]) * in2[2]
       + ((uint128_t) in1[6]) * in2[1];
out[8] = ((uint128_t) in1[2]) * in2[6]
       + ((uint128_t) in1[3]) * in2[5]
       + ((uint128_t) in1[4]) * in2[4]
       + ((uint128_t) in1[5]) * in2[3]
       + ((uint128_t) in1[6]) * in2[2];
/* ... and so forth */
This is possible if we back each of the 56-bit limbs with a 64-bit machine word, with products being stored in 128-bit machine words. The numbers \(a\) and \(b\) each fit in 7 limbs, whereas we use 13 limbs for storing the product. If \(a\) and \(b\) were stored non-redundantly, then each of the output (redundant) limbs is a sum of at most 7 limb products, and so must contain a value less than \(7 \cdot 2^{56} \cdot 2^{56} < 2^{115}\); there is no possibility of overflow in 128 bits. We even have room to spare to do some additions/subtractions in cheap, redundant limb arithmetic.
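As a concrete illustration of that last point, here is a minimal sketch of why additions stay cheap in redundant form: given the bound above, we can add whole wide products limb-wise without propagating a single carry. The names widefelem and widefelem_add are mine for exposition, and I assume a compiler with unsigned __int128 (as the ladder above already does):

/* A 13-limb wide product, as produced by the multiplication ladder. */
typedef unsigned __int128 uint128_t;

/* Add two wide products limb-wise. No carry chain is needed: each
 * input limb is < 2^115, so each sum is < 2^116, which still fits
 * comfortably in 128 bits. Carries are deferred to a later pass. */
static void widefelem_add(uint128_t out[13],
                          const uint128_t a[13],
                          const uint128_t b[13])
{
    for (int i = 0; i < 13; i++)
        out[i] = a[i] + b[i];
}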
But we can’t keep doing our sums in redundant limb arithmetic forever; we must eventually reduce. Doing so may be expensive, and so we would rather reduce only when strictly necessary!
Solinas-ish Reduction
Our prime is a Solinas (Pseudo/Generalised-Mersenne) prime. Mersenne primes are primes expressible as \(2^m - 1\). This can be generalised to low-degree polynomials in \(2^m\). For example, another NIST curve uses \(p_{224} = 2^{224} - 2^{96} + 1\) (a 224-bit number), where \(p_{224} = f(2^{32})\) for \(f(t) = t^7 - t^3 + 1\). The simpler the choice of polynomial, the simpler the modular reduction logic.
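To see why this shape makes reduction cheap, consider the \(p_{224}\) example; the worked step below is my own illustration of the general idea. Since \(f(t) \equiv 0 \pmod{p_{224}}\) at \(t = 2^{32}\), the top term of the polynomial can be rewritten in terms of lower ones:

$$t^7 \equiv t^3 - 1 \pmod{f(t)} \quad \Longrightarrow \quad 2^{224} \equiv 2^{96} - 1 \pmod{p_{224}}$$

So any bits at or above position 224 can be ‘folded’ back down into positions 96 and 0 using only shifts, additions and subtractions, with no general-purpose division required.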
Our choice of \(t\) is \(2^{56}\). Wikipedia describes the ideal case for Solinas reduction, where the bit width of the prime is divisible by \(\log_2{t}\), but that is not our scenario. We choose 56 bits for some pretty simple realities of hardware: 56 is less than 64 (the standard machine word size), but not by too much, and the difference is exactly one byte (\(64 - 56 = 8\) bits). Let me explain: