mystuff/net/gurk-rs/files/vendor/curve25519-dalek-2.0.0/docs/ifma-notes.md

26 KiB

An AVX512-IFMA implementation of the vectorized point operation strategy.

IFMA instructions

AVX512-IFMA is an extension to AVX-512 consisting of two instructions:

  • vpmadd52luq: packed multiply of unsigned 52-bit integers and add the low 52 product bits to 64-bit accumulators;
  • vpmadd52huq: packed multiply of unsigned 52-bit integers and add the high 52 product bits to 64-bit accumulators;

These operate on 64-bit lanes of their source vectors, taking the low 52 bits of each lane of each source vector, computing the 104-bit products of each pair, and then adding either the high or low 52 bits of the 104-bit products to the 64-bit lanes of the destination vector. The multiplication is performed internally by reusing circuitry for floating-point arithmetic. Although these instructions are part of AVX512, the AVX512VL (vector length) extension (present whenever IFMA is) allows using them with 512, 256, or 128-bit operands.

This provides a major advantage to vectorized integer operations: previously, vector operations could only use a \(32 \times 32 \rightarrow 64\)-bit multiplier, while serial code could use a \(64\times 64 \rightarrow 128\)-bit multiplier.

IFMA for big-integer multiplications

A detailed example of the intended use of the IFMA instructions can be found in a 2016 paper by Gueron and Krasnov, Accelerating Big Integer Arithmetic Using Intel IFMA Extensions. The basic idea is that multiplication of large integers (such as 1024, 2048, or more bits) can be performed as follows.

First, convert a “packed” 64-bit representation \[ \begin{aligned} x &= x'_0 + x'_1 2^{64} + x'_2 2^{128} + \cdots \\ y &= y'_0 + y'_1 2^{64} + y'_2 2^{128} + \cdots \end{aligned} \] into a “redundant” 52-bit representation \[ \begin{aligned} x &= x_0 + x_1 2^{52} + x_2 2^{104} + \cdots \\ y &= y_0 + y_1 2^{52} + y_2 2^{104} + \cdots \end{aligned} \] with each \(x_i, y_j\) in a 64-bit lane.

Writing the product as \(z = z_0 + z_1 2^{52} + z_2 2^{104} + \cdots\), the “schoolbook” multiplication strategy gives \[ \begin{aligned} &z_0 &&=& x_0 & y_0 & & & & & & & & \\ &z_1 &&=& x_1 & y_0 &+ x_0 & y_1 & & & & & & \\ &z_2 &&=& x_2 & y_0 &+ x_1 & y_1 &+ x_0 & y_2 & & & & \\ &z_3 &&=& x_3 & y_0 &+ x_2 & y_1 &+ x_1 & y_2 &+ x_0 & y_3 & & \\ &z_4 &&=& \vdots\;&\;\vdots &+ x_3 & y_1 &+ x_2 & y_2 &+ x_1 & y_3 &+ \cdots& \\ &z_5 &&=& & & \vdots\;&\;\vdots &+ x_3 & y_2 &+ x_2 & y_3 &+ \cdots& \\ &z_6 &&=& & & & & \vdots\;&\;\vdots &+ x_3 & y_3 &+ \cdots& \\ &z_7 &&=& & & & & & & \vdots\;&\;\vdots &+ \cdots& \\ &\vdots&&=& & & & & & & & & \ddots& \\ \end{aligned} \] Notice that the product coefficient \(z_k\), representing the value \(z_k 2^{52k}\), is the sum of all product terms \( (x_i 2^{52 i}) (y_j 2^{52 j}) \) with \(k = i + j\).
Write the IFMA operators \(\mathrm{lo}(a,b)\), denoting the low \(52\) bits of \(ab\), and \(\mathrm{hi}(a,b)\), denoting the high \(52\) bits of \(ab\).
Now we can rewrite the product terms as \[ \begin{aligned} (x_i 2^{52 i}) (y_j 2^{52 j}) &= 2^{52 (i+j)}( \mathrm{lo}(x_i, y_j) + \mathrm{hi}(x_i, y_j) 2^{52} ) \\ &= \mathrm{lo}(x_i, y_j) 2^{52 (i+j)} + \mathrm{hi}(x_i, y_j) 2^{52 (i+j+1)}. \end{aligned} \] This means that the low half of \(x_i y_j\) can be accumulated onto the product limb \(z_{i+j}\) and the high half can be directly accumulated onto the next-higher product limb \(z_{i+j+1}\) with no additional operations. This allows rewriting the schoolbook multiplication into the form \[ \begin{aligned} &z_0 &&=& \mathrm{lo}(x_0,&y_0) & & & & & & & & & & \\ &z_1 &&=& \mathrm{lo}(x_1,&y_0) &+\mathrm{hi}(x_0,&y_0) &+\mathrm{lo}(x_0,&y_1) & & & & & & \\ &z_2 &&=& \mathrm{lo}(x_2,&y_0) &+\mathrm{hi}(x_1,&y_0) &+\mathrm{lo}(x_1,&y_1) &+\mathrm{hi}(x_0,&y_1) &+\mathrm{lo}(x_0,&y_2) & & \\ &z_3 &&=& \mathrm{lo}(x_3,&y_0) &+\mathrm{hi}(x_2,&y_0) &+\mathrm{lo}(x_2,&y_1) &+\mathrm{hi}(x_1,&y_1) &+\mathrm{lo}(x_1,&y_2) &+ \cdots& \\ &z_4 &&=& \vdots\;&\;\vdots &+\mathrm{hi}(x_3,&y_0) &+\mathrm{lo}(x_3,&y_1) &+\mathrm{hi}(x_2,&y_1) &+\mathrm{lo}(x_2,&y_2) &+ \cdots& \\ &z_5 &&=& & & \vdots\;&\;\vdots & \vdots\;&\;\vdots &+\mathrm{hi}(x_3,&y_1) &+\mathrm{lo}(x_3,&y_2) &+ \cdots& \\ &z_6 &&=& & & & & & & \vdots\;&\;\vdots & \vdots\;&\;\vdots &+ \cdots& \\ &\vdots&&=& & & & & & & & & & & \ddots& \\ \end{aligned} \] Gueron and Krasnov implement multiplication by constructing vectors out of the columns of this diagram, so that the source operands for the IFMA instructions are of the form \((x_0, x_1, x_2, \ldots)\) and \((y_i, y_i, y_i, \ldots)\). After performing the multiplication, the product terms \(z_i\) are then repacked into a 64-bit representation.

An alternative strategy

The strategy described above is aimed at big-integer multiplications, such as 1024, 2048, or 4096 bits, which would be used for applications like RSA. However, elliptic curve cryptography uses much smaller field sizes, such as 256 or 384 bits, so a different strategy is needed.

The parallel Edwards formulas provide parallelism at the level of the formulas for curve operations. This means that instead of scanning through the terms of the source operands and parallelizing within a field element (as described above), we can arrange the computation in product-scanning form and parallelize across field elements (as described below).

The parallel Edwards formulas provide 4-way parallelism, so they can be implemented using 256-bit vectors using a single 64-bit lane for each element, or using 512-bit vectors using two 64-bit lanes. The only available CPU supporting IFMA (the i3-8121U) executes 512-bit IFMA instructions at half rate compared to 256-bit instructions, so for now there's no throughput advantage to using 512-bit IFMA instructions, and this implementation uses 256-bit vectors.

To extend this to 512-bit vectors, it's only only necessary to achieve 2-way parallelism, and it's possible (with a small amount of overhead) to create a hybrid strategy that operates entirely within 128-bit lanes. This means that cross-lane operations can use the faster vpshufd (1c latency) instead of a general shuffle instruction (3c latency).

Choice of radix

The inputs to IFMA instructions are 52 bits wide, so the radix \(r\) used to represent a multiprecision integer must be \( r \leq 52 \). The obvious choice is the "native" radix \(r = 52\).

As described above, this choice has the advantage that for \(x_i, y_j \in [0,2^{52})\), the product term \[ \begin{aligned} (x_i 2^{52 i}) (y_j 2^{52 j}) &= 2^{52 (i+j)}( \mathrm{lo}(x_i, y_j) + \mathrm{hi}(x_i, y_j) 2^{52} ) \\ &= \mathrm{lo}(x_i, y_j) 2^{52 (i+j)} + \mathrm{hi}(x_i, y_j) 2^{52 (i+j+1)}, \end{aligned} \] so that the low and high halves of the product can be directly accumulated onto the product limbs. In contrast, when using a smaller radix \(r = 52 - k\), the product term has the form \[ \begin{aligned} (x_i 2^{r i}) (y_j 2^{r j}) &= 2^{r (i+j)}( \mathrm{lo}(x_i, y_j) + \mathrm{hi}(x_i, y_j) 2^{52} ) \\ &= \mathrm{lo}(x_i, y_j) 2^{r (i+j)} + ( \mathrm{hi}(x_i, y_j) 2^k ) 2^{r (i+j+1)}. \end{aligned} \] What's happening is that the product \(x_i y_j\) of size \(2r\) bits is split not at \(r\) but at \(52\), so \(k\) product bits are placed into the low half instead of the high half. This means that the high half of the product cannot be directly accumulated onto \(z_{i+j+1}\), but must first be multiplied by \(2^k\) (i.e., left shifted by \(k\)). In addition, the low half of the product is \(52\) bits large instead of \(r\) bits.

Handling offset product terms

Drucker and Gueron analyze the choice of radix in the context of big-integer squaring, outlining three ways to handle the offset product terms, before concluding that all of them are suboptimal:

  1. Shift the results after accumulation;
  2. Shift the input operands before multiplication;
  3. Split the MAC operation, accumulating into a zeroed register, shifting the result, and then adding.

The first option is rejected because it could double-shift some previously accumulated terms, the second doesn't work because the inputs could become larger than \(52\) bits, and the third requires additional instructions to handle the shifting and adding.

Based on an analysis of total number of instructions, they suggest an addition to the instruction set, which they call FMSA (fused multiply-shift-add). This would shift the result according to an 8-bit immediate value before accumulating it into the destination register.

However, this change to the instruction set doesn't seem to be necessary. Instead, the product terms can be grouped according to their coefficients, accumulated together, then shifted once before adding them to the final sum. This uses an extra register, shift, and add, but only once per product term (accumulation target), not once per source term (as in the Drucker-Gueron paper).

Moreover, because IFMA instructions execute only on two ports (presumably 0 and 1), while adds and shifts can execute on three ports (0, 1, and 5), the adds and shifts can execute independently of the IFMA operations, as long as there is not too much pressure on port 5. This means that, although the total number of instructions increases, the shifts and adds do not necessarily increase the execution time, as long as throughput is limited by IFMA operations.

Finally, because IFMA instructions have 4 cycle latency and 0.5/1 cycle throughput (for 256/512 bit vectors), maximizing IFMA throughput requires either 8 (for 256) or 4 (for 512) independent operations. So accumulating groups of terms independently before adding them at the end may be necessary anyways, in order to prevent long chains of dependent instructions.

Advantages of a smaller radix

Using a smaller radix has other advantages. Although radix \(52\) is an unsaturated representation from the point of view of the \(64\)-bit accumulators (because up to 4096 product terms can be accumulated without carries), it's a saturated representation from the point of view of the multiplier (since \(52\)-bit values are the maximum input size).

Because the inputs to a multiplication must have all of their limbs bounded by \(2^{52}\), limbs in excess of \(2^{52}\) must be reduced before they can be used as an input. The Gueron-Krasnov paper suggests normalizing values using a standard, sequential carry chain: for each limb, add the carryin from reducing the previous limb, compute the carryout and reduce the current limb, then move to the next limb.

However, when using a smaller radix, such as \(51\), each limb can store a carry bit and still be used as the input to a multiplication. This means that the inputs do not need to be normalized, and instead of using a sequential carry chain, we can compute all carryouts in parallel, reduce all limbs in parallel, and then add the carryins in parallel (possibly growing the limb values by one bit).

Because the output of this partial reduction is an acceptable multiplication input, we can "close the loop" using partial reductions and never have to normalize to a canonical representation through the entire computation, in contrast to the Gueron-Krasnov approach, which converts back to a packed representation after every operation. (This idea seems to trace back to at least as early as this 1999 paper).

Using \(r = 51\) is enough to keep a carry bit in each limb and avoid normalizations. What about an even smaller radix? One reason to choose a smaller radix would be to align the limb boundaries with an inline reduction (for instance, choosing \(r = 43\) for the Mersenne field \(p = 2^{127} - 1\)), but for \(p = 2^{255 - 19}\), \(r = 51 = 255/5\) is the natural choice.

Multiplication

The inputs to a multiplication are two field elements \[ \begin{aligned} x &= x_0 + x_1 2^{51} + x_2 2^{102} + x_3 2^{153} + x_4 2^{204} \\ y &= y_0 + y_1 2^{51} + y_2 2^{102} + y_3 2^{153} + y_4 2^{204}, \end{aligned} \] with limbs in range \([0,2^{52})\).

Writing the product terms as \[ \begin{aligned} z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\ &+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459}, \end{aligned} \] a schoolbook multiplication in product scanning form takes the form \[ \begin{aligned} z_0 &= x_0 y_0 \\ z_1 &= x_1 y_0 + x_0 y_1 \\ z_2 &= x_2 y_0 + x_1 y_1 + x_0 y_2 \\ z_3 &= x_3 y_0 + x_2 y_1 + x_1 y_2 + x_0 y_3 \\ z_4 &= x_4 y_0 + x_3 y_1 + x_2 y_2 + x_1 y_3 + x_0 y_4 \\ z_5 &= x_4 y_1 + x_3 y_2 + x_2 y_3 + x_1 y_4 \\ z_6 &= x_4 y_2 + x_3 y_3 + x_2 y_4 \\ z_7 &= x_4 y_3 + x_3 y_4 \\ z_8 &= x_4 y_4 \\ z_9 &= 0 \\ \end{aligned} \] Each term \(x_i y_j\) can be written in terms of IFMA operations as \[ x_i y_j = \mathrm{lo}(x_i,y_j) + 2\mathrm{hi}(x_i,y_j)2^{51}. \] Substituting this equation into the schoolbook multiplication, then moving terms to eliminate the \(2^{51}\) factors gives \[ \begin{aligned} z_0 &= \mathrm{lo}(x_0, y_0) \\ &+ \qquad 0 \\ z_1 &= \mathrm{lo}(x_1, y_0) + \mathrm{lo}(x_0, y_1) \\ &+ \qquad 2( \mathrm{hi}(x_0, y_0) )\\ z_2 &= \mathrm{lo}(x_2, y_0) + \mathrm{lo}(x_1, y_1) + \mathrm{lo}(x_0, y_2) \\ &+ \qquad 2( \mathrm{hi}(x_1, y_0) + \mathrm{hi}(x_0, y_1) )\\ z_3 &= \mathrm{lo}(x_3, y_0) + \mathrm{lo}(x_2, y_1) + \mathrm{lo}(x_1, y_2) + \mathrm{lo}(x_0, y_3) \\ &+ \qquad 2( \mathrm{hi}(x_2, y_0) + \mathrm{hi}(x_1, y_1) + \mathrm{hi}(x_0, y_2) )\\ z_4 &= \mathrm{lo}(x_4, y_0) + \mathrm{lo}(x_3, y_1) + \mathrm{lo}(x_2, y_2) + \mathrm{lo}(x_1, y_3) + \mathrm{lo}(x_0, y_4) \\ &+ \qquad 2( \mathrm{hi}(x_3, y_0) + \mathrm{hi}(x_2, y_1) + \mathrm{hi}(x_1, y_2) + \mathrm{hi}(x_0, y_3) )\\ z_5 &= \mathrm{lo}(x_4, y_1) + \mathrm{lo}(x_3, y_2) + \mathrm{lo}(x_2, y_3) + \mathrm{lo}(x_1, y_4) \\ &+ \qquad 2( \mathrm{hi}(x_4, y_0) + \mathrm{hi}(x_3, y_1) + \mathrm{hi}(x_2, y_2) + \mathrm{hi}(x_1, y_3) + \mathrm{hi}(x_0, y_4) )\\ z_6 &= \mathrm{lo}(x_4, y_2) + \mathrm{lo}(x_3, y_3) + \mathrm{lo}(x_2, y_4) \\ &+ \qquad 2( \mathrm{hi}(x_4, y_1) + \mathrm{hi}(x_3, y_2) + \mathrm{hi}(x_2, y_3) + \mathrm{hi}(x_1, y_4) )\\ z_7 &= \mathrm{lo}(x_4, y_3) + \mathrm{lo}(x_3, y_4) \\ &+ \qquad 2( \mathrm{hi}(x_4, y_2) + \mathrm{hi}(x_3, y_3) + \mathrm{hi}(x_2, y_4) )\\ z_8 &= \mathrm{lo}(x_4, y_4) \\ &+ \qquad 2( \mathrm{hi}(x_4, y_3) + \mathrm{hi}(x_3, y_4) )\\ z_9 &= 0 \\ &+ \qquad 2( \mathrm{hi}(x_4, y_4) )\\ \end{aligned} \] As noted above, our strategy will be to multiply and accumulate the terms with coefficient \(2\) separately from those with coefficient \(1\), before combining them at the end. This can alternately be thought of as accumulating product terms into a doubly-redundant representation, with two limbs for each digit, before collapsing the doubly-redundant representation by shifts and adds.

This computation requires 25 vpmadd52luq and 25 vpmadd52huq operations. For 256-bit vectors, IFMA operations execute on an i3-8121U with latency 4 cycles, throughput 0.5 cycles, so executing 50 instructions requires 25 cycles' worth of throughput. Accumulating terms with coefficient \(1\) and \(2\) seperately means that the longest dependency chain has length 5, so the critical path has length 20 cycles and the bottleneck is throughput.

Reduction modulo \(p\)

The next question is how to handle the reduction modulo \(p\). Because \(p = 2^{255} - 19\), \(2^{255} = 19 \pmod p\), so we can alternately write \[ \begin{aligned} z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\ &+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459} \end{aligned} \] as \[ \begin{aligned} z &= (z_0 + 19z_5) + (z_1 + 19z_6) 2^{51} + (z_2 + 19z_7) 2^{102} + (z_3 + 19z_8) 2^{153} + (z_4 + 19z_9) 2^{204}. \end{aligned} \] When using a \(64 \times 64 \rightarrow 128\)-bit multiplier, this can be handled (as in Ed25519) by premultiplying source terms by \(19\). Since \(\lg(19) < 4.25\), this increases their size by less than \(4.25\) bits, and the rest of the multiplication can be shown to work out.

Here, we have at most \(1\) bit of headroom. In order to allow premultiplication, we would need to use radix \(2^{47}\), which would require six limbs instead of five. Instead, we compute the high terms \(z_5, \ldots, z_9\), each using two chains of IFMA operations, then multiply by \(19\) and combine with the lower terms \(z_0, \ldots, z_4\). There are two ways to perform the multiplication by \(19\): using more IFMA operations, or using the vpmullq instruction, which computes the low \(64\) bits of a \(64 \times 64\)-bit product. However, vpmullq has 15c/1.5c latency/throughput, in contrast to the 4c/0.5c latency/throughput of IFMA operations, so it seems like a worse choice.

The high terms \(z_5, \ldots, z_9\) are sums of \(52\)-bit terms, so they are larger than \(52\) bits. Write these terms in radix \(52\) as \[ z_{5+i} = z_{5+i}' + z_{5+i}'' 2^{52}, \qquad z_{5+i}' < 2^{52}. \] Then the contribution of \(z_{5+i}\), taken modulo \(p\), is \[ \begin{aligned} z_{5+i} 2^{255} 2^{51 i} &= 19 (z_{5+i}' + z_{5+i}'' 2^{52}) 2^{51 i} \\ &= 19 z_{5+i}' 2^{51 i} + 2 \cdot 19 z_{5+i}'' 2^{51 (i+1)} \\ \end{aligned} \] The products \(19 z_{5+i}', 19 z_{5+i}''\) can be written in terms of IFMA operations as \[ \begin{aligned} 19 z_{5+i}' &= \mathrm{lo}(19, z_{5+i}') + 2 \mathrm{hi}(19, z_{5+i}') 2^{51}, \\ 19 z_{5+i}'' &= \mathrm{lo}(19, z_{5+i}'') + 2 \mathrm{hi}(19, z_{5+i}'') 2^{51}. \\ \end{aligned} \] Because \(z_{5+i} < 2^{64}\), \(z_{5+i}'' < 2^{12} \), so \(19 z_{5+i}'' < 2^{17} < 2^{52} \) and \(\mathrm{hi}(19, z_{5+i}'') = 0\). Because IFMA operations ignore the high bits of their source operands, we do not need to compute \(z_{5+i}'\) explicitly: the high bits will be ignored. Combining these observations, we can write \[ \begin{aligned} z_{5+i} 2^{255} 2^{51 i} &= 19 z_{5+i}' 2^{51 i} + 2 \cdot 19 z_{5+i}'' 2^{51 (i+1)} \\ &= \mathrm{lo}(19, z_{5+i}) 2^{51 i} + 2 \mathrm{hi}(19, z_{5+i}) 2^{51 (i+1)} + 2 \mathrm{lo}(19, z_{5+i}/2^{52}) 2^{51 (i+1)}. \end{aligned} \]

For \(i = 0,1,2,3\), this allows reducing \(z_{5+i}\) onto \(z_{i}, z_{i+1}\), and if the low terms are computed using a doubly-redundant representation, no additional shifts are needed to handle the \(2\) coefficients. For \(i = 4\), there's a complication: the contribution becomes \[ \begin{aligned} z_{9} 2^{255} 2^{204} &= \mathrm{lo}(19, z_{9}) 2^{204} + 2 \mathrm{hi}(19, z_{9}) 2^{255} + 2 \mathrm{lo}(19, z_{9}/2^{52}) 2^{255} \\ &= \mathrm{lo}(19, z_{9}) 2^{204} + 2 \mathrm{hi}(19, z_{9}) 19 + 2 \mathrm{lo}(19, z_{9}/2^{52}) 19 \\ &= \mathrm{lo}(19, z_{9}) 2^{204} + 2 \mathrm{lo}(19, \mathrm{hi}(19, z_{9}) + \mathrm{lo}(19, z_{9}/2^{52})). \\ \end{aligned} \]

It would be possible to cut the number of multiplications from 3 to 2 by carrying the high part of each \(z_i\) onto \(z_{i+1}\). This would eliminate 5 multiplications, clearing 2.5 cycles of port pressure, at the cost of 5 additions, adding 1.66 cycles of port pressure. But doing this would create a dependency between terms (e.g., \(z_{5}\) must be computed before the reduction of \(z_{6}\) can begin), whereas with the approach above, all contributions to all terms are computed independently, to maximize ILP and flexibility for the processor to schedule instructions.

This strategy performs 16 IFMA operations, adding two IFMA operations to each of the \(2\)-coefficient terms and one to each of the \(1\)-coefficient terms. Considering the multiplication and reduction together, we use 66 IFMA operations, requiring 33 cycles' throughput, while the longest chain of IFMA operations is in the reduction of \(z_5\) onto \(z_1\), of length 7 (so 28 cycles, plus 2 cycles to combine the two parts of \(z_5\), and the bottleneck is again throughput.

Once this is done, we have computed the product terms \[ z = z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204}, \] without reducing the \(z_i\) to fit in \(52\) bits. Because the overall flow of operations alternates multiplications and additions or subtractions, we would have to perform a reduction after an addition but before the next multiplication anyways, so there's no benefit to fully reducing the limbs at the end of a multiplication. Instead, we leave them unreduced, and track the reduction state using the type system to ensure that unreduced limbs are not accidentally used as an input to a multiplication.

Squaring

Squaring operates similarly to multiplication, but with the possibility to combine identical terms. As before, we write the input as \[ \begin{aligned} x &= x_0 + x_1 2^{51} + x_2 2^{102} + x_3 2^{153} + x_4 2^{204} \end{aligned} \] with limbs in range \([0,2^{52})\). Writing the product terms as \[ \begin{aligned} z &= z_0 + z_1 2^{51} + z_2 2^{102} + z_3 2^{153} + z_4 2^{204} \\ &+ z_5 2^{255} + z_6 2^{306} + z_7 2^{357} + z_8 2^{408} + z_9 2^{459}, \end{aligned} \] a schoolbook squaring in product scanning form takes the form \[ \begin{aligned} z_0 &= x_0 x_0 \\ z_1 &= 2 x_1 x_0 \\ z_2 &= 2 x_2 x_0 + x_1 x_1 \\ z_3 &= 2 x_3 x_0 + 2 x_2 x_1 \\ z_4 &= 2 x_4 x_0 + 2 x_3 x_1 + x_2 x_2 \\ z_5 &= 2 x_4 x_1 + 2 x_3 x_2 \\ z_6 &= 2 x_4 x_2 + x_3 x_3 \\ z_7 &= 2 x_4 x_3 \\ z_8 &= x_4 x_4 \\ z_9 &= 0 \\ \end{aligned} \] As before, we write \(x_i x_j\) as \[ x_i x_j = \mathrm{lo}(x_i,x_j) + 2\mathrm{hi}(x_i,x_j)2^{51}, \] and substitute to obtain \[ \begin{aligned} z_0 &= \mathrm{lo}(x_0, x_0) + 0 \\ z_1 &= 2 \mathrm{lo}(x_1, x_0) + 2 \mathrm{hi}(x_0, x_0) \\ z_2 &= 2 \mathrm{lo}(x_2, x_0) + \mathrm{lo}(x_1, x_1) + 4 \mathrm{hi}(x_1, x_0) \\ z_3 &= 2 \mathrm{lo}(x_3, x_0) + 2 \mathrm{lo}(x_2, x_1) + 4 \mathrm{hi}(x_2, x_0) + 2 \mathrm{hi}(x_1, x_1) \\ z_4 &= 2 \mathrm{lo}(x_4, x_0) + 2 \mathrm{lo}(x_3, x_1) + \mathrm{lo}(x_2, x_2) + 4 \mathrm{hi}(x_3, x_0) + 4 \mathrm{hi}(x_2, x_1) \\ z_5 &= 2 \mathrm{lo}(x_4, x_1) + 2 \mathrm{lo}(x_3, x_2) + 4 \mathrm{hi}(x_4, x_0) + 4 \mathrm{hi}(x_3, x_1) + 2 \mathrm{hi}(x_2, x_2) \\ z_6 &= 2 \mathrm{lo}(x_4, x_2) + \mathrm{lo}(x_3, x_3) + 4 \mathrm{hi}(x_4, x_1) + 4 \mathrm{hi}(x_3, x_2) \\ z_7 &= 2 \mathrm{lo}(x_4, x_3) + 4 \mathrm{hi}(x_4, x_2) + 2 \mathrm{hi}(x_3, x_3) \\ z_8 &= \mathrm{lo}(x_4, x_4) + 4 \mathrm{hi}(x_4, x_3) \\ z_9 &= 0 + 2 \mathrm{hi}(x_4, x_4) \\ \end{aligned} \] To implement these, we group terms by their coefficient, computing those with coefficient \(2\) on set of IFMA chains, and on another set of chains, we begin with coefficient-\(4\) terms, then shift left before continuing with the coefficient-\(1\) terms. The reduction strategy is the same as for multiplication.

Future improvements

LLVM won't use blend operations on 256-bit vectors yet, so there's a bunch of blend instructions that could be omitted.

Although the multiplications and squarings are much faster, there's no speedup to the additions and subtractions, so there are diminishing returns. In fact, the complications in the doubling formulas mean that doubling is actually slower than readdition. This also suggests that moving to 512-bit vectors won't be much help for a strategy aimed at parallelism within a group operation, so to extract performance gains from 512-bit vectors it will probably be necessary to create a parallel-friendly multiscalar multiplication algorithm. This could also help with reducing shuffle pressure.

The squaring implementation could probably be optimized, but without perf support on Cannonlake it's difficult to make actual measurements.

Another improvement would be to implement vectorized square root computations, which would allow creating an iterator adaptor for point decompression that bunched decompression operations and executed them in parallel. This would accelerate batch verification.