Faster residue multiplication modulo 521-bit mersenne prime and application to ECC

Ali, Shoukat
We present faster algorithms for the residue multiplication modulo 521-bit Mersenne prime on 32- and 64-bit platforms by using Toeplitz Matrix-Vector Product (TMVP). The total arithmetic cost of our proposed algorithms is less than the existing algorithms and we select the ones, 32- and 64-bit residue multiplication, with the best timing results on our testing machine(s). For the 64-bit residue multiplication we have presented three versions of our algorithm along with their arithmetic cost and from implementation point of view, we provide the timing results of each version. The transition from 64- to 32-bit residue multiplication is full of challenges because the number of limbs becomes double and the bitlength of the limbs reduces by half. We propose three technique for 32-bit residue multiplication such that both the arithmetic cost and the timing results of each one is provided. Without use of any intrinsics and SIMD/assembly instructions in our implementation, on Intel(R) Core i5 -- 6402P CPU @ 2:80GHz, we find 136- and 550-cycle for our 64- and 32-bit residue multiplications, respectively. We implement constant-time variable- and fixed-base scalar multiplication on the standard NIST curve P-521 and Edwards curve E-521. Using our residue multiplication(s), we find E-521 more efficient than P-521 especially for variable-base scalar multiplication.