I am interested in modular arithmetic with respect to the prime $p = 2^{64}-2^{32}+1$. Thomas Pornin has some work on constant time implementation of arithmetic in $\mathsf{GF}(p)$ for this prime (the paper does other things as well --- this is the part relevant to my question).
Using Montgomery arithmetic, the paper provides a constant-time implementation whose measured (and theoretically predicted) costs are:
- addition and subtraction are $\approx 4$ clock cycles, and
- multiplication is $\approx 10$ clock cycles.
I'm curious --- if one does not care about the arithmetic being constant time, how much can this be sped up (if at all)? While I care about arithmetic modulo the stated prime specifically, I would of course be interested in general "rule of thumb" answers as well. I am additionally interested in the setting where one has 128-bit hardware arithmetic support.
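For concreteness, here is a minimal sketch of the kind of variable-time arithmetic I have in mind. It is not taken from Pornin's paper: it skips Montgomery form, forms the full 128-bit product, and reduces it using the identities $2^{64} \equiv 2^{32}-1$ and $2^{96} \equiv -1 \pmod p$, branching freely on carries. The names `reduce128`, `add`, and `mul` are my own, inputs to `add` are assumed to already be reduced, and I have not benchmarked any of this.

```rust
/// p = 2^64 - 2^32 + 1
const P: u64 = 0xFFFF_FFFF_0000_0001;
/// 2^64 mod p = 2^32 - 1
const EPSILON: u64 = 0xFFFF_FFFF;

/// Reduce a 128-bit value modulo p (variable time: branches on carries).
/// Writes x = x_lo + 2^64 * x_hi_lo + 2^96 * x_hi_hi and uses
/// 2^64 = 2^32 - 1 and 2^96 = -1 (mod p).
fn reduce128(x: u128) -> u64 {
    let x_lo = x as u64;
    let x_hi = (x >> 64) as u64;
    let x_hi_hi = x_hi >> 32;
    let x_hi_lo = x_hi & EPSILON;

    // t0 = x_lo - x_hi_hi (mod p). On borrow the wrapped value is too
    // large by 2^64, and 2^64 = EPSILON (mod p), so subtract EPSILON.
    let (mut t0, borrow) = x_lo.overflowing_sub(x_hi_hi);
    if borrow {
        t0 = t0.wrapping_sub(EPSILON);
    }

    // x_hi_lo < 2^32, so this product is at most (2^32 - 1)^2 < 2^64.
    let t1 = x_hi_lo * EPSILON;

    // t2 = t0 + t1 (mod p). On carry the wrapped value is too small
    // by 2^64, so add EPSILON back.
    let (mut t2, carry) = t0.overflowing_add(t1);
    if carry {
        t2 = t2.wrapping_add(EPSILON);
    }

    // Final conditional subtraction to get a canonical result in [0, p).
    if t2 >= P {
        t2 -= P;
    }
    t2
}

/// Addition modulo p; assumes a, b < p.
fn add(a: u64, b: u64) -> u64 {
    let (s, carry) = a.overflowing_add(b);
    if carry || s >= P {
        s.wrapping_sub(P)
    } else {
        s
    }
}

/// Multiplication modulo p via a full 64x64 -> 128-bit product.
fn mul(a: u64, b: u64) -> u64 {
    reduce128((a as u128) * (b as u128))
}
```

With this, `mul(a, b)` should agree with `((a as u128 * b as u128) % (P as u128)) as u64`, which is the baseline I would compare against on a machine with a native 64x64 to 128-bit multiply.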