For 256 bits "Double-and-add" multiplication uses about 128 addings and 256 doublings, where "Montgomery ladder" uses near 256 addings and 256 doublings. "Double-and-add" in my Intel(R) Core(TM) i3-8130U CPU @ 2.20GHz takes 0.02 s, it is normal or too big?
In adding we compute $\lambda = \frac{y_q - y_p}{x_q - x_p}$. Main cost is computing inverse of $x_q - x_p$ and inverse uses extended_gcd and modulo:
def extended_gcd(aa, bb):
lastremainder, remainder = abs(aa), abs(bb)
x, lastx, y, lasty = 0, 1, 1, 0
while remainder:
lastremainder, (quotient, remainder) = remainder, divmod(lastremainder, remainder)
x, lastx = lastx - quotient * x, x
y, lasty = lasty - quotient * y, y
return lastremainder, lastx * (-1 if aa < 0 else 1), lasty * (-1 if bb < 0 else 1)
calculate modular inverse
def modinv(a, m):
g, x, y = extended_gcd(a, m)
if g != 1:
raise ValueError
return x % m
I must many times compute this inverse to obtain $\lambda$ or between [n]P and [n+1]P is the same $\lambda$ (or can be fast comnputed) as is between [n+1] and [n+2]P ?