How to iteratively calculate a^e mod n with a 4096-bit modulus n

Question

On most sites the exponent of the RSA public key is 24 bits, but the modulus can be as large as 4096 bits. I have an accelerator that calculates $a^e \bmod n$ for a modulus of at most 2112 bits.

Is there a way to split $n$ (maybe bitwise) and use the accelerator in several iterations to calculate the desired $a^e \bmod n$?

Accelerating public key operations seems a bit unusual. Typically it's the private key operation that needs acceleration since it's 100x as expensive. If performance of public key operations is really an issue, I'd consider using e=3 — CodesInChaos, Oct 01 '13 at 14:17
@CodesInChaos: I believe he is getting the public keys from somewhere else, hence they are not under his control — poncho, Oct 01 '13 at 14:20
@CodesInChaos wouldn't there be some security concerns for e=3? — Maarten Bodewes, Oct 02 '13 at 15:21
Actually, the exponent is normally 17 bits (65537 or 0x010001) in an unsigned representation, but it is normally padded to 24 bits to fit into a whole number of bytes. — Maarten Bodewes, Oct 02 '13 at 15:23
@owlstead I'm not aware of any issues, provided proper padding is used. — CodesInChaos, Oct 02 '13 at 21:09

score 6 · Answer 1 · answered Oct 01 '13 at 15:42

Usually, RSA operations with the public exponent are fast, precisely because the public exponent is short. Hardware accelerators are meant to speed up operations with the private key, which need it much more. In particular, hardware accelerators do not need to be "full width": private key operations use the private key, which contains the factors $p$ and $q$, so most of the work can be done modulo $p$ and modulo $q$, each of which is half the length of the modulus $n$ (this is what the CRT is about).
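To illustrate the CRT remark, here is a minimal sketch of a private-key operation done with two half-size exponentiations and Garner's recombination. The function name `rsa_private_crt` and the toy parameters are assumptions made for illustration only; real keys use primes of 1024 bits or more and proper padding.

```python
def rsa_private_crt(c, p, q, d):
    """Compute c^d mod (p*q) with two exponentiations modulo p and q."""
    dp = d % (p - 1)            # reduced exponent for the work modulo p
    dq = d % (q - 1)            # reduced exponent for the work modulo q
    q_inv = pow(q, -1, p)       # q^-1 mod p (needs Python 3.8+)
    m_p = pow(c % p, dp, p)     # half-size exponentiation modulo p
    m_q = pow(c % q, dq, q)     # half-size exponentiation modulo q
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q          # Garner's recombination, result in [0, p*q)

# Toy parameters (far too small for real use)
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))
c = pow(65, e, n)
print(rsa_private_crt(c, p, q, d))  # 65
```

Only moduli the size of $p$ and $q$ ever reach the exponentiation routine, which is why a half-width accelerator is enough for private-key work.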

A side note is that some implementation methods for exponentiation are fast but imply fixed startup and wrap-up overheads; e.g. Montgomery multiplication is popular for speeding up exponentiations (which consist of many multiplications in a row), but requires converting source values into a special representation, and back at the end, both operations having a non-negligible cost. For a private key operation, the exponentiations use big exponents, and the savings implied by Montgomery multiplication dwarf these overheads, but this might not be true for a short exponent. Therefore, even if your accelerator did support a 4096-bit modulus, it is unclear whether this would really have provided some speedup for your public-key operations.
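The conversion overhead can be seen in a minimal sketch of Montgomery exponentiation. It uses Python integers rather than word-level arithmetic, so it shows the structure only, not the performance; the names `redc` and `mont_pow` are assumptions, and the modulus must be odd.

```python
def redc(t, n, r_bits, n_prime):
    """Montgomery reduction: returns t * R^-1 mod n, for 0 <= t < R*n."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> r_bits      # exact division by R = 2^r_bits
    return u - n if u >= n else u

def mont_pow(a, e, n):
    """a^e mod n for odd n, with every product reduced by redc."""
    r_bits = n.bit_length()
    r = 1 << r_bits                        # R is a power of two above n
    n_prime = (-pow(n, -1, r)) % r         # -n^-1 mod R (Python 3.8+)
    a_m = (a % n) * r % n                  # startup cost: convert a into Montgomery form
    x_m = r % n                            # the value 1 in Montgomery form
    for bit in bin(e)[2:]:                 # left-to-right square-and-multiply
        x_m = redc(x_m * x_m, n, r_bits, n_prime)
        if bit == "1":
            x_m = redc(x_m * a_m, n, r_bits, n_prime)
    return redc(x_m, n, r_bits, n_prime)   # wrap-up cost: convert back

print(mont_pow(65, 17, 3233) == pow(65, 17, 3233))  # True
```

With $e = 65537$ the loop performs fewer than twenty products, so the setup of `n_prime`, the conversion into Montgomery form and the conversion back make up a noticeable share of the total work.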

Anyway, I see no easy way to "split" a 4096-bit exponentiation into several exponentiations modulo smaller values, unless you factor the 4096-bit modulus, which should be infeasible.

+1 for noting that Montgomery multiplication can be a dog when it comes to RSA public key exponentiation. One day I'll write that article about how it is not a significant time saver even for RSA private key exponentiation, compared to carefully crafted code / algorithm for classical modular reduction. — fgrieu, Oct 01 '13 at 16:27
Having implemented both classical modular reduction and Montgomery multiplication, I would claim that Montgomery multiplication is a huge time saver: it saves development time. As for execution time, yeah, quadratic is quadratic and classical modular reduction also works on a per-word basis. — Thomas Pornin, Oct 02 '13 at 21:28

fgrieu · Answer 2 · 2013-10-01T16:31:45.373

I am not aware of any method that would let one make good use of a black box or API computing $f(a,e,n)=a^e\bmod n$ for $n$ of up to $2112$ bits, to efficiently compute $f(a,e,n)=a^e\bmod n$ with $n$ above that bound (like $4096$ bits), unless that bigger $n$ has known factorization into terms of at most $2112$ bits (in which case the usual CRT technique applies and significantly helps).
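As a hedged sketch of the known-factorization case: the function `accel_modexp` below is a hypothetical stand-in for the accelerator (it simply rejects moduli wider than 2112 bits), and the two factors are arbitrary coprime numbers chosen so that their product has 4096 bits. They are not RSA primes; a real RSA modulus has no such publicly known factorization.

```python
MAX_BITS = 2112  # width limit quoted in the question

def accel_modexp(a, e, m):
    """Hypothetical stand-in for the accelerator's a^e mod m."""
    if m.bit_length() > MAX_BITS:
        raise ValueError("modulus too wide for the accelerator")
    return pow(a, e, m)

def modexp_known_factors(a, e, p, q):
    """a^e mod (p*q) from two narrow calls, assuming gcd(p, q) = 1."""
    x_p = accel_modexp(a % p, e, p)
    x_q = accel_modexp(a % q, e, q)
    h = (pow(q, -1, p) * (x_p - x_q)) % p   # CRT recombination (Python 3.8+)
    return x_q + h * q

# Illustrative coprime factors; p * q = 2^4096 - 1 has 4096 bits
p = (1 << 2048) - 1
q = (1 << 2048) + 1
print(modexp_known_factors(12345, 65537, p, q) == pow(12345, 65537, p * q))  # True
```

Calling `accel_modexp` directly with the 4096-bit product raises the error, which is the situation the question describes when the factors are unknown.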

That issue is encountered when one wants to compute the RSA public key function for a 4096-bit key on top of software (or an API to hardware) limited to $2048$ bits and then some.

Especially if $e$ is small (like $65537$, $17$, $3$, or $2$), it is sometimes possible to do a fast-enough software-only implementation in assembly language (which typically beats C by a decimal order of magnitude, and interpreted bytecode much more so). And for the purpose of signature verification, this is unquestionably safe.
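To give a feel for how little work a small exponent requires, here is a minimal sketch that counts the modular multiplications in plain left-to-right square-and-multiply. The function name `modexp_count` is an assumption, and Python integers stand in for the multi-word arithmetic an assembly implementation would use.

```python
def modexp_count(a, e, n):
    """Return (a^e mod n, squarings, multiplications) for e >= 1."""
    result = a % n
    squarings = multiplications = 0
    for bit in bin(e)[3:]:          # skip '0b' and the leading 1 bit
        result = result * result % n
        squarings += 1
        if bit == "1":
            result = result * a % n
            multiplications += 1
    return result, squarings, multiplications

n = (1 << 4096) - 1                 # arbitrary 4096-bit modulus, illustration only
_, s, m = modexp_count(12345, 65537, n)
print(s, m)                         # 16 1
```

For $e = 65537$ that is 17 modular multiplications in total, and for $e = 3$ only two, compared with thousands for a full-size private exponent.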

But even if $e$ is small, if the context is a JavaCard Smart Card without any way to evade the JavaCard Virtual Machine, I'm afraid there is no practical solution, unless execution time is not an issue.
