A classical table-based AES implementation would achieve about 160 MB/s on my current computer (a fairly recent MacBook Pro). However, one can do better; of course there are the AES-NI instructions, which easily push speed on that machine to the 5 GB/s mark (with a parallel mode such as AES-CTR; AES-CBC encryption is much slower). But even without these instructions, the Käsper-Schwabe implementation of AES-CTR would offer more than 400 MB/s, a substantial improvement.
Looking outside of AES, there is ChaCha20, as specified in RFC 7539. Using my own implementations, the purely generic, 32-bit plain C code (chacha20_ct) encrypts or decrypts data at 385 MB/s on my laptop; the SSE2-enhanced implementation (chacha20_sse2) reaches 584 MB/s.
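To give an idea of why plain 32-bit C code can already be this fast, here is a minimal sketch of the ChaCha20 quarter round as described in RFC 7539: only 32-bit additions, XORs and fixed-distance rotations, with no table lookups. This is an illustration written for this answer, not an excerpt from chacha20_ct; the names rotl32 and chacha_quarter_round are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/*
 * One ChaCha20 quarter round on four 32-bit state words.
 * Only add, xor and rotate are used; nothing depends on
 * secret-indexed memory accesses.
 */
void chacha_quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

A full block function applies this operation to the columns and diagonals of a 16-word state for 20 rounds; since the four column (or diagonal) quarter rounds are independent of each other, they map naturally onto SIMD registers, which is presumably where an SSE2 version gets its extra speed.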
Generally speaking, block ciphers like AES are versatile primitives, and it can be argued that, by forfeiting versatility and concentrating on the encryption/decryption role, better performance may be achieved. This is what stream ciphers like ChaCha20 are about.
About ten years ago, there was the eSTREAM project which resulted in a portfolio of stream ciphers. On my laptop, SOSEMANUK achieves about 1.64 GB/s, which is not bad for a design from ten years ago. Notably, it is 10 times faster than the table-based AES. (I wrote part of the code; I don't know who packaged it as a Zip archive with modified file names that break compilation.)
Among more modern designs, one may cite NORX. I encountered an implementation on small ARM systems that was consistently trouncing ChaCha20. I suppose it would also clear the 1 GB/s mark on a modern PC.
Summary: 1 GB/s is actually highly feasible with existing algorithms, on standard hardware, without using the AES instructions, and without sacrificing security: all of the above are currently unbroken, despite extensive exposure to vindictive cryptographers.
Of course, excluding the AES-NI instructions is rather artificial: it makes relatively little sense to run benchmarks on a modern CPU without using the features of that CPU. Performance on smaller, embedded systems without a hardware AES implementation may be more relevant.
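For readers who want to reproduce this kind of figure, the following is a rough sketch of how a MB/s number can be obtained in C. It is not the harness used for the numbers above: throughput_mb_s, bench and xor_placeholder are hypothetical names, the XOR routine is a stand-in for a real cipher call, and 1 MB is taken to mean 10^6 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature assumed for the routine under test (hypothetical). */
typedef void (*stream_fn)(uint8_t *buf, size_t len);

/* Convert a byte count and a duration into MB/s (1 MB = 10^6 bytes). */
double throughput_mb_s(size_t total_bytes, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0;   /* too fast to measure with this clock */
    }
    return (double)total_bytes / 1e6 / seconds;
}

/* Placeholder workload: NOT a cipher, only something to time. */
void xor_placeholder(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= (uint8_t)(i * 31u + 7u);
    }
}

/* Run fn over the same buffer `rounds` times and report MB/s. */
double bench(stream_fn fn, uint8_t *buf, size_t len, int rounds)
{
    clock_t t0 = clock();
    for (int r = 0; r < rounds; r++) {
        fn(buf, len);
    }
    clock_t t1 = clock();
    double seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return throughput_mb_s(len * (size_t)rounds, seconds);
}
```

clock() measures processor time with limited resolution, so the buffer and round count should be large enough for the run to last at least a second or so; compiler flags and CPU frequency scaling will also move the result noticeably.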
openssl speed -elapsed -evp aes-128-ccm is giving me values in the range of 300+GB/s (and it seems to be using just one core). – Lery Nov 08 '17 at 12:19