Interesting implementation practices in popular crypto library

Question

So reading through the salsa20 implementation for the crypto library for go-lang, I noticed something interesting. Specifically this portion. It looks like something I would expect from someone learning programming. They took 80 lines of code of what could have been easily done using an array and iteration. Looking up other implementations I found this unofficial one, which does use iteration. Looking at other they all choose to do this portion using a loop. But these are not the official ones. So why doesn't the official one use the more terse, but equivalent way?

My question really is, is there some security benefit to doing things without an extra array and loop? Is it speed? Is there some attack when you have an extra array and loop?

Part of the reason for that structure is to make it independent of byte ordering. — kasperd, Mar 11 '17 at 00:59

score 29 · Accepted Answer · edited Apr 13 '17 at 12:48

29

This appears to be a simple case of Loop Unrolling, which is a generic software optimization. Implementations of crypto algorithms are often times "unrolled", which means taking the statements that normally go in a loop and hardcoding them in the order they would have happened. This is a performance optimization: The loop counter requires cycles to increment and compare, as well as register space to do either (which may imply additional MOV instructions as well).

So the answer is speed, not security — optimizations such as loop unrolling do not change the output of the algorithm, they are an implementation choice.

Chances are, if you describe an algorithm to 10 different people and ask them to implement it, you will receive 10 different variations of the same program - that all still produce the same output, albeit at varying speeds and implementation complexities. Especially at the C level, performance often times comes at the cost of code clarity.

Theoretically speaking, which means disregarding side channel attacks, the implementation should influence only the performance of the algorithm, not the security. In practice however, due to the potential for side channels, the implementation may influence security, though loop unrolling will most likely not be the culprit. If this was the case, the lack of security would not be visible in the output of the algorithm - side channels exploit information that is only present when the actual algorithm is physically implemented in practice, and the information is extracted via other means.

Why is this not left to compiler optimizations

We do not necessarily like to let the compiler aggressively optimize our crypto code. - it can do things that are less than desirable. Cipher algorithms are usually relatively small and very simple in their implementation (if not conceptually so), and sometimes are implemented directly in Assembly to maximize performance or minimize code size.

Loop unrolling is used in hardware implementations as well.

Why is this done in the official implementations

The official implementations are the ones actually being used by the largest numbers of people, and so obviously should be the most efficient ones available. This is just a common courtesy really, as CPU cycles cost dollars as well as time. This is not particular to cryptography, but is most likely true of software in general.

The "unofficial" implementations you have seen were probably written by somewhere learning the algorithm, for which clarity is a priority, rather then efficiency.

edited Apr 13 '17 at 12:48

Community

1

answered Mar 10 '17 at 17:18

Ella Rose

19,603
6
53
101

Are you sure? I'm not familiar enough with crypto libraries specifically but I can say that this would *never* be tolerated anywhere else at all. For this specifically I would be *extremely* surprised if the compiler wouldn't unroll that loop for you, optimizing away the loop counter (it iterates a constant number of times!) – Prime Mar 10 '17 at 23:27
3

@Prime compiler doesn't always unroll loop that has constant iteration number. Compiler tries to estimate how often specific thing will be called and then applies optimizations if it determines it is "too often", but chances are prediction will failsafe to "dont unroll". I'd say this is correct explanation, and maybe you have too good expectations of optimizers in modern compilers, if they were so great we wouldn't touch assembler (look at OpenSSL). – axapaxa Mar 10 '17 at 23:38
26

As the author of the mentioned code: this answer is correct. Latest Go compilers will probably unroll this loop, but early compilers (<1.5) didn't, so I had to unroll loops manually for performance reasons. – dchest Mar 11 '17 at 00:10
3

By the way, crypto/md5 package has a whole template-based code generator to unroll and inline code: https://golang.org/src/crypto/md5/gen.go – dchest Mar 11 '17 at 00:14
4

@dchest Out of curiosity, did you performance profile the loop unrolling, or was this more a matter of standard practice? I ask just because loop unrolling's a situational optimization; it can increase code size, causing more cache misses, which can reduce performance by an order of magnitude in some cases. I ask not to provide criticism, but out of interest in real-world professional programming practices. – Nat Mar 11 '17 at 00:16
3

@Nat yes, I did. – dchest Mar 11 '17 at 00:17
1

The code would be much improved if it included comments containing brief versions of the explanations above, maybe along with the code for the equivalent loop being unrolled, so the OP wouldn't have had to wonder why the code was written that way. – Witness Protection ID 44583292 Mar 11 '17 at 05:04
@dchest, I'm not familiar with language, but doesn't it have any kind of macro/templating functionality so you could make manual unroll, but still make it look compact in sources? – Oleg V. Volkov Mar 11 '17 at 18:25
@OlegV.Volkov there are no macros in Go, so no, it's not possible to do compile-time generation of code. However, there's a go generate command which can be used to manually do this as mentioned in MD5 example above. – dchest Mar 11 '17 at 19:18
@dchest I stand corrected, thanks for clearing that up. – Prime Mar 12 '17 at 00:39
@axapaxa Very good point, though I've always thought crypto libraries need assembler for reasons other than performance (e.g. using AES instructions or forcing things to stay in registers)? – Prime Mar 12 '17 at 00:42
@dchest I wrote an implementation of "Skein from spec" which actually didn't do any optimizations. It can be helpful as well to have an implementation without all the bells and whistles. It's pretty useful to see a specification in action. It allowed me to send corrected test vectors out to the Skein team before they corrected them theirselves :) Just something to consider in case you have to do it again (sorry, "Keccak from spec" and the reworked "Skein from spec" with the new constants is still under construction). – Maarten Bodewes Mar 13 '17 at 23:30

score 2 · Answer 2 · edited Oct 07 '21 at 07:34

FWIW, That's not even exceptional. Consider the md5c.c source code file in RFC 1321 from 1992 (Appendix A).

It defines a few preprocessor macros (not even proper functions in the C sense), and then uses them in a very repeat-y manner, see below.

Though one wonders if this is a publication issue, not letting software engineering issues get in the way of writing an unambiguous and obviously correct reference implementation. It also might be an issue of manually doing the trivial optimisation to get a speed-up (which looks good in benchmarks), and leaving the time-space tradeoff for others to consider.

/* F, G, H and I are basic MD5 functions. */
#define F(x, y, z) (((x) & (y)) | ((~x) & (z)))
....
/* FF, GG, HH, and II transformations for rounds 1, 2, 3, and 4. ...
#define FF(a, b, c, d, x, s, ac) { \
 (a) += F ((b), (c), (d)) + (x) + (UINT4)(ac); \
 (a) = ROTATE_LEFT ((a), (s)); \
 (a) += (b); \
  }
#define GG...
/* MD5 basic transformation. Transforms state based on block. /
static void MD5Transform (state, block)
{ 
  ...
/ Round 1 /
  FF (a, b, c, d, x[ 0], S11, 0xd76aa478); / 1 /
  FF (d, a, b, c, x[ 1], S12, 0xe8c7b756); / 2 /
  ...
  FF (b, c, d, a, x[15], S14, 0x49b40821); / 16 /

 / Round 2 /
  GG (a, b, c, d, x[ 1], S21, 0xf61e2562); / 17 /
  GG (d, a, b, c, x[ 6], S22, 0xc040b340); / 18 /
  ...
 / and similarly up to 64 */

Interesting implementation practices in popular crypto library

2 Answers2

Linked