The first 10 lines can be replaced with uint64_t blocks = (len+72)/64;
and that will fix the code for the example at hand (as well as extend the message capacity if int
is 32-bit and len
is 64-bit, which can't be told from the code¹). My mental process to construct (len+72)/64
, which works for many things in crypto that divide things into equal size blocks, is as follows:
- When we increase the number of bytes
len
by 64, there must be exactly one more 64-byte (512-bit) block; and the number of blocks increases monotonically with len
. That's all it takes to mathematically imply that the number of blocks is the integer part of $\frac{\mathtt{len}+\mathtt{cst}}{64}$ for some value of the integer constant cst
. By definition of integer division in the C language, (len+cst)/64
will thus compute blocks
as long as nothing overflows. It only remains to find cst
.
- When
len
is 55, the padding byte 0b10000000 and the 8 bytes of length fit a single 64-byte block, since 55+1+8=64; but when len
is 56, we need a second block. That narrows cst
to a precise value. One way to find it is: len
of 56 is the smallest value that makes the desired result 2, hence 56+cst
must be 2*64
, hence cst
is 2*64-56
, that is 72
.
Also, (unsigned char)(((uint64_t)size & (uint64_t)(0xff << i * 8)) >> 8 * i)
does not work as expected. It will fail for len
>536870911 for many platforms, len
>8191 for some. The issue is with (uint64_t)(0xff << i * 8)
where the cast to 64-bit occurs after the shift, which is performed on the width of int
, since that's the type of 0xff
, and yields unspecified result when i*8
reaches or exceeds the bit width of int
. This issue can be fixed with ((uint64_t)0xff << i * 8)
(note the different position of the cast). But it is simpler to use (unsigned char)(size >> i*8)
instead of the whole expression.
The int
type used for the loop index in for (int i = 0; i < len; i++)
often limits how many bytes can be handled, even if the type for len
does not. But that always occurs for larger len
than what hits the above.
The computations related to K
ending in K / 8
, giving the number of zero bytes to append, can be simplified to 63&(unsigned)(55-len)
. Mental process to find this: the desired quantity is an integer that decreases by one when len
increases by one, except when that brings is below zero, in which case the next count is 63. Hence the right expression is mathematically $(\text{cst}-\text{len})\bmod64$ for some $\text{cst}$. When $\text{len}$ is 55, the desired outcome is zero, hence one appropriate $\text{cst}$ is 55 (-7 would also nicely do). There remains to compute $(55-\text{len})\bmod64$ despite the lack of a portable modulo operator². Since the modulus is a power of two, we can³ use 63&(unsigned)(55-len)
.
The result of calloc
is not checked; that will lead to disaster if running out of memory. That's a reason production-quality code should not make a copy of the whole input data; another other is performance⁴.
I may have missed other issues with the code (I did in earlier readings). Crypto code needs to be exact in all cases, and that's harder to ascertain than it seems. Often, when having to chose among alternatives, I minimize constructs which non-portable behavior, and as a secondary criteria use the shorter code, especially if it remove tests or variables, because that tends to lead to code that is easier to prove right.
¹ If len
was 32-bit or narrower, the later uint64_t L = len * 8;
would overflow at a plausible size (at most 256kiB or 512kiB depending on signedness, perhaps as low as 8kiB if len
is 16-bit and signed), and that would occur before len+72
overflows.
² C's %
operator is not the mathematical modulo operator because the result can be negative (at least on some platforms) for negative arguments, and thus has to be brought back to positive with a test, just as the question's source does after K = (448 - L - 1) % 512;
.
³ The simpler 63&(55-len)
will also do. However that manipulates a possibly negative quantity, and some toolchains will bark; also the result has at least the width of len
, which can be unnecessarily large.
⁴ About performance:
- Code that does allocate a block for a copy of all the input has no reason to use
calloc
(which zeroes) when malloc
(which does not) will do, since zeroes are explicitly appended.
- Not all compilers will optimize the copy loop, thus increasing the speed penalty incurred by the copy. That's a good indication for
memcpy
.
- generated code is often better given
for (int i = K/8; --i>=0;)
than it is given for (int i = 0; i < (K / 8); i++)
because:
- a loop comparing to a constant rather than a variable to determine its end can be faster, and the constant 0 is especially nice.
- there's no need to the compiler to determine that
K/8
does not change in the loop, thus can be precomputed, nor need for something to store that intermediary quantity during the loop.
Note: this is borderline on-topic, since we are into programing, and even into the C language and microoptimisations. However these problems are recurrent in implementations of cryptography, speed often matters in crypto, and/thus a lot (perhaps the most used) crypto implementations are in C or similar, or even lower-level languages.
K <= 7
before doingK -= 7
? – Rutrus Sep 29 '21 at 05:40