How to pad a 448 bit message for SHA256?

Question

I recently implemented a very simple SHA256 hashing program in C++. Since the SHA256 hash function's block size is 512 bits (64 bytes), and messages have to be padded to a multiple of that size as per the FIPS standard, I implemented a padding function following their exact rules.

$K + 1 + L \equiv 448 \pmod{512}$, where $L$ is the message length in bits and $K$ is the number of $0$ bits that should be appended after the $1$ bit previously appended to the message.

I was testing my code with test vectors that I got from this website: https://www.di-mgt.com.au/sha_testvectors.html

Unfortunately, it failed for the 448-bit test vector abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq, but worked well for the previous test vectors. Here is the snippet of C++ code I wrote that is responsible for padding. Obviously, there is an algorithmic flaw in it that comes from my misunderstanding!

int blocks = 0;
if (len % 64) // len is the message length
{
    blocks = (len + (64 - (len % 64))) / 64;
}
else
{
    // I realized (the hard way) that messages that are of size 512 bits or its multiples have an additional block that contains padding info and they run through the compression function.
    blocks = (len / 64) + 1;
}
unsigned char *s = (unsigned char *)calloc(blocks * 64, sizeof(unsigned char)); // Internal storage for the message with padding.
for (int i = 0; i < len; i++)
{
    s[i] = m[i]; // m is the message passed to the function, that has to be hashed.
}
int K = 0;
uint64_t L = len * 8;
uint64_t size = L;

K = (448 - L - 1) % 512;
if (K < 0)
{
    K += 512;
}

s[len] = 0b10000000; // or 0x80
K -= 7;
for (int i = 0; i < (K / 8); i++)
{
    s[i + len + 1] = 0; // The zero bytes...
}
    // Append the length of the message as a 64 bit big endian integer
for (int i = 0; i < 8; i++)
{
    s[blocks * 64 - 1 - i] = (unsigned char)(((uint64_t)size & (uint64_t)(0xff << i * 8)) >> 8 * i);
}

I would like to create a padding method that conforms to the FIPS standard for any bit length whatsoever... I realize that with a message of length 448 bits, my padding will overflow to the other block. Should the other block be padded in the same way a message would be padded? What changes should I make in the algorithm to fit all message lengths?

Could K be <= 7 before doing K -= 7? – Rutrus Sep 29 '21 at 05:40

score 2 · Answer 1 · edited Apr 06 '20 at 19:45

The padding method for SHA-256 (assuming you're byte-oriented, as it appears you are) is: append a 0x80 byte, then add 0x00 bytes until the length modulo 64 bytes is 56, then append the 8-byte length. This implies that if the original message length modulo 64 is 56 or larger, you'll need to do one more hash compression operation.
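As a worked check of this rule against the bit-level formula quoted in the question, here is the arithmetic for the 448-bit (56-byte) test vector, with $L$ the message length in bits and $K$ the number of zero bits:

$$
\begin{aligned}
L = 448:\quad & K + 1 + 448 \equiv 448 \pmod{512} \;\Rightarrow\; K \equiv -1 \equiv 511 \pmod{512},\\
& L + 1 + K + 64 = 448 + 1 + 511 + 64 = 1024 = 2 \times 512 .
\end{aligned}
$$

In bytes, that is 56 message bytes, the 0x80 byte (the $1$ bit plus 7 of the zero bits), 63 bytes of 0x00 (the remaining 504 zero bits), and the 8-byte length: 128 bytes, so two blocks.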

That means that your computation of the number of hash compression operations you'll need:

if (len % 64) {
    blocks = (len + (64 - (len % 64))) / 64;
} else {
    blocks = (len / 64) + 1;
}

is wrong; if len=56 (as in this case), this sets blocks=1, and thus you don't allocate enough space for the padding: 56+1+8 = 65 bytes do not fit in one 64-byte block. More correct would be:

/* Compute the number of bytes of the message plus mandatory padding */
plen = len + 9;   /* 1 for the 0x80 byte, 8 for the length */

/* Number of blocks is the number of message plus padding bytes */
/* div 64, rounded up */
blocks = (plen + 63) / 64;

Also, the number of 0x00 bytes you'll need to add between the 0x80 byte and the length field is precisely 64 * blocks - plen.
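Putting these pieces together, a minimal sketch of a byte-oriented padding routine could look like the following. The function name sha256_pad and the use of std::vector are assumptions made for illustration; they are not from the question's code, which uses calloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (name is an assumption): returns the message followed
// by its SHA-256 padding, for a message of len bytes.
std::vector<unsigned char> sha256_pad(const unsigned char *m, std::size_t len)
{
    const std::size_t plen = len + 9;            // 1 for the 0x80 byte, 8 for the length
    const std::size_t blocks = (plen + 63) / 64; // plen div 64, rounded up

    // The vector is zero-filled, so the 64 * blocks - plen bytes of 0x00
    // between the 0x80 byte and the length are already in place.
    std::vector<unsigned char> s(blocks * 64, 0);
    for (std::size_t i = 0; i < len; i++)
        s[i] = m[i];
    s[len] = 0x80;

    // Message length in bits, big-endian, in the last 8 bytes.
    const std::uint64_t bits = static_cast<std::uint64_t>(len) * 8;
    for (int i = 0; i < 8; i++)
        s[s.size() - 1 - i] = static_cast<unsigned char>(bits >> (8 * i));

    assert(s.size() % 64 == 0);
    return s;
}
```

For the 56-byte test vector this yields 128 bytes (two blocks), with 0x80 at offset 56 and the length 448 = 0x01C0 in the last two bytes.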

edited Apr 06 '20 at 19:45 by Aravind A

answered Apr 06 '20 at 18:56 by poncho

Thank you very much for your answer! – Aravind A Apr 06 '20 at 19:46
Can you please explain the reason for adding 63 to plen? – Aravind A Apr 06 '20 at 21:02
@VivekanandV: plen div 64 (rounded up) can be implemented by adding 64-1, and then doing div 64 (rounding down) – poncho Apr 06 '20 at 21:33
Are we doing this because if len becomes zero then we still want blocks to be 1? – Aravind A Apr 06 '20 at 22:45
@VivekanandV: plen is the number of bytes we know we have to include in the hash (the message itself plus the bytes of padding we always add); we add enough 0x00 bytes to make a whole number of blocks. That is, the number of blocks is plen divided by 64, but rounded up (because we add 0x00 bytes until we hit the next multiple). (plen+63)/64 is just a way to implement this 'div rounded up' operation – poncho Apr 07 '20 at 12:13

fgrieu · Accepted Answer · 2020-04-07T16:20:04.260

The first 10 lines can be replaced with uint64_t blocks = (len+72)/64; and that will fix the code for the example at hand (as well as extend the message capacity if int is 32-bit and len is 64-bit, which can't be told from the code¹). My mental process to construct (len+72)/64, which works for many things in crypto that divide things into equal size blocks, is as follows:

- When we increase the number of bytes len by 64, there must be exactly one more 64-byte (512-bit) block; and the number of blocks increases monotonically with len. That's all it takes to mathematically imply that the number of blocks is the integer part of $\frac{\mathtt{len}+\mathtt{cst}}{64}$ for some value of the integer constant cst. By definition of integer division in the C language, (len+cst)/64 will thus compute blocks as long as nothing overflows. It only remains to find cst.
- When len is 55, the padding byte 0b10000000 and the 8 bytes of length fit a single 64-byte block, since 55+1+8=64; but when len is 56, we need a second block. That narrows cst to a precise value. One way to find it is: len of 56 is the smallest value that makes the desired result 2, hence 56+cst must be 2*64, hence cst is 2*64-56, that is 72.
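The reasoning above can be spot-checked with a small sketch. The names sha256_blocks and check_sha256_blocks are hypothetical, and the check only covers small lengths, where len + 72 cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name is an assumption): number of 64-byte blocks
// occupied by the padded message, for a message of len bytes.
constexpr std::uint64_t sha256_blocks(std::uint64_t len)
{
    return (len + 72) / 64;
}

// Compares the closed form with "message + 0x80 + 8 length bytes,
// rounded up to a multiple of 64" for small lengths.
void check_sha256_blocks()
{
    for (std::uint64_t len = 0; len < 1024; len++) {
        const std::uint64_t needed = len + 1 + 8;         // bytes that must be present
        const std::uint64_t rounded = (needed + 63) / 64; // div 64, rounded up
        assert(sha256_blocks(len) == rounded);
    }
}
```

The boundary the derivation relies on is len = 55 giving one block and len = 56 giving two.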

Also, (unsigned char)(((uint64_t)size & (uint64_t)(0xff << i * 8)) >> 8 * i) does not work as expected. It will fail for len>536870911 on many platforms, and for len>8191 on some. The issue is with (uint64_t)(0xff << i * 8), where the cast to 64-bit occurs after the shift. The shift is performed on the width of int, since that's the type of 0xff, and it is undefined behavior when i*8 reaches or exceeds the bit width of int. This issue can be fixed with ((uint64_t)0xff << i * 8) (note the different position of the cast). But it is simpler to use (unsigned char)(size >> i*8) instead of the whole expression.
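A minimal sketch of the simpler form, wrapped in a helper whose name (store_be64) is an assumption for illustration. Every shift is applied to the 64-bit value, so none happens at the width of int.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name is an assumption): stores size as a 64-bit
// big-endian integer into out[0..7].
void store_be64(unsigned char *out, std::uint64_t size)
{
    assert(out != nullptr);
    for (int i = 0; i < 8; i++) {
        // size is 64-bit, so shifting by up to 56 bits is well defined;
        // the cast keeps only the low 8 bits of the shifted value.
        out[7 - i] = static_cast<unsigned char>(size >> (8 * i));
    }
}
```

In the question's code, the destination would be the last 8 bytes of the padded buffer, that is s + blocks * 64 - 8.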

The int type used for the loop index in for (int i = 0; i < len; i++) often limits how many bytes can be handled, even if the type for len does not. But that limit is always reached at a larger len than the one that triggers the issue above.

The computations related to K ending in K / 8, giving the number of zero bytes to append, can be simplified to 63&(unsigned)(55-len). Mental process to find this: the desired quantity is an integer that decreases by one when len increases by one, except when that brings it below zero, in which case the next count is 63. Hence the right expression is mathematically $(\text{cst}-\text{len})\bmod64$ for some $\text{cst}$. When $\text{len}$ is 55, the desired outcome is zero, hence one appropriate $\text{cst}$ is 55 (-9 would also nicely do, since it is congruent to 55 modulo 64). There remains to compute $(55-\text{len})\bmod64$ despite the lack of a portable modulo operator². Since the modulus is a power of two, we can³ use 63&(unsigned)(55-len).
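As a sketch, the expression can be cross-checked against the block count for small lengths. The helper names are hypothetical, and len is assumed to be an unsigned 64-bit byte count, which the question's code does not confirm.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (name is an assumption): number of 0x00 bytes between
// the 0x80 byte and the 8-byte length field.
constexpr unsigned sha256_zero_bytes(std::uint64_t len)
{
    // 55 - len wraps modulo 2^64 when len > 55; the mask keeps the low 6 bits,
    // which is the mathematical (55 - len) mod 64.
    return 63u & static_cast<unsigned>(55 - len);
}

// Checks that message + 0x80 + zero bytes + length fills the blocks exactly.
void check_sha256_zero_bytes()
{
    for (std::uint64_t len = 0; len < 1024; len++) {
        const std::uint64_t blocks = (len + 72) / 64;
        assert(sha256_zero_bytes(len) == blocks * 64 - (len + 9));
    }
}
```

For len = 55 this gives 0 zero bytes, and for len = 56 it gives 63, matching the description above.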

The result of calloc is not checked; that will lead to disaster if running out of memory. That's a reason production-quality code should not make a copy of the whole input data; another is performance⁴.
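One possible way to avoid copying the whole input, sketched under assumptions: hash the full 64-byte blocks directly from the caller's buffer, and build only the final one or two padded blocks in a fixed 128-byte buffer. The function name sha256_tail and its interface are hypothetical, and the compression function itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (name and interface are assumptions): writes the final
// one or two padded blocks into tail and returns how many blocks were written.
// m points to the whole message of len bytes.
std::size_t sha256_tail(const unsigned char *m, std::uint64_t len, unsigned char tail[128])
{
    const std::size_t rem = static_cast<std::size_t>(len % 64); // bytes after the last full block
    const std::size_t blocks = (rem + 72) / 64;                  // 1 if rem <= 55, else 2

    std::memset(tail, 0, 128);
    if (rem != 0)
        std::memcpy(tail, m + (len - rem), rem);
    tail[rem] = 0x80;

    // Length of the whole message in bits, big-endian, at the end of the tail.
    const std::uint64_t bits = len * 8;
    for (int i = 0; i < 8; i++)
        tail[blocks * 64 - 1 - i] = static_cast<unsigned char>(bits >> (8 * i));

    assert(blocks == 1 || blocks == 2);
    return blocks;
}
```

The first len - rem bytes of m would be fed to the compression function 64 bytes at a time, followed by the returned number of blocks from tail. No heap allocation is involved, so there is no allocation failure to handle.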

I may have missed other issues with the code (I did in earlier readings). Crypto code needs to be exact in all cases, and that's harder to ascertain than it seems. Often, when having to choose among alternatives, I minimize constructs with non-portable behavior, and as a secondary criterion use the shorter code, especially if it removes tests or variables, because that tends to lead to code that is easier to prove right.

¹ If len was 32-bit or narrower, the later uint64_t L = len * 8; would overflow at a plausible size (at most 256 MiB or 512 MiB depending on signedness, perhaps as low as 4 kiB if len and int are 16-bit and signed), and that would occur before len+72 overflows.

² C's % operator is not the mathematical modulo operator because the result can be negative (at least on some platforms) for negative arguments, and thus has to be brought back to positive with a test, just as the question's source does after K = (448 - L - 1) % 512;.

³ The simpler 63&(55-len) will also do. However that manipulates a possibly negative quantity, and some toolchains will bark; also the result has at least the width of len, which can be unnecessarily large.

⁴ About performance:

- Code that does allocate a block for a copy of all the input has no reason to use calloc (which zeroes) when malloc (which does not) will do, since zeroes are explicitly appended.
- Not all compilers will optimize the copy loop, which increases the speed penalty incurred by the copy. That's a good case for memcpy.
- Generated code is often better given for (int i = K/8; --i>=0;) than it is given for (int i = 0; i < (K / 8); i++) because:
  - a loop comparing to a constant rather than a variable to determine its end can be faster, and the constant 0 is especially nice;
  - there's no need for the compiler to determine that K/8 does not change in the loop and thus can be precomputed, nor need for something to store that intermediate quantity during the loop.

Note: this is borderline on-topic, since we are into programming, and even into the C language and micro-optimisations. However, these problems are recurrent in implementations of cryptography, speed often matters in crypto, and thus many (perhaps the most used) crypto implementations are in C or similar, or even lower-level languages.

Is there any cryptographic reason why padding is always mandatory as per the FIPS standard? Also, is that uint64_t type cast the compatibility problem in my expression for extracting the bytes out of size? – Aravind A, Apr 06 '20 at 20:55
@Vivekanand V: The reason for padding to a multiple of 64 bytes is that the overall construction of the hash is an iterated function processing 64 bytes of message at a time. Including the length in the last block has good cryptographic rationale, see this. The padding with 0b10000000 is believed redundant. I now further explain the issue for the expression extracting the bytes out of size; it is strictly limited to the 0xff << fragment, the rest works OK. – fgrieu, Apr 06 '20 at 22:01
Thank you! But now, in your opinion, which of these will be better optimised: (unsigned char) (size >> 8*i) or (size >> 8*i) % 256? – Aravind A, Apr 06 '20 at 22:43
@VivekanandV: I prefer the former, because 1) in the latter the type remains 64-bit and some compilers under some settings will rightly emit a warning. If it was not for that, we could write s[blocks * 64 - 1 - i] = size >> 8*i; 2) in the latter, some old compilers will actually emit code for a division. – fgrieu, Apr 07 '20 at 06:35
Thank you! Can you please tell me the intuition behind blocks = (len+72)/64, why 72? – Aravind A, Apr 07 '20 at 07:50
@VivekanandV: I explained how to compute the number of zero bytes to append more simply, documented two other potential issues with the code, moved all strictly performance-related points to note⁴, and added a few tips there. – fgrieu, Apr 07 '20 at 16:04
