6

I recently found the GitHub repository used to make the measurements in the AES-GCM-SIV paper where they implement polynomial hashing using POLYVAL.

This means in this context to compute the usual $\tau=\sum_{i=0}^nm_iH^{n-i}$ for 16-byte message blocks $m_i$ and a 128-bit key $H$ in the usual way as $H_{i+1}=(m_i+H_i)\cdot H$ with messages and $H$ being interpreted as polynomials over $\mathbb F_2[x]/(x^{128}+x^{127}+x^{126}+x^{121}+1)$.

Now the actual computation uses x86 hardware intrinsics (which can be looked up here).
The code in question is Polyval_Horner (polyval.c, line 137) in particular the following extract (adopted and commented from the repo):

__m128i TMP0, TMP1, TMP2, TMP3, TMP4, T, POLY, H;
H = _mm_loadu_si128(((__m128i*)pH));
T = _mm_loadu_si128(((__m128i*)TAG));
// ordering of the inputs is reversed, last is most significant
// 0xc2000000 corresponds to the top 3 POLYVAL coefficients
POLY = _mm_setr_epi32(0x1,0,0,0xc2000000);

T = _mm_xor_si128(T, _mm_loadu_si128((__m128i*)inp));
// This instruction takes two 64-bit halves and carrylessly multiplies them
// If the lower nibble is 0, take the lower half of the first input, else the upper half
// likewise with the upper nibble for the second input
TMP1 = _mm_clmulepi64_si128(T, H, 0x00);
TMP4 = _mm_clmulepi64_si128(T, H, 0x11);
TMP2 = _mm_clmulepi64_si128(T, H, 0x10);
TMP3 = _mm_clmulepi64_si128(T, H, 0x01);
// TMP2 and 3 contain the range of coefficients from 64 to 191, add them
TMP2 = _mm_xor_si128(TMP2, TMP3);
// now extract the upper and lower halves of these coefficients and add them
// into either TMP1 or 4 depending on whether they are the lower or the upper coefficients
TMP3 = _mm_slli_si128(TMP2, 8);
TMP2 = _mm_srli_si128(TMP2, 8);
TMP1 = _mm_xor_si128(TMP3, TMP1);
TMP4 = _mm_xor_si128(TMP4, TMP2);
// reduction starts here
// multiply the lower half of TMP1 with the upper half of POLY
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
// This re-orders the 32-bit subwords
// 78 should exactly swap the 64-bit halves
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
T = _mm_xor_si128(TMP4, TMP1);

(I have not converted to pseudo-code as to preserve any specific instruction behavior I may get wrong). This code loads a key pH and a previous iteration TAG adds it with some 16-byte input inpand multiplies it with pH in a carryless way and then reduces it modulo the Polyval polynomial into the new value of T.

My reading of the above code is that

  • Before the reduction TMP4 holds the 128 most-significant polynomial coefficients of the multiplication result
  • Before the reduction TMP1 holds the 128 least-significant polynomial coefficient of the multiplication result

Now my question is:

How does this reduction algorithm work?

Because for me, if I try it on paper with $x^{127}$ and $x$ I should get back $x^{127}+x^{126}+x^{121}+1$ but instead I think the algorithm returns me $1$.


Here's my interpretation of how to read the reduction intrinsics:

  1. Take the lower 64-bit of the lower 128-bit of the multiplication result, multiply them with the upper 64-bit of the polynomial (in my example that's $0$ times the upper bits), call it TMP2
  2. Swap the 64-bit halves of the original 128-bit result, call it TMP3 ($0$ in my example because TMP1 would be 0)
  3. Add TMP2 and TMP3, call the result TMP1 ($0+0$)
  4. Repeat the previous three steps once
  5. Return the addition of the current TMP1 and TMP4 (the upper 128-bit) which in my case would be $0+1=1$
SEJPM
  • 45,967
  • 7
  • 99
  • 205
  • Note: There's precedent for a question asking how a given algorithm works: For BearSSL. And I could also write this algorithm down (only) in pseudo-code but I might make an error in translation which would break the question. – SEJPM Jun 11 '20 at 15:25

2 Answers2

1

So, I finally figured this out. The key references to this are the slide-deck by Shay Gueron from RWC'13, the Handbook of Applied Cryptography (Chapter 14 PDF) and RFC 8452.

As it turns out, the core operation of POLYVAL is in fact not $a\cdot b\bmod p$ but rather instead $a\cdot b\cdot x^{-128}\bmod p$ (as specified in RFC 8452). To do this, the algorithm from the question computes a montgomery reduction of a schoolbook carryless multiplication which stores the high 128 bit of the result in TMP4 and the low 128 bit in TMP1. What follows is then a Montgomery reduction (algorithm 14.23), the parameters for the reduction as described in the Handbook are as follows:

  • The modulus $m$ is $m(x)=x^{128}+x^{64}p(x)+1$ with $p(x)=x^{63}+x^{62}+x^{57}$ which corresponds to the constant 0xc2... from the code.
  • The input $X$ is $X_3,X_2$ from TMP4 and $X_1,X_0$ from TMP0.
  • The base being used is $b=x^{64}$ as PCLMULQDQ offers 64x64 bit carryless multiplications.
  • The inverse modulus $m'=-m^{-1}\bmod b= (x^{64}+p(x))x^{64}+1 \bmod x^{64}=1$
  • From the above the number of words being used in the base for the result is $n=2$.
  • The Montgomery value $R$ is $R=x^{128}$.

The algorithm in the handbook then asks to compute the result $A=x^{-128}(X_3x^{192}+X_2x^{128}+X_1x^{64}+X_0+X_0m+X_1mx^{64})$ once one has expanded all steps. Substituting $m$ and simplifying yields the expression $A=x^{-128}(X_3x^{192}+X_2x^{128}+X_0x^{128}+X_0x^{64}p(x)+X_1x^{128}p(x)+X_1x^{192})$ which yields $A=X_3x^{64}+X_2+X_1x^{64}+X_0+X_0x^{-64}p(x)+X_1p(x)$ where the only tricky sub-expression is $X_0x^{-64}p(x)$ which can be resolved first with splitting this 128-bit value in half as $\operatorname{high64}(X_0p(x))+\operatorname{low64}(X_0p(x))x^{-64}$ as the high 64 coefficients can be immediately multiplied. We then observe that $x^{-64}\equiv p(x)+x^{64}\pmod{x^{128}+x^{64}p(x)+1}$ yielding $\operatorname{low64}(X_0p(x))p(x)+\operatorname{low64}(X_0p(x))x^{64}$.


Following the description from the slides we get:

  • $A=X_0p(x)$ (matches the TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10); instruction)
  • $B=X_0x^{64}+X_1+X_0p(x)$ (matches the TMP3 = _mm_shuffle_epi32(TMP1, 78); TMP1 = _mm_xor_si128(TMP3, TMP2); instructions)
  • $C=X_1p(x)+\operatorname{low64}(X_0p(x))p(x)$ (matches the second TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10); instruction)
  • $D=X_1p(x)+\operatorname{low64}(X_0p(x))p(x)+X_0+X_1x^{64}+\operatorname{low64}(X_0p(x))x^{64}+\operatorname{high64}(X_0p(x))$ (matches the second TMP3 = _mm_shuffle_epi32(TMP1, 78); TMP1 = _mm_xor_si128(TMP3, TMP2); instructions)
  • $\text{Out}=X_3x^{192}+X_2x^{128}+X_1x^{64}+X_0+X_1p(x)+\operatorname{low64}(X_0p(x))p(x)+\operatorname{low64}(X_0p(x))x^{64}+\operatorname{high64}(X_0p(x))$ (matches the T = _mm_xor_si128(TMP4, TMP1); instruction)

The last value matches the fully expanded $A$ expression that we derived from the Montgomery reduction algorithm above.

SEJPM
  • 45,967
  • 7
  • 99
  • 205
0

From the paper if you look at section 2.3 you will find an explanation of the reduction. The reduction from a code execution prospective works by reducing the number of operations of GHASH.

From a theoretical stand point the paper authors took an incorrect derivation direction of the first induction proof. However their mistake resulted in discovering a differential set linkage between POLYVAL and GHASH.

Where A ^ B ---> D0 and Z* is the number of checks to determine the value of D0. After completing running GHASH the function returns D*. From this the operator can say B ---> D0 and Z < Z*.