I recently found the GitHub repository used to make the measurements in the AES-GCM-SIV paper where they implement polynomial hashing using POLYVAL.
This means in this context to compute the usual $\tau=\sum_{i=0}^nm_iH^{n-i}$ for 16-byte message blocks $m_i$ and a 128-bit key $H$ in the usual way as $H_{i+1}=(m_i+H_i)\cdot H$ with messages and $H$ being interpreted as polynomials over $\mathbb F_2[x]/(x^{128}+x^{127}+x^{126}+x^{121}+1)$.
Now the actual computation uses x86 hardware intrinsics (which can be looked up here).
The code in question is Polyval_Horner
(polyval.c
, line 137) in particular the following extract (adopted and commented from the repo):
__m128i TMP0, TMP1, TMP2, TMP3, TMP4, T, POLY, H;
H = _mm_loadu_si128(((__m128i*)pH));
T = _mm_loadu_si128(((__m128i*)TAG));
// ordering of the inputs is reversed, last is most significant
// 0xc2000000 corresponds to the top 3 POLYVAL coefficients
POLY = _mm_setr_epi32(0x1,0,0,0xc2000000);
T = _mm_xor_si128(T, _mm_loadu_si128((__m128i*)inp));
// This instruction takes two 64-bit halves and carrylessly multiplies them
// If the lower nibble is 0, take the lower half of the first input, else the upper half
// likewise with the upper nibble for the second input
TMP1 = _mm_clmulepi64_si128(T, H, 0x00);
TMP4 = _mm_clmulepi64_si128(T, H, 0x11);
TMP2 = _mm_clmulepi64_si128(T, H, 0x10);
TMP3 = _mm_clmulepi64_si128(T, H, 0x01);
// TMP2 and 3 contain the range of coefficients from 64 to 191, add them
TMP2 = _mm_xor_si128(TMP2, TMP3);
// now extract the upper and lower halves of these coefficients and add them
// into either TMP1 or 4 depending on whether they are the lower or the upper coefficients
TMP3 = _mm_slli_si128(TMP2, 8);
TMP2 = _mm_srli_si128(TMP2, 8);
TMP1 = _mm_xor_si128(TMP3, TMP1);
TMP4 = _mm_xor_si128(TMP4, TMP2);
// reduction starts here
// multiply the lower half of TMP1 with the upper half of POLY
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
// This re-orders the 32-bit subwords
// 78 should exactly swap the 64-bit halves
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
T = _mm_xor_si128(TMP4, TMP1);
(I have not converted to pseudo-code as to preserve any specific instruction behavior I may get wrong).
This code loads a key pH
and a previous iteration TAG
adds it with some 16-byte input inp
and multiplies it with pH
in a carryless way and then reduces it modulo the Polyval polynomial into the new value of T
.
My reading of the above code is that
- Before the reduction
TMP4
holds the 128 most-significant polynomial coefficients of the multiplication result - Before the reduction
TMP1
holds the 128 least-significant polynomial coefficient of the multiplication result
Now my question is:
How does this reduction algorithm work?
Because for me, if I try it on paper with $x^{127}$ and $x$ I should get back $x^{127}+x^{126}+x^{121}+1$ but instead I think the algorithm returns me $1$.
Here's my interpretation of how to read the reduction intrinsics:
- Take the lower 64-bit of the lower 128-bit of the multiplication result, multiply them with the upper 64-bit of the polynomial (in my example that's $0$ times the upper bits), call it
TMP2
- Swap the 64-bit halves of the original 128-bit result, call it
TMP3
($0$ in my example becauseTMP1
would be 0) - Add
TMP2
andTMP3
, call the resultTMP1
($0+0$) - Repeat the previous three steps once
- Return the addition of the current
TMP1
andTMP4
(the upper 128-bit) which in my case would be $0+1=1$