How does this POLYVAL modular reduction algorithm work?

Question

I recently found the GitHub repository used to make the measurements in the AES-GCM-SIV paper where they implement polynomial hashing using POLYVAL.

This means in this context to compute the usual $\tau=\sum_{i=0}^nm_iH^{n-i}$ for 16-byte message blocks $m_i$ and a 128-bit key $H$ in the usual way as $H_{i+1}=(m_i+H_i)\cdot H$ with messages and $H$ being interpreted as polynomials over $\mathbb F_2[x]/(x^{128}+x^{127}+x^{126}+x^{121}+1)$.

Now the actual computation uses x86 hardware intrinsics (which can be looked up here).
The code in question is Polyval_Horner (polyval.c, line 137) in particular the following extract (adopted and commented from the repo):

__m128i TMP0, TMP1, TMP2, TMP3, TMP4, T, POLY, H;
H = _mm_loadu_si128(((__m128i*)pH));
T = _mm_loadu_si128(((__m128i*)TAG));
// ordering of the inputs is reversed, last is most significant
// 0xc2000000 corresponds to the top 3 POLYVAL coefficients
POLY = _mm_setr_epi32(0x1,0,0,0xc2000000);

T = _mm_xor_si128(T, _mm_loadu_si128((__m128i*)inp));
// This instruction takes two 64-bit halves and carrylessly multiplies them
// If the lower nibble is 0, take the lower half of the first input, else the upper half
// likewise with the upper nibble for the second input
TMP1 = _mm_clmulepi64_si128(T, H, 0x00);
TMP4 = _mm_clmulepi64_si128(T, H, 0x11);
TMP2 = _mm_clmulepi64_si128(T, H, 0x10);
TMP3 = _mm_clmulepi64_si128(T, H, 0x01);
// TMP2 and 3 contain the range of coefficients from 64 to 191, add them
TMP2 = _mm_xor_si128(TMP2, TMP3);
// now extract the upper and lower halves of these coefficients and add them
// into either TMP1 or 4 depending on whether they are the lower or the upper coefficients
TMP3 = _mm_slli_si128(TMP2, 8);
TMP2 = _mm_srli_si128(TMP2, 8);
TMP1 = _mm_xor_si128(TMP3, TMP1);
TMP4 = _mm_xor_si128(TMP4, TMP2);
// reduction starts here
// multiply the lower half of TMP1 with the upper half of POLY
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
// This re-orders the 32-bit subwords
// 78 should exactly swap the 64-bit halves
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10);
TMP3 = _mm_shuffle_epi32(TMP1, 78);
TMP1 = _mm_xor_si128(TMP3, TMP2);
T = _mm_xor_si128(TMP4, TMP1);

(I have not converted to pseudo-code as to preserve any specific instruction behavior I may get wrong). This code loads a key pH and a previous iteration TAG adds it with some 16-byte input inpand multiplies it with pH in a carryless way and then reduces it modulo the Polyval polynomial into the new value of T.

My reading of the above code is that

Before the reduction TMP4 holds the 128 most-significant polynomial coefficients of the multiplication result
Before the reduction TMP1 holds the 128 least-significant polynomial coefficient of the multiplication result

Now my question is:

How does this reduction algorithm work?

Because for me, if I try it on paper with $x^{127}$ and $x$ I should get back $x^{127}+x^{126}+x^{121}+1$ but instead I think the algorithm returns me $1$.

Here's my interpretation of how to read the reduction intrinsics:

Take the lower 64-bit of the lower 128-bit of the multiplication result, multiply them with the upper 64-bit of the polynomial (in my example that's $0$ times the upper bits), call it TMP2
Swap the 64-bit halves of the original 128-bit result, call it TMP3 ($0$ in my example because TMP1 would be 0)
Add TMP2 and TMP3, call the result TMP1 ($0+0$)
Repeat the previous three steps once
Return the addition of the current TMP1 and TMP4 (the upper 128-bit) which in my case would be $0+1=1$

Note: There's precedent for a question asking how a given algorithm works: For BearSSL. And I could also write this algorithm down (only) in pseudo-code but I might make an error in translation which would break the question. — SEJPM, Jun 11 '20 at 15:25

score 1 · Accepted Answer · answered Jul 07 '21 at 19:09

So, I finally figured this out. The key references to this are the slide-deck by Shay Gueron from RWC'13, the Handbook of Applied Cryptography (Chapter 14 PDF) and RFC 8452.

As it turns out, the core operation of POLYVAL is in fact not $a\cdot b\bmod p$ but rather instead $a\cdot b\cdot x^{-128}\bmod p$ (as specified in RFC 8452). To do this, the algorithm from the question computes a montgomery reduction of a schoolbook carryless multiplication which stores the high 128 bit of the result in TMP4 and the low 128 bit in TMP1. What follows is then a Montgomery reduction (algorithm 14.23), the parameters for the reduction as described in the Handbook are as follows:

The modulus $m$ is $m(x)=x^{128}+x^{64}p(x)+1$ with $p(x)=x^{63}+x^{62}+x^{57}$ which corresponds to the constant 0xc2... from the code.
The input $X$ is $X_3,X_2$ from TMP4 and $X_1,X_0$ from TMP0.
The base being used is $b=x^{64}$ as PCLMULQDQ offers 64x64 bit carryless multiplications.
The inverse modulus $m'=-m^{-1}\bmod b= (x^{64}+p(x))x^{64}+1 \bmod x^{64}=1$
From the above the number of words being used in the base for the result is $n=2$.
The Montgomery value $R$ is $R=x^{128}$.

The algorithm in the handbook then asks to compute the result $A=x^{-128}(X_3x^{192}+X_2x^{128}+X_1x^{64}+X_0+X_0m+X_1mx^{64})$ once one has expanded all steps. Substituting $m$ and simplifying yields the expression $A=x^{-128}(X_3x^{192}+X_2x^{128}+X_0x^{128}+X_0x^{64}p(x)+X_1x^{128}p(x)+X_1x^{192})$ which yields $A=X_3x^{64}+X_2+X_1x^{64}+X_0+X_0x^{-64}p(x)+X_1p(x)$ where the only tricky sub-expression is $X_0x^{-64}p(x)$ which can be resolved first with splitting this 128-bit value in half as $\operatorname{high64}(X_0p(x))+\operatorname{low64}(X_0p(x))x^{-64}$ as the high 64 coefficients can be immediately multiplied. We then observe that $x^{-64}\equiv p(x)+x^{64}\pmod{x^{128}+x^{64}p(x)+1}$ yielding $\operatorname{low64}(X_0p(x))p(x)+\operatorname{low64}(X_0p(x))x^{64}$.

Following the description from the slides we get:

$A=X_0p(x)$ (matches the TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10); instruction)
$B=X_0x^{64}+X_1+X_0p(x)$ (matches the TMP3 = _mm_shuffle_epi32(TMP1, 78); TMP1 = _mm_xor_si128(TMP3, TMP2); instructions)
$C=X_1p(x)+\operatorname{low64}(X_0p(x))p(x)$ (matches the second TMP2 = _mm_clmulepi64_si128(TMP1, POLY, 0x10); instruction)
$D=X_1p(x)+\operatorname{low64}(X_0p(x))p(x)+X_0+X_1x^{64}+\operatorname{low64}(X_0p(x))x^{64}+\operatorname{high64}(X_0p(x))$ (matches the second TMP3 = _mm_shuffle_epi32(TMP1, 78); TMP1 = _mm_xor_si128(TMP3, TMP2); instructions)
$\text{Out}=X_3x^{192}+X_2x^{128}+X_1x^{64}+X_0+X_1p(x)+\operatorname{low64}(X_0p(x))p(x)+\operatorname{low64}(X_0p(x))x^{64}+\operatorname{high64}(X_0p(x))$ (matches the T = _mm_xor_si128(TMP4, TMP1); instruction)

The last value matches the fully expanded $A$ expression that we derived from the Montgomery reduction algorithm above.

score 0 · Answer 2 · answered Jul 06 '20 at 17:31

From the paper if you look at section 2.3 you will find an explanation of the reduction. The reduction from a code execution prospective works by reducing the number of operations of GHASH.

From a theoretical stand point the paper authors took an incorrect derivation direction of the first induction proof. However their mistake resulted in discovering a differential set linkage between POLYVAL and GHASH.

Where A ^ B ---> D0 and Z* is the number of checks to determine the value of D0. After completing running GHASH the function returns D*. From this the operator can say B ---> D0 and Z < Z*.

I suppose you are talking about the original GCM-SIV paper? So you're essentially saying that the reduction essentially just does GHASH reduction (it seems Gueron explored this in the past PDF?)? Please also note that you can use normal latex-mathmode on this side to help your points. — SEJPM, Jul 06 '20 at 20:26

How does this POLYVAL modular reduction algorithm work?

2 Answers2