The standard method for doing multiplication (and multiplicative inverses) in $\operatorname{GF}(2^8)$ is using a log and antilog table. Each table takes up only 255 bytes; hence it is much smaller than a full $256 \times 256$ multiplication table, and it is much faster than the multiplication procedure you give above.
To create such tables, we need to pick a generator $g$; that is a field element such that $g^i$ is all 255 nonzero elements for $0 \le i < 255$ (where $g^i$ is what you would expect; $g$ multiplied by itself $i$ time). Such an element always exists, and (IIRC) $g=3$ happens to be a generator in the field representation you are using.
Now, the antilog table is defined as:
$${\rm antilog}(i) = g^i$$
This table has 255 elements, and can be easily built using your multiplication procedure. You may want to extend it farther (as discussed below); it just continues in the obvious way (and it just repeats itself).
The log table is defined as its inverse, that is:
$$\log(\operatorname{antilog}(i)) = i$$
This table also has 255 elements (but is indexed from 1 to 255), and can be built using the antilog table we already have, or just initialized at the same time:
uint8_t log_table[256], antilog[255];
const uint8_t g = 3;
void init_log_table(void)
{
log_table[0] = 0; /* dummy value */
for (int i = 0, x = 1; i < 255; x = gmul(x, g), i++) {
log_table[x] = i;
antilog[i] = x;
}
}
Once we have those two, we can do multiplication by:
uint8_t gmul_table(uint8_t a, uint8_t b)
{
if (a == 0 || b == 0) return 0;
uint8_t x = log_table[a];
uint8_t y = log_table[b];
uint8_t log_mult = (x + y) % 255;
return antilog[log_mult];
}
This works because, if $a = g^x$ and $b = g^y$, then $a \times b = g^x \times g^y = g^{x+y} = g^{(x+y) \bmod 255}$
As you can see, that should be considerably more efficient than your original algorithm; you can omit the % 255
operation by extending the antilog table for 255 more elements.
We can also use those tables to compute multiplicative inverses:
uint8_t ginv_table(uint8_t a)
{
if (a == 0) return 0; /* as needed in the context of AES */
uint8_t x = log_table[a]; /* x is in range [0..255] */
uint8_t log_inv = 255 - x; /* log_inv is in range [0..255] */
return antilog[log_inv];
}
You can also compute inverses using the Extended Euclidean Algorithm (which can be adapted to work in $\operatorname{GF}(2^8)$; however it's considerably more work than two table lookups.
Here is how that algorithm would look like:
uint8_t ginv(uint8_t x)
{
uint16_t u1 = 0, u3 = 0x11b, v1 = 1, v3 = x;
while (v3 != 0) {
uint16_t t1 = u1, t3 = u3;
int8_t q = bitlength(u3) - bitlength(v3);
if (q >= 0) {
t1 ^= v1 << q;
t3 ^= v3 << q;
}
u1 = v1; u3 = v3;
v1 = t1; v3 = t3;
}
if (u1 >= 0x100) u1 ^= 0x11b;
return u1;
}
where bitlength(x)
returns the position of the most significant 1 bit of x
(i.e. the smallest number y
such that (1 << y) - 1 >= x
).
q = numbits(z)
, defined as the number of bits set inz
(for example,q=1
forz=8
); can you explain? – fgrieu Sep 22 '16 at 06:44v3<<q
cancels out the msbit ofu3
. x is u3 will all the bits to the right of the msbit set (so if u3 = 0x12, then x = 0x1f); similarly y is v3 with all the bits to the right of the msbit set. Because of this, z will consist of those bits where there's a bit set to the left in u3, but not in v3. Hence, the number of such bits is the number of bits we need to shift v3 left. – poncho Sep 22 '16 at 14:39% 255
ingmul_table()
with the "digital root"log_mult = (log_mult >> 8) + (log_mult & 0xFF)
. It won't give exactly the same results as% 255
, since it won't map 255 to 0 (but it will correctly map 256 to 1, 257 to 2, etc.), but that's easily handled by adding a single extra entry to theantilog[]
table. Basically, compared to just extending the table, it saves 1/4 kB at the cost of 1-3 simple operations (shift, mask & add, or possibly just a single 8-bit add-with-carry). – Ilmari Karonen Sep 23 '16 at 18:27ginv()
code to use abitlength()
function/macro instead, hopefully making the actual algorithm clearer (and probably faster too, at least on platforms that have a built-in bit length / leading zero count instruction). And yes, I checked that it still works. – Ilmari Karonen Sep 24 '16 at 20:05bitlength(x)
would be to use#define bitlength(x) (32 - __builtin_clz(x))
, if $x > 0$ .... – Aravind A Jun 24 '20 at 09:02