The bit interleaving code looks like this:
{
    const UINT32 * pI = (const UINT32 *)in;
    UINT32 * pS = state;
    UINT32 t, x0, x1;
    int i;
    for (i = laneCount-1; i >= 0; --i)
    {
        /* Low 32 bits of the lane: move even-indexed bits to the low half, odd-indexed bits to the high half. */
        x0 = *(pI++);
        t = (x0 ^ (x0 >> 1)) & 0x22222222UL; x0 = x0 ^ t ^ (t << 1);
        t = (x0 ^ (x0 >> 2)) & 0x0C0C0C0CUL; x0 = x0 ^ t ^ (t << 2);
        t = (x0 ^ (x0 >> 4)) & 0x00F000F0UL; x0 = x0 ^ t ^ (t << 4);
        t = (x0 ^ (x0 >> 8)) & 0x0000FF00UL; x0 = x0 ^ t ^ (t << 8);
        /* High 32 bits of the lane: same sequence of swaps. */
        x1 = *(pI++);
        t = (x1 ^ (x1 >> 1)) & 0x22222222UL; x1 = x1 ^ t ^ (t << 1);
        t = (x1 ^ (x1 >> 2)) & 0x0C0C0C0CUL; x1 = x1 ^ t ^ (t << 2);
        t = (x1 ^ (x1 >> 4)) & 0x00F000F0UL; x1 = x1 ^ t ^ (t << 4);
        t = (x1 ^ (x1 >> 8)) & 0x0000FF00UL; x1 = x1 ^ t ^ (t << 8);
        /* XOR the even-bit word, then the odd-bit word, into the state. */
        *(pS++) ^= (x0 & 0x0000FFFF) | (x1 << 16);
        *(pS++) ^= (x0 >> 16) | (x1 & 0xFFFF0000);
    }
}
There is also the matching deinterleave code in the extract function.
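The extract function itself is not shown here, so the following is only a sketch of how the inverse can be written, assuming it mirrors the interleave: each mask-and-shift step is a swap that undoes itself, so the steps are applied in the opposite order. The helper names (delta_swap, split_even_odd, merge_even_odd, interleave_lane, deinterleave_lane) are made up for illustration and are not taken from the original code.

```c
#include <stdint.h>

/* Swap the bits selected by mask with the bits `shift` positions above
   them. Applying the same swap twice restores the input. */
static inline uint32_t delta_swap(uint32_t x, uint32_t mask, unsigned shift)
{
    uint32_t t = (x ^ (x >> shift)) & mask;
    return x ^ t ^ (t << shift);
}

/* Even-indexed bits go to the low 16 bits, odd-indexed bits to the high 16. */
static inline uint32_t split_even_odd(uint32_t x)
{
    x = delta_swap(x, 0x22222222u, 1);
    x = delta_swap(x, 0x0C0C0C0Cu, 2);
    x = delta_swap(x, 0x00F000F0u, 4);
    x = delta_swap(x, 0x0000FF00u, 8);
    return x;
}

/* Inverse of split_even_odd: the same swaps in reverse order. */
static inline uint32_t merge_even_odd(uint32_t x)
{
    x = delta_swap(x, 0x0000FF00u, 8);
    x = delta_swap(x, 0x00F000F0u, 4);
    x = delta_swap(x, 0x0C0C0C0Cu, 2);
    x = delta_swap(x, 0x22222222u, 1);
    return x;
}

/* (low, high) halves of a 64-bit lane -> (even, odd) interleaved words. */
void interleave_lane(uint32_t low, uint32_t high, uint32_t *even, uint32_t *odd)
{
    uint32_t x0 = split_even_odd(low);
    uint32_t x1 = split_even_odd(high);
    *even = (x0 & 0x0000FFFFu) | (x1 << 16);
    *odd  = (x0 >> 16) | (x1 & 0xFFFF0000u);
}

/* (even, odd) interleaved words -> (low, high) halves of the lane. */
void deinterleave_lane(uint32_t even, uint32_t odd, uint32_t *low, uint32_t *high)
{
    uint32_t x0 = (even & 0x0000FFFFu) | (odd << 16);
    uint32_t x1 = (even >> 16) | (odd & 0xFFFF0000u);
    *low  = merge_even_odd(x0);
    *high = merge_even_odd(x1);
}
```

A quick sanity check is to interleave a lane and deinterleave it again; the original low and high words should come back unchanged.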
I also wrote a Keccak/SHA3 class using that code (or something similar) as a base. I did not like how it looked, so I ended up writing my own interleave. Compiled, it was 1.45x faster than my implementation of the interleave code above (measuring only the interleave, not the whole hash), so there are faster ways to do it.
I also found that during development and testing it is easier to interleave/deinterleave the entire state whenever you absorb or extract, because you can then view the intermediate values of the state working variables and compare them with a reference implementation.
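As a rough illustration of that debugging approach, the sketch below deinterleaves all 25 lanes into plain 64-bit values and prints them. It assumes the state is stored as 50 32-bit words with the even-bit word of each lane followed by its odd-bit word, as in the loop above. The function names are hypothetical, and the bit-by-bit loop is deliberately slow and obvious rather than the fast swap sequence, since it is only meant for checking.

```c
#include <stdint.h>
#include <stdio.h>

/* Rebuild one 64-bit lane from its interleaved halves, one bit at a time:
   bit i of `even` becomes lane bit 2*i, bit i of `odd` becomes lane bit 2*i+1. */
uint64_t lane_from_interleaved(uint32_t even, uint32_t odd)
{
    uint64_t lane = 0;
    for (int i = 0; i < 32; ++i) {
        lane |= (uint64_t)((even >> i) & 1u) << (2 * i);
        lane |= (uint64_t)((odd  >> i) & 1u) << (2 * i + 1);
    }
    return lane;
}

/* Deinterleave the whole state (assumed layout: even word, then odd word). */
void deinterleave_state(const uint32_t state[50], uint64_t lanes[25])
{
    for (int i = 0; i < 25; ++i)
        lanes[i] = lane_from_interleaved(state[2 * i], state[2 * i + 1]);
}

/* Print the state as a 5x5 grid of 64-bit lanes in hexadecimal. */
void dump_state(const uint32_t state[50])
{
    uint64_t lanes[25];
    deinterleave_state(state, lanes);
    for (int i = 0; i < 25; ++i)
        printf("%016llX%c", (unsigned long long)lanes[i],
               (i % 5 == 4) ? '\n' : ' ');
}
```

Calling dump_state before and after each round (or each step of a round) gives lane values that can be compared with the intermediate values printed by a non-interleaved implementation.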