I could not find any mention on the Internet of a proven or known cryptographically secure keyed rolling hash function (i.e. a rolling MAC). Has the question been studied, and is it possible to build one?
By cryptographically secure I mean properties equivalent to those of HMAC with a cryptographic hash function:
- without knowing the key, knowing the hash value does not leak information about the data,
- knowing or choosing some data does not allow recovering the key more easily than by brute force.
By rolling hash, I mean that it can be computed efficiently and iteratively over a window sliding byte by byte, i.e. that there exists an update operation $f$ of the form $H(c_{n-k+1} \ldots c_{n}) = f\left[c_n, c_{n-k}, H(c_{n-k} \ldots c_{n-1})\right]$, where the $c_i$ are the data characters/bytes and $k$ is the size of the sliding window.
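To make the definition concrete, here is a minimal sketch of such an update operation $f$ for a plain (unkeyed, insecure) polynomial hash. The constants `BASE` and `MOD` and all function names are illustrative assumptions and are not taken from any of the programs discussed here.

```python
# Illustrative parameters (assumptions, not taken from any existing tool).
BASE = 257
MOD = (1 << 61) - 1


def direct_hash(window: bytes) -> int:
    """Hash a whole window from scratch: sum of c_i * BASE^(k-i) mod MOD."""
    h = 0
    for c in window:
        h = (h * BASE + c) % MOD
    return h


def make_update(k: int):
    """Build the update operation f for a window of k bytes."""
    top = pow(BASE, k - 1, MOD)  # weight of the byte that leaves the window

    def f(c_in: int, c_out: int, h_prev: int) -> int:
        # Remove the outgoing byte, shift the window, append the incoming byte.
        return ((h_prev - c_out * top) * BASE + c_in) % MOD

    return f


def rolling_hashes(data: bytes, k: int):
    """Yield the hash of every k-byte window, in constant time per step."""
    if len(data) < k:
        return
    f = make_update(k)
    h = direct_hash(data[:k])
    yield h
    for n in range(k, len(data)):
        h = f(data[n], data[n - k], h)
        yield h
```

Each step costs a constant number of operations regardless of $k$, which is the efficiency property asked for; nothing in this sketch is keyed or secret.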
To be more specific about the application, we want to build a secure chunking algorithm for data deduplication, which produces variable-size chunks by splitting data at content-dependent points, and whose cut points should not leak information about the data. Existing programs use solutions based on common rolling hash functions with custom obfuscation, which do not seem to have been well analyzed, for instance:
- rsyncrypto uses the same function as gzip --rsyncable, which is simply the sum of the bytes (obviously weak).
- Attic uses a cyclic polynomial (Buzhash) with a random secret substitution of the input data bytes (a substitution from bytes to 32-bit integers, currently obtained from a public table XORed with a secret 32-bit seed, but which the author plans to obtain instead by AES-CTR encryption of a zeroed table; this substitution corresponds to the function $h$ in the Wikipedia page).
- Tarsnap uses a Rabin-Karp hash with a random secret substitution of the input data bytes and random secret parameters (a substitution from bytes to 32-bit integers using a table whose values are HMACs of the indexes, and the parameters $a$ and $n$ of the Wikipedia page also depend on an HMAC of some fixed data; in addition, the window size of the rolling hash is variable and only known to lie within a range).
Are there reasons to believe that some would be more secure than others, in particular Rabin-Karp vs Buzhash? Are they known to be preimage resistant?
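As a rough illustration of the keyed-substitution idea behind these tools, here is a sketch of Buzhash over 32-bit words with a secret substitution table. Deriving the table from a truncated HMAC-SHA256 of each byte value is my assumption for the example (loosely inspired by the Tarsnap description above); it is not the actual code of Attic or Tarsnap, and the function names are hypothetical.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


def secret_table(key: bytes) -> list:
    """Hypothetical derivation: entry i is a truncated HMAC-SHA256 of byte i."""
    return [
        int.from_bytes(hmac.new(key, bytes([i]), hashlib.sha256).digest()[:4], "big")
        for i in range(256)
    ]


def buzhash(window: bytes, table: list) -> int:
    """Cyclic polynomial hash of a whole window, computed from scratch."""
    h = 0
    for c in window:
        h = rotl32(h, 1) ^ table[c]
    return h


def buzhash_roll(h: int, c_out: int, c_in: int, k: int, table: list) -> int:
    """Slide a k-byte window by one byte: drop c_out, append c_in."""
    return rotl32(h, 1) ^ rotl32(table[c_out], k) ^ table[c_in]
```

The only secret here is the table, so the question of how much the outputs reveal about that table (and therefore about the data) is exactly what remains unanalyzed.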
Note: The chunking application has the properties that little ciphered data is disclosed (only one hash every time the cut decision is taken), and that only part of the ciphered data is disclosed (the cut decision is usually taken if the last bits of the hash are zero or equal to some value, so only these last bits are disclosed).
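The cut rule described in this note can be sketched as follows. This only illustrates the decision "cut when the low $m$ bits of the window hash are zero"; the underlying hash is an unkeyed polynomial hash chosen for brevity, so it is not secure, and the parameter defaults and function names are assumptions.

```python
def chunk_boundaries(data: bytes, k: int = 16, m: int = 6,
                     base: int = 257, mod: int = (1 << 31) - 1) -> list:
    """Return chunk end offsets; cut where the low m bits of the hash are 0."""
    mask = (1 << m) - 1
    top = pow(base, k - 1, mod)  # weight of the byte leaving the window
    cuts = []
    h = 0
    for n, c in enumerate(data):
        if n >= k:
            h = (h - data[n - k] * top) % mod  # drop the outgoing byte
        h = (h * base + c) % mod               # append the incoming byte
        if n + 1 >= k and (h & mask) == 0:
            cuts.append(n + 1)                 # chunk ends just after byte n
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))                 # last chunk ends with the data
    return cuts


def split_chunks(data: bytes, cuts: list) -> list:
    """Split data at the offsets returned by chunk_boundaries."""
    starts = [0] + cuts[:-1]
    return [data[s:e] for s, e in zip(starts, cuts)]
```

Because each decision depends only on the last $k$ bytes, inserting a byte at the front of the data shifts the later cut points by one instead of changing them, which is what makes deduplication work. It is also why an observer who sees the chunk lengths learns that the low $m$ bits were zero at each cut.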
Performance requirements
For the deduplication application, we would like the processing speed of the hash, in the worst case and on an average machine, to be at least equal to the disk read throughput (say around 60MB/s), so that it does not become the bottleneck. Ideally it should be as fast as possible while preserving sufficient security guarantees.
[AES] Implementing D.W.'s answer with buzhash and AES-128 shows that on a high-end modern CPU (Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz) the secure rolling hash runs at 9MB/s (corresponding to raw AES at 140MB/s). With AES-NI enabled through the EVP API of OpenSSL, performance peaks at 56MB/s (corresponding to raw AES at 900MB/s), or even at 150MB/s if computed in batches in ECB mode (the computations seem to be parallelized). However, we cannot yet assume that an average machine has such instructions: for instance, with an older CPU (Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz) performance is limited to 4.4MB/s, and with a low-end modern CPU (Intel(R) Atom(TM) CPU N2800 @ 1.86GHz) to 1.5MB/s, which is not acceptable.
[SipHash] According to D.W.'s answer, the requirement for $E$ is that it be a PRF or a PRP with a large domain, so SipHash for instance, which is claimed to be a PRF, should be OK if I'm not mistaken. The secure rolling hash built with it runs at 57MB/s without using any particular instruction set, or even at 97MB/s if specialized for 4-byte inputs. However, this is still way below the 420MB/s obtained with buzhash alone, and on the older CPU mentioned before performance drops to 12MB/s.
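For readers who want to see the shape of the construction being benchmarked, here is a sketch of a keyed function $E$ applied to the output of an insecure rolling hash $R$. Keyed BLAKE2b from the Python standard library is used purely as a stand-in for $E$ (the measurements above used AES-128 and SipHash, not BLAKE2b), and the public table derived from SHA-256 is likewise an assumption for the example.

```python
import hashlib

MASK32 = 0xFFFFFFFF

# Public (non-secret) Buzhash table; this derivation is illustrative only.
PUBLIC_TABLE = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
    for i in range(256)
]


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    r %= 32
    return ((x << r) | (x >> (32 - r))) & MASK32


def prf(key: bytes, value: int) -> int:
    """Stand-in for E: keyed BLAKE2b of the 32-bit value, truncated to 64 bits."""
    digest = hashlib.blake2b(value.to_bytes(4, "little"),
                             digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def secure_rolling_hashes(data: bytes, k: int, key: bytes):
    """Yield E_key(R(window)) for every k-byte window of data."""
    if len(data) < k:
        return
    r = 0
    for c in data[:k]:
        r = rotl32(r, 1) ^ PUBLIC_TABLE[c]
    yield prf(key, r)
    for n in range(k, len(data)):
        # Roll the insecure hash R, then pass its output through the keyed E.
        r = rotl32(r, 1) ^ rotl32(PUBLIC_TABLE[data[n - k]], k) ^ PUBLIC_TABLE[data[n]]
        yield prf(key, r)
```

One call to $E$ is made per input byte, which is why the throughput of the whole scheme is bounded by the speed of $E$ on very short inputs.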
[CBC-MAC-AES] Now assume the length of the insecure rolling hash $R$ is $n$ bits (e.g. $n=32$), and we only need a secure rolling hash of length $m<n$ (e.g. $m=16$), because we only want to test whether the last $m$ bits are zero. AES encryption of $m$-bit blocks (padded to 128 bits) is a PRF, which can be precomputed in a table if $m$ is small enough, and whose domain can be extended to $n$ bits using CBC-MAC to generate a PRF (see introduction of this paper).
Unfortunately, after more investigation, the dictionary of the $m$-bit block cipher can be cracked in $O(2^m)$ under a CPA with CBC-MAC (by finding 3 particular collisions). In our particular context, where only collisions with the value 0 are revealed, this is not possible, and the best distinguisher requires $O(2^m)$ operations, equivalent to the birthday bound for an $n$-bit block cipher, so this scheme would be as secure as an $n$-bit block cipher. However, $n=32$ is also too small and is not secure, and if $n$ is increased the best distinguisher still requires only $O(2^m)$ operations (basically because when testing all the values of one $m$-bit word, this scheme generates exactly one collision with 0, whereas a random function could generate 0, 1, 2, ...), so this scheme with CBC-MAC is not IND-CPA with $m=16$ (it would be with $m\geq64$ or $128$).
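The "exactly one collision with 0" observation can be checked on a toy instance. In this sketch the secret $m$-bit block cipher is modelled as a table holding a keyed permutation of $0 \ldots 2^m-1$, built by sorting indexes by a keyed BLAKE2b tag; that modelling choice and all names are my assumptions for the example, not the AES-based table described above.

```python
import hashlib


def keyed_permutation(key: bytes, m: int) -> list:
    """Hypothetical stand-in for a secret m-bit block cipher, stored as a table."""
    def tag(i: int) -> bytes:
        return hashlib.blake2b(i.to_bytes(4, "little"),
                               digest_size=16, key=key).digest()
    # Sorting the indexes by a keyed tag yields a key-dependent permutation.
    return sorted(range(1 << m), key=tag)


def cbc_mac(table: list, blocks: list) -> int:
    """CBC-MAC over m-bit blocks: state = table[state XOR block]."""
    state = 0
    for b in blocks:
        state = table[state ^ b]
    return state


def zeros_when_varying_last_block(table: list, first_block: int) -> int:
    """Count the last-block values for which the MAC of two blocks equals 0."""
    return sum(
        1 for x in range(len(table))
        if cbc_mac(table, [first_block, x]) == 0
    )
```

For any fixed first block, the chaining value XORed with the last block runs over every table index exactly once, so the count is always 1. A random function from $n$ bits to $m$ bits would give a count that varies around 1, and that difference is what the $O(2^m)$ distinguisher exploits.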