2

I have a scenario similar to the one described in the Wikipedia article on hash lists, but with a twist. I'm looking for a cryptographically secure hash function that would produce the same root hash for the same file, no matter how the file is chopped up for the individual hashes in the list.

E.g. case 1: the file is divided into 3 parts; the hash list consists of the hashes of the 3 parts; the root hash is computed from those 3 hashes. Case 2: the same file is divided into 2 parts; the hash list consists of the hashes of the 2 parts; the root hash is computed from those 2 hashes. Since it is the same file, I want the root hash to be the same.

Is this doable (maybe with some restrictions on number and size of file parts)?

[Edit] Specific use case: My system stores files for users. Large files are usually sent / stored in smaller chunks (currently I don't control how the files are split up into chunks). Each chunk is encrypted beforehand by the client, but is accompanied by a hash of the unencrypted content. I would now like to know whether two users have uploaded the same file (as this allows me to do some optimization) without having to know the content of the file. So if I could compute a "hash" of the whole file from the individual chunk hashes, I could easily achieve this.
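To make the twist concrete: a plain hash list does not have this property. A minimal Python sketch (names illustrative) showing that two chunkings of the same file yield different hash-list roots:

```python
import hashlib

def naive_root(chunks):
    # Plain hash list: the root is the hash of the concatenated chunk hashes.
    hash_list = [hashlib.sha256(c).digest() for c in chunks]
    return hashlib.sha256(b"".join(hash_list)).hexdigest()

data = b"x" * 3000
print(naive_root([data[:1500], data[1500:]]))                   # two parts
print(naive_root([data[:1000], data[1000:2000], data[2000:]]))  # three parts: different root
```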

David Cary
user12889
  • Maybe I'm wrong, but you're probably better off creating a hash value for the whole file first, before you separate and hash the individual parts. Kind of like a root hash. That way, no matter how many times you separate it, it will still have the same hash value when you reconstruct it. I believe the whole purpose of hashing is to prevent the scenario that you depict in your examples. –  Aug 23 '11 at 08:10
  • Why do you want to know if two users upload the same file if you don't know the content of the file? You can't exactly give the ciphertext from one user back to another without also giving him the decryption key. – Marsh Ray Aug 24 '11 at 04:18
  • @Marsh Good point, and I have to see how effective it will be in real life. In my application I think there is some hope: The system is providing a secure space for collaboration. The keys are being managed centrally (but independently of the storage). So, within a collaborating group it is possible to give one user a reference to data and keys of another user. – user12889 Aug 25 '11 at 01:09
  • I don't understand the problem statement. Do you control the hashing code on the client? If yes, why not just ask the client to send the hash of the entire file? If no, none of the solutions proposed here are going to work (because they all require modifying the way that the client computes hashes). – D.W. Aug 28 '11 at 08:00

4 Answers

6

A hash tree is meant for exactly that. A binary tree seems a good fit. I'll restrict the description to something directly derived from SHA-256 (256-bit output, 512 bits hashed per round).

  1. A parameter n>0 is selected, defining a "superblock" size of n*512 bits. Say 8192 bits (1 kB, for n=16); n=1 works, but a higher value improves computing efficiency markedly.

  2. The file is padded as in SHA-256, and conceptually organized into m superblocks of that size (the last superblock might be shorter, but its size is a multiple of 512 bits; the padding might be in the last superblock, or span the last two superblocks).

  3. The file is chopped into segments of consecutive superblocks, with segment boundaries falling on superblock boundaries. Each computation point is assigned a segment, which it receives (or generates and pads).

  4. Each computation point separately hashes each superblock in its assigned segment, using SHA-256 without its padding step (the padding was already applied at step 2). Each superblock requires n hash rounds (except the last, which may require fewer). Most of the rounds are performed here (exactly as many as for SHA-256 of the whole file), in a distributed manner.

    All the hashes obtained at step 4 form the m leaves on top of a single overall binary tree, with a 256-bit hash at each node and a structure independent of how the file was chopped. Each of the m-1 non-leaf nodes of the tree will be the hash, obtained with one round, of the 512 bits in the two nodes linked on its top-left and top-right. The bottom node of the tree will be the final hash. It will be independent of how the file was chopped at step 3, because the computation is performed according to the same tree regardless of the chopping.

    [drawing of tree needed]

    All branches in the tree join adjacent levels, except on the right, where the (j+1)-th level from the top is skipped when the j-th lower-order bit of m-1 is 0.

  5. The computation point responsible for a segment computes the hashes for the nodes that have all their leaves assigned to that segment, and must keep the others. This needs precise organization [and is left as an exercise to the reader; my earlier description was flawed].

  6. Each computation point returns its partial result. That will include at most Ceil(Log2(k+1)) 256-bit hashes, for the partial hash of k superblocks. If the communication is centralized, the central point finishes the calculation, re-hashing according to the same binary tree as necessary. With decentralized communication, it is advantageous to aggregate partial results with a peer handling the segment of the file just before or after, which might allow slightly more of the work to be performed in a distributed fashion.

Note: computations can be interleaved to reduce storage requirements. There is a total of m-1 extra rounds in steps 5 and 6, an overhead of about 1/n, distributed in part.
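As a rough illustration of steps 2 to 4, here is a minimal Python sketch. It is not a full implementation: it ignores SHA-256's internal padding details, fixes the superblock size at 1 kB as in the n=16 example, and promotes an unpaired node unchanged rather than using the exact right-edge rule described above.

```python
import hashlib

SUPERBLOCK = 1024  # bytes; the 1 kB (n = 16) example from step 1

def superblock_hashes(data):
    # Step 4: one SHA-256 leaf per fixed-size superblock. The leaf sequence
    # depends only on the file and the superblock size, never on the chopping.
    return [hashlib.sha256(data[i:i + SUPERBLOCK]).digest()
            for i in range(0, len(data), SUPERBLOCK)]

def merkle_root(leaves):
    # Reduce pairwise, one round per non-leaf node; an unpaired node is
    # promoted unchanged (a simpler convention than the answer's rule).
    level = leaves
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

data = bytes(range(256)) * 20  # 5120 bytes = 5 superblocks
# Chop into segments on superblock boundaries, hash each segment separately:
leaves_a = superblock_hashes(data[:2048]) + superblock_hashes(data[2048:])
leaves_b = superblock_hashes(data[:3072]) + superblock_hashes(data[3072:])
assert merkle_root(leaves_a) == merkle_root(leaves_b) == merkle_root(superblock_hashes(data))
```

The assert holds because any chopping on superblock boundaries produces the same leaf sequence, and the tree shape is a function of the leaf count alone.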

fgrieu
  • This is a nice description of distributed tree hashing, but I fail to see how this helps with "independently of how we chop up the file, we'll get the same result" -- you are chopping it up the same way each time, aren't you? – Paŭlo Ebermann Aug 23 '11 at 10:54
  • You can populate the tree any way you want. So long as the final tree is the same in all cases, the hashed output will be the same. This generally solves the real problem people have when they ask the question asked here. Skein has a specific mode for this. – David Schwartz Aug 23 '11 at 11:20
  • Okay, now it looks better. Some details: why do you do 2^n hashes in step 3? I would think n should be enough. Why do you have this special-casing for the last block? – Paŭlo Ebermann Aug 23 '11 at 11:28
  • Thanks for the very detailed answer. I'll have to do some thinking if I can apply it to my use case. I'll add the use case to the questions as well to make it more specific. – user12889 Aug 23 '11 at 23:26
  • Look at the description in the Skein specification. It does not show distribution to multiple "parties", but multiple processors. – Paŭlo Ebermann Aug 24 '11 at 00:02
  • @Paŭlo Ebermann : 2^n was a leftover of the original version of my answer, where the superblock size was 512·2^n bits rather than 512·n as now. I have now started a rewrite of the whole thing, as the earlier description was wrong. This is tricky, and I wish I could just link to a reference (thanks for pointing out Skein), or leave it to someone else to write the answer! – fgrieu Aug 24 '11 at 00:08
  • I don't think this answer fully solves the problem of ensuring that you can compute the root hash from the segment hashes, regardless of how the file is partitioned into segments. In this scheme, it's not enough to retain a single hash of the segment: one has to retain O(log k) hashes. This has performance implications, and may also have security implications, potentially making it easier to guess the contents of part of a segment given the hash information for that segment. – D.W. Aug 28 '11 at 07:56
  • @D.W. Good points. In the best case -- a segment of length 2^n blocks that starts and ends on a boundary that is a multiple of 2^n blocks -- the user only needs to send 1 hash with (the encrypted version of) that segment to the server. In the worst case -- a segment of length (2^n - 1) blocks -- the user needs to send O(log k) hashes, as you said, but the server doesn't need to "retain" any of them: once it has all the hashes for all parts of the file, the server can calculate the one root hash and then discard all the other hashes. – David Cary Aug 30 '11 at 10:58
6

The property you want is inconsistent with the definition of a cryptographically secure hash function.

If $\mathcal{H}'(\mathcal{H}(\mathrm{half}_1),\mathcal{H}(\mathrm{half}_2)) = \mathcal{H}'(\mathcal{H}(\mathrm{third}_1),\mathcal{H}(\mathrm{third}_2),\mathcal{H}(\mathrm{third}_3))$, finding second preimages is as trivial as repartitioning the original message. If you consider either $\mathcal{H}$ or $\mathcal{H'}$ (or both) as random oracles, that may assist in seeing why the scheme won't work.

Note: in a hash tree, messages are processed identically regardless of partitioning (halves would be processed the same as quarters), so it is related, but not on target for your use case. If the depth of the tree were agreed upon in advance, then the partitioning could be as well. In your use case, you're stuck with hashes of the partitions, and so you cannot accumulate them without the hash $\mathcal{H}'$ "knowing" something about the preimages of $\mathcal{H}$, which is against the security definition of $\mathcal{H}$.
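To see the last point concretely, here is a deliberately cheating Python sketch: the only way $\mathcal{H}'$ can be partition-independent over arbitrary chunk boundaries is to effectively recover the chunk preimages, for which the dictionary below stands in (names illustrative):

```python
import hashlib

preimages = {}  # H' can only be partition-independent if it can invert H ...

def H(chunk):
    d = hashlib.sha256(chunk).digest()
    preimages[d] = chunk          # ... so we cheat and record every preimage
    return d

def H_prime(hash_list):
    # "Knowing" the preimages, H' reconstructs the whole file and hashes it,
    # exactly what the security definition of H is supposed to rule out.
    whole = b"".join(preimages[h] for h in hash_list)
    return hashlib.sha256(whole).hexdigest()

f = b"some file contents, arbitrarily long"
assert H_prime([H(f[:10]), H(f[10:])]) == H_prime([H(f[:5]), H(f[5:20]), H(f[20:])])
```

Remove the cheat and the two hash lists are simply distinct inputs to $\mathcal{H}'$, which a secure hash must separate.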

PulpSpy
2

Not an answer to your question, but a security point: revealing plain hashes of unencrypted content is a security vulnerability. (Revealing HMACs or anything else with a secret component does not pose the same vulnerability, even if they are calculated from the unencrypted content.)

For example, consider the hash of a configuration file whose contents are mostly known but differ only in an 8-character password. It becomes really easy to iterate through all the possible passwords and check which one's hash corresponds to the hash that was sent. Or take the case of copyrighted content being stored there: if the plaintext hash is revealed, anybody can see whether you've stored that content or not.
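A minimal sketch of that attack (hypothetical template; the password is shortened to 4 lowercase characters so the demo runs in under a second, but the same loop works for 8):

```python
import hashlib
import string
from itertools import product

# Hypothetical config file whose contents are known except for the password:
TEMPLATE = b"db_host=10.0.0.1\ndb_user=admin\ndb_pass=%s\n"

def crack(revealed_hash, alphabet=string.ascii_lowercase, length=4):
    # Enumerate every candidate password and compare against the revealed hash.
    for candidate in product(alphabet, repeat=length):
        pw = "".join(candidate).encode()
        if hashlib.sha256(TEMPLATE % pw).digest() == revealed_hash:
            return pw
    return None

revealed = hashlib.sha256(TEMPLATE % b"hunt").digest()  # what the client leaked
print(crack(revealed))  # b'hunt'
```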

If the unencrypted content hashes are revealed only to the service provider (you?), this means that you can mount these types of attacks against the users -- as can anybody able to force your hand. This is something that the users would have to be aware of, and it somewhat undermines the benefit of the service provider not having access to the encryption key.

In general, deduplication between different users cannot be done securely -- it always leaks information, and the leak may be critical. Deduplication among a single user's own files leaks information as well, but this leakage is often small enough to be ignored entirely.

Nakedible
-1

I have a way, but it's really awful. You can individually hash the combination of each byte's position in the file and the content of that byte, and then XOR all the hashes together.

If you have some control over the block arrangement, you can do a bit better. For example, if the blocks are always aligned on at least 1 kB boundaries, you can individually hash each kilobyte, using its offset into the file as an HMAC key, then XOR all the block hashes together. Since XOR is associative and commutative, you can XOR within each chunk first and then XOR across chunks, allowing a chunk of any size to carry a fixed-length hash value that is easily combined with the others.
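A minimal sketch of this block variant, assuming 1 kB alignment and, per the comment thread below, folding a per-file secret into the key material (names illustrative):

```python
import hashlib
import hmac
import secrets

BLOCK = 1024  # blocks assumed aligned on 1 kB boundaries

def xor_root(key, data, base=0):
    # `base` is the byte offset of `data` within the whole file, so each
    # chunk can be hashed independently of how the file was split.
    acc = bytes(32)
    for off in range(0, len(data), BLOCK):
        h = hmac.new(key + (base + off).to_bytes(8, "big"),
                     data[off:off + BLOCK], hashlib.sha256).digest()
        acc = bytes(a ^ b for a, b in zip(acc, h))  # order-independent combine
    return acc

key = secrets.token_bytes(32)          # per-file secret (see comments below)
data = secrets.token_bytes(5 * BLOCK)
whole = xor_root(key, data)
parts = bytes(a ^ b for a, b in zip(xor_root(key, data[:2048], 0),
                                    xor_root(key, data[2048:], 2048)))
assert whole == parts  # same root, regardless of the 1 kB-aligned split
```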

Note that XORing hashes together weakens the security properties of some hashes. I would not recommend this technique if it has to resist a deliberate attack specifically intended to defeat this algorithm. It may take an expert to assure you that the resulting hash algorithm truly is cryptographically secure.

David Schwartz
  • I'm very doubtful about the security. A variant of the attack in this answer seems possible. – fgrieu Aug 24 '11 at 10:36
  • That particular attack won't quite work, because it's HMACed with the offset. (So you can't easily combine values.) However, it's always possible a similar attack would work. – David Schwartz Aug 24 '11 at 10:59
  • If you use the block offset appended to a secret key as the HMAC key, then an attacker (not knowing the secret key) can't launch an attack of that kind. That assumes that all entities that need to participate in calculating the hashes can be given a secret that no attacker can know. – David Schwartz Aug 24 '11 at 11:10
  • Ah, if you throw in a secret key, that's different. The adversary will need some number of HMAC examples with the same key. If you can somehow prevent the reuse of the key for different files, you might be onto something. – fgrieu Aug 24 '11 at 11:18
  • Yeah, I think you could easily rig such a system. You'd need to know a bit more about the specifics of the use case. The problem is, while it would be sufficiently strong that I can't think of any weaknesses in it, that's a far cry from being confident that it is anywhere near as secure as the underlying hash function. If that is a requirement, it's going to be an expensive problem to solve. – David Schwartz Aug 24 '11 at 11:28
  • I share @fgrieu's concerns about security. If the key is ever reused, this becomes vulnerable to linear-algebra-based attacks. However, it can be fixed if you make each hash output an integer modulo a 2048-bit prime (not a 160-bit value), and if you multiply these values modulo the prime (instead of XORing them). This has been analyzed by Bellare et al. – D.W. Aug 28 '11 at 07:59
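A sketch of that multiplicative fix, in the spirit of the Bellare et al. construction (for brevity this uses the Mersenne prime 2^521 - 1 instead of a properly generated 2048-bit prime; names illustrative):

```python
import hashlib
import hmac
import secrets

P = 2**521 - 1  # a known Mersenne prime; a real deployment would follow the
                # comment and use a properly generated 2048-bit prime
BLOCK = 1024

def mul_root(key, data, base=0):
    # Combine offset-keyed block hashes by multiplication modulo P
    # instead of XOR, thwarting the linear-algebra attacks mentioned above.
    acc = 1
    for off in range(0, len(data), BLOCK):
        d = hmac.new(key + (base + off).to_bytes(8, "big"),
                     data[off:off + BLOCK], hashlib.sha256).digest()
        acc = (acc * int.from_bytes(d, "big")) % P
    return acc

key = secrets.token_bytes(32)
data = secrets.token_bytes(4 * BLOCK)
assert mul_root(key, data) == (mul_root(key, data[:BLOCK], 0) *
                               mul_root(key, data[BLOCK:], BLOCK)) % P
```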