What is the purpose of using different hash functions for the leaves and internals of a hash tree?

Question

I just learned that the THEX hash tree specification, which is widely used in P2P, requires that two different hash functions be used: one for the leaf nodes (hashes of input data) and one for the internal nodes (hashes of hashes).

In order to protect against collisions between leaf hashes and internal hashes, different hash constructs are used to hash the leaf nodes and the internal nodes. The same hash algorithm is used as the basis of each construct, but a single '1' byte in network byte order, or 0x01 is prepended to the input of the internal node hashes, and a single '0' byte, or 0x00 is prepended to the input of the leaf node hashes.
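To make the quoted construct concrete, here is a minimal sketch of the two hash functions. It is not taken from the THEX specification: SHA-1 merely stands in for whatever base hash is used, and the names `leaf_hash` and `internal_hash` are made up for illustration.

```python
import hashlib


def leaf_hash(segment: bytes) -> bytes:
    """Hash a segment of input data: a 0x00 byte is prepended to the input."""
    return hashlib.sha1(b"\x00" + segment).digest()


def internal_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child hashes: a 0x01 byte is prepended to the input."""
    return hashlib.sha1(b"\x01" + left + right).digest()


# Two leaves combined into one parent node.
left = leaf_hash(b"first segment")
right = leaf_hash(b"second segment")
parent = internal_hash(left, right)
```

Because of the differing prefix byte, hashing the bytes `left + right` as a leaf does not give the same value as `parent`.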

By contrast, the proposed (though not yet widely adopted) Simple Merkle Hashes extension for BitTorrent just uses unmodified SHA-1 for all hashes. It's conceivable that this was a trade-off of security for simplicity, but that wasn't mentioned in the proposal.

What is the benefit of using two different hashes in this scheme?

score 21 · Accepted Answer · answered Mar 17 '12 at 15:02


If the same standard hash function were used for both leaves and branch nodes, it would be easy to generate collisions and even second preimages.

For example, let $M$ be a message which is longer than the segment size of the hash tree, but (for simplicity) no more than two segments long. Then the hash value of $M$ is calculated as $$H(M) = H_I(H_L(M_0) \,\|\, H_L(M_1)),$$ where $H_I$ is the internal hash function, $H_L$ is the leaf hash function, and $M_0$ and $M_1$ are the first and second segments of $M$. We'll also assume that the segment size of the hash tree is at least twice as long as the length of the output of $H_L$.

Now let $M' = H_L(M_0) \,\|\, H_L(M_1)$. Since $M'$ is, by assumption, at most one segment long, its hash will be calculated as $$H(M') = H_L(M') = H_L(H_L(M_0) \,\|\, H_L(M_1)).$$ If $H_L = H_I$, then $H(M') = H(M)$, and we've just found a second preimage for $H(M)$.
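The construction above can be checked with a short script. This is only a sketch under stated assumptions: SHA-1 stands in for the base hash, the 1024-byte segment size is arbitrary, and `tree_hash` and `second_preimage` are illustrative names that handle messages of at most two segments.

```python
import hashlib

SEGMENT_SIZE = 1024  # assumed segment size; far more than twice the 20-byte digest


def h(data: bytes) -> bytes:
    """Base hash; SHA-1 is only a stand-in here."""
    return hashlib.sha1(data).digest()


def tree_hash(message: bytes, leaf_prefix: bytes = b"",
              internal_prefix: bytes = b"") -> bytes:
    """Hash a message of at most two segments, as in the example above."""
    segments = [message[i:i + SEGMENT_SIZE]
                for i in range(0, len(message), SEGMENT_SIZE)] or [b""]
    if len(segments) > 2:
        raise ValueError("this sketch only handles one or two segments")
    leaves = [h(leaf_prefix + s) for s in segments]
    if len(leaves) == 1:
        return leaves[0]
    return h(internal_prefix + leaves[0] + leaves[1])


def second_preimage(message: bytes, leaf_prefix: bytes = b"") -> bytes:
    """Build M' = H_L(M_0) || H_L(M_1) from a two-segment message M."""
    m0, m1 = message[:SEGMENT_SIZE], message[SEGMENT_SIZE:]
    return h(leaf_prefix + m0) + h(leaf_prefix + m1)


m = bytes(1500)  # two segments: 1024 + 476 bytes

# Same hash for leaves and internal nodes: M' has the same tree hash as M.
print(tree_hash(m) == tree_hash(second_preimage(m)))

# Distinct 0x00 / 0x01 prefixes: the same trick no longer produces a match.
forged = second_preimage(m, b"\x00")
print(tree_hash(m, b"\x00", b"\x01") == tree_hash(forged, b"\x00", b"\x01"))
```

The first comparison holds because the 40-byte $M'$ fits in a single segment and is hashed exactly like the internal node of $M$; with the prefixes, the two inputs differ in their first byte.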

answered Mar 17 '12 at 15:02

Ilmari Karonen



I realized that the reason BitTorrent's Simple Merkle trees aren't vulnerable to this is that they're padded full by adding 0-leaves (until the number of leaves reaches a power of two). THEX, on the other hand, allows unpaired leaf hashes to "float up" the tree into a spot where an internal hash would otherwise be expected. – Jeremy Mar 17 '12 at 15:20

Just padding the number of segments to a power of 2 doesn't help: note that $M$ and $M'$ in my example have 2 and 1 segments respectively. However, padding the last segment itself to full length would indeed suffice to differentiate leaf hashes from internal hashes, as long as the segment length is strictly greater than the internal hash input length. (But do note that such padding itself, if not done carefully enough, may allow the construction of collisions.) – Ilmari Karonen Mar 17 '12 at 15:32

Oops! Thank you for the clarification. New interpretation: The trees are secure in the context of BitTorrent because it already knows the total length of the input, so it's impossible for an attacker to add or remove leaves as would be necessary. – Jeremy Mar 17 '12 at 15:51
I don't get it. If you are going to apply H to M', then you have to do H_L(H_L(N_0)||H_L(N_1)) (N_0 & N_1 are the left & right halves of M', respectively). That's going to be a different result than H_L(M'). – Melab Jul 10 '23 at 14:15
@Melab: "We'll also assume that the segment size of the hash tree is at least twice as long as the length of the output of $H_L$." – Ilmari Karonen Jul 11 '23 at 16:58

CodesInChaos · Answer 2 · 2012-03-17T16:15:55.813

The torrent tree hash is vulnerable to second pre-image attacks by itself, even with 00 padding. I won't repeat Ilmari Karonen's answer, which already explains that part very well.

But it isn't used to identify the data by itself:

The original publisher of the content-file set creates a so-called Merkle torrent which is a torrent file that contains a root hash key in its info part instead of a pieces key

This means the infohash, which serves as a unique ID for a torrent, isn't based on the root of the tree alone; it also covers the filenames and file sizes, which are kept in the info dictionary.

Knowing the total size of the torrent prevents such attacks. I still don't like their design decision, since it can easily lead to bugs in torrent clients that aren't aware of this issue. For example, if a client forgot to validate the size of a piece and only checked the hash, it would still be vulnerable.
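As a rough sketch of what such a check might look like (this is not code from any real client, and `expected_piece_size` and `verify_piece` are hypothetical names), a client can derive the expected piece size and tree depth from the known total length and reject anything that doesn't match before trusting the hash chain. The sketch assumes unprefixed SHA-1 throughout and a tree whose leaf count is padded to a power of two.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def expected_piece_size(index: int, piece_length: int, total_length: int) -> int:
    """Size a piece must have, given the total length from the info dictionary."""
    num_pieces = -(-total_length // piece_length)  # ceiling division
    if not 0 <= index < num_pieces:
        raise ValueError("piece index out of range")
    if index < num_pieces - 1:
        return piece_length
    return total_length - piece_length * (num_pieces - 1)


def verify_piece(piece: bytes, index: int, piece_length: int,
                 total_length: int, siblings: list, root_hash: bytes) -> bool:
    """Check size and tree depth first, then the hash chain up to the root."""
    if len(piece) != expected_piece_size(index, piece_length, total_length):
        return False  # wrong size: reject even if the hashes would match
    num_pieces = -(-total_length // piece_length)
    if len(siblings) != (num_pieces - 1).bit_length():
        return False  # hash chain does not have the expected depth
    node, pos = sha1(piece), index
    for sibling in siblings:
        node = sha1(node + sibling) if pos % 2 == 0 else sha1(sibling + node)
        pos //= 2
    return node == root_hash


# Two pieces of a 6-byte torrent with 4-byte pieces.
p0, p1 = b"abcd", b"ef"
root = sha1(sha1(p0) + sha1(p1))
print(verify_piece(p0, 0, 4, 6, [sha1(p1)], root))

# A "piece" made of the two leaf hashes hashes straight to the root,
# but it is 40 bytes long and comes with an empty hash chain.
forged = sha1(p0) + sha1(p1)
print(sha1(forged) == root, verify_piece(forged, 0, 4, 6, [], root))
```

A client that compared only `sha1(forged)` against the root would accept the forged piece; the size and depth checks are what reject it here.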

IMO this design is also flawed in non-security-related ways.

In particular, the hash tree still crosses file boundaries. Having one root per file would have been nicer. But I guess they tried to stay as close as possible to the original format.

And they decided to use the leaf-size as the piece-size. I would have chosen a small constant leaf-size and left the piece-size independent of the leaf-size. This would allow changes in piece-size without changing the hash.
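To illustrate the point about leaf-size, here is a small sketch (my own illustration, not the BitTorrent or THEX algorithm verbatim; `merkle_root` is a made-up helper using SHA-1 with 0x00/0x01 prefixes and letting unpaired nodes float up). The root depends on the leaf-size, so tying the leaf-size to the piece-size means that changing the piece-size changes the root.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def merkle_root(data: bytes, leaf_size: int) -> bytes:
    """Root of a hash tree over fixed-size leaves (the last leaf may be short)."""
    level = [sha1(b"\x00" + data[i:i + leaf_size])
             for i in range(0, len(data), leaf_size)] or [sha1(b"\x00")]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(sha1(b"\x01" + level[i] + level[i + 1]))
            else:
                nxt.append(level[i])  # unpaired node floats up unchanged
        level = nxt
    return level[0]


data = bytes(range(256)) * 16  # 4096 bytes of sample data

# Leaf-size tied to piece-size: a different piece-size gives a different root.
print(merkle_root(data, 1024) == merkle_root(data, 2048))

# Constant leaf-size: the root is a function of the data alone, and a client
# is free to group leaves into pieces of any size for transfer.
print(merkle_root(data, 1024) == merkle_root(data, 1024))
```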
