Assuming a $k$-ary Merkle tree:

If the length of a file is known, based on a fixed chunk size, one could calculate the number of chunks and thus the depth of the tree. Then, based on $k$, the chunk index and the depth, one can parallelize the construction of the Merkle tree.
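
For instance, here is a minimal sketch in Python of this known-size case (the function and variable names are illustrative, not from any particular library): it derives the chunk count, the tree depth, and which top-level subtree each chunk index falls under, so the $k$ top-level subtrees could be hashed by independent workers.

    def plan_parallel_build(file_size: int, chunk_size: int, k: int):
        """Chunk count, tree depth, and top-level subtree per chunk index."""
        num_chunks = -(-file_size // chunk_size)      # ceiling division
        depth, capacity = 0, 1
        while capacity < num_chunks:                  # smallest depth with k**depth >= num_chunks
            capacity *= k
            depth += 1
        leaves_per_subtree = capacity // k if depth > 0 else 1

        def subtree_of(chunk_index: int) -> int:
            return chunk_index // leaves_per_subtree

        return num_chunks, depth, subtree_of

    # Example: 1 MiB file, 4 KiB chunks, k = 16
    num_chunks, depth, subtree_of = plan_parallel_build(1 << 20, 4096, 16)
    print(num_chunks, depth, subtree_of(0), subtree_of(num_chunks - 1))  # 256 2 0 15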

But if the file size is not known, i.e., the input is a stream that we read until EOF, what would be an efficient procedure to build the $k$-ary Merkle tree, assuming as usual that leaf nodes contain the chunk hashes while branch nodes contain the hashes of their children?

  • One could first read the input until EOF, then determine the number of chunks from that, and build the tree the same way as described in the question. I wonder if one can just read chunks, send them to a routine in parallel, and build the tree as we go... – unsafe_where_true Mar 26 '17 at 20:29
  • I'd create a scheme with a fixed block size (say, 4 KiB) and $k$ value, say 16 minimum (to allow 16 worker threads to calculate the values). You can always create a new root (i.e., make the tree deeper) when you need more space. In other words, I would make the Merkle tree depth a function of the size (sketched just after these comments). But that's just my idea / opinion. – Maarten Bodewes Mar 26 '17 at 21:53
  • @MaartenBodewes it took me some time to understand what you mean, but I think your basic proposal is what I will need. Thanks. (You can post it as an answer if you like; I didn't post any code either, so I may accept your generic idea, but I'll give myself some time.) – unsafe_where_true Mar 29 '17 at 17:31
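
As a rough illustration of the scheme suggested in the comment above (fixed 4 KiB blocks, $k = 16$, depth growing with the input), here is a small Python sketch that only tracks the *shape* of such a tree; the class and method names are made up for illustration:

    class GrowingTreeShape:
        """Shape (leaf count, capacity, depth) of a K-ary Merkle tree over
        fixed-size blocks (say 4 KiB), deepened on demand."""
        K = 16                                    # branching factor from the comment

        def __init__(self):
            self.num_chunks, self.depth, self.capacity = 0, 0, 1

        def on_new_chunk(self):
            self.num_chunks += 1
            if self.num_chunks > self.capacity:
                # The tree is full: the old root becomes the leftmost child
                # of a brand-new root, i.e. the tree gets one level deeper.
                self.capacity *= self.K
                self.depth += 1

    shape = GrowingTreeShape()
    for _ in range(300):                          # e.g. 300 blocks streamed in
        shape.on_new_chunk()
    print(shape.depth)                            # 3, since 16**2 = 256 < 300 <= 16**3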

1 Answer

This is very simple to achieve. I'll just show you what you have to compute as each chunk $c_1, c_2, c_3, \ldots$ comes in.

You start empty.

You get $c_1$, hash it to obtain $h_1$, and that is your Merkle root. If no more chunks come in, you are done.

    h_1
     |
    c_1

Otherwise, you get $c_2$ and hash it to obtain $h_2$, which you put next to $h_1$. Now you have two leaves; hash them together to get their parent hash $r_{1,2}$:

     r_{1,2}
    /      \
 h_1       h_2
  |         |
 c_1       c_2
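
Concretely, under one common convention (SHA-256 over the concatenation of the children; many deployed designs additionally domain-separate leaf and interior hashes, which the answer does not specify), this step is just:

    import hashlib

    def leaf_hash(chunk: bytes) -> bytes:
        # leaf node: hash of the raw chunk
        return hashlib.sha256(chunk).digest()

    def parent_hash(*children: bytes) -> bytes:
        # interior node: hash of the concatenated child hashes
        return hashlib.sha256(b"".join(children)).digest()

    h1, h2 = leaf_hash(b"chunk 1"), leaf_hash(b"chunk 2")
    r_1_2 = parent_hash(h1, h2)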

Maybe you are done now? If not, you get $c_3$. Now you have a forest:

     r_{1,2}
    /      \
 h_1       h_2     h_3
  |         |       |
 c_1       c_2     c_3

If you are done, you can let your final digest be the hash of all the roots in this forest.

If not, you get $c_4$. Now you can start merging things in your forest by hashing:

            r_{1,4}
        /             \
     r_{1,2}        r_{3,4}
    /      \        /     \
 h_1       h_2     h_3   h_4
  |         |       |     |
 c_1       c_2     c_3   c_4

By now you get the picture:

  • You always append new chunks to the right of your forest.
  • Whenever you have two subtrees of the same size (i.e., same # of leaves), you merge them by Merkle-hashing their roots together.
  • Your file's final digest will be the hash of all roots in your forest (a code sketch follows this list).
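
To make the procedure concrete, here is a minimal Python sketch. Since the walkthrough above merges two equal-sized trees at a time, it uses $k = 2$; the hash function and the combining rule (SHA-256 over plain concatenation) and all names (`StreamingMerkleForest`, `add_chunk`, ...) are assumptions for illustration, not something the answer fixes.

    import hashlib

    def H(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    class Node:
        """A Merkle (sub)tree: its root hash, its children, and its leaf count."""
        def __init__(self, hash_, left=None, right=None, num_leaves=1):
            self.hash, self.left, self.right = hash_, left, right
            self.num_leaves = num_leaves

    class StreamingMerkleForest:
        def __init__(self):
            self.forest = []                       # roots of the trees, left to right

        def add_chunk(self, chunk: bytes):
            node = Node(H(chunk))                  # new single-leaf tree, appended on the right
            # Whenever the rightmost tree has the same number of leaves, merge.
            while self.forest and self.forest[-1].num_leaves == node.num_leaves:
                left = self.forest.pop()
                node = Node(H(left.hash + node.hash), left, node,
                            left.num_leaves + node.num_leaves)
            self.forest.append(node)

        def digest(self) -> bytes:
            # Final digest: hash of all the roots in the forest, left to right.
            return H(b"".join(t.hash for t in self.forest))

    # Usage: read chunks from a stream until EOF.
    forest = StreamingMerkleForest()
    for chunk in (b"c1", b"c2", b"c3", b"c4", b"c5"):
        forest.add_chunk(chunk)
    print(len(forest.forest), forest.digest().hex())   # 2 trees (4 leaves + 1 leaf)

The `Node` objects keep the children around only for illustration; as the answer notes further down, the roots alone suffice for computing the digest.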

$O(1)$ digest. Because you always hash the roots together into a single digest at the end, this approach gives you a constant-sized digest, just like a Merkle tree.

Computationally-optimized. This forest-based approach will minimize the # of hashes you compute.

Why forest? You could also maintain a tree instead of a forest over all chunks received so far, but that will unnecessarily recompute a lot of hashes as new chunks get added "under the tree."

Number of trees? For $n$ chunks, you will have at most $\lfloor \log_2{n} \rfloor + 1$ trees in your forest: one tree per set bit in the binary representation of $n$ (e.g., $n = 13 = 8 + 4 + 1$ gives trees with $8$, $4$ and $1$ leaves).

Space-complexity optimization: Furthermore, you could actually implement this without keeping track of the full trees; just their roots!
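
For example, under the same assumptions as the sketch above, the whole state can be a list of (leaf count, root hash) pairs; the leaf count is kept alongside each root so we know when two trees can be merged (names again illustrative):

    import hashlib

    def H(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def add_chunk(roots, chunk: bytes):
        """roots: list of (num_leaves, root_hash) pairs, largest tree first."""
        size, digest = 1, H(chunk)
        while roots and roots[-1][0] == size:      # equal-sized trees: merge their roots
            other_size, other_digest = roots.pop()
            size, digest = size + other_size, H(other_digest + digest)
        roots.append((size, digest))

    def finalize(roots) -> bytes:
        # Final digest: hash of all remaining roots, left to right.
        return H(b"".join(d for _, d in roots))

With at most $\lfloor \log_2{n} \rfloor + 1$ pairs for $n$ chunks, this needs only logarithmic space.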

Alin Tomescu