Assuming a $k$-ary Merkle tree:

If the length of a file is known, based on a fixed chunk size, one could calculate the number of chunks and thus the depth of the tree. Then, based on $k$, the chunk index and the depth, one can parallelize the construction of the Merkle tree.
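
For instance, here is a minimal sketch in Python of this known-size case (the function and variable names are illustrative, not from any particular library): it derives the chunk count, the tree depth, and which top-level subtree each chunk index falls under, so the $k$ top-level subtrees could be hashed by independent workers.

    def plan_parallel_build(file_size: int, chunk_size: int, k: int):
        """Chunk count, tree depth, and top-level subtree per chunk index."""
        num_chunks = -(-file_size // chunk_size)      # ceiling division
        depth, capacity = 0, 1
        while capacity < num_chunks:                  # smallest depth with k**depth >= num_chunks
            capacity *= k
            depth += 1
        leaves_per_subtree = capacity // k if depth > 0 else 1

        def subtree_of(chunk_index: int) -> int:
            return chunk_index // leaves_per_subtree

        return num_chunks, depth, subtree_of

    # Example: 1 MiB file, 4 KiB chunks, k = 16
    num_chunks, depth, subtree_of = plan_parallel_build(1 << 20, 4096, 16)
    print(num_chunks, depth, subtree_of(0), subtree_of(num_chunks - 1))  # 256 2 0 15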

But if the file size is not known, i.e., the input is a stream that we read until EOF, what would be an efficient procedure to build the $k$-ary Merkle tree, assuming as usual that leaf nodes contain the chunk hashes while branch nodes contain the hashes of their children?

  • One could first read the input until EOF, then determine the number of chunks from that, and build the tree the same way as described in the question. I wonder if one can just read chunks, send them to a routine in parallel, and build the tree as we go... – unsafe_where_true Mar 26 '17 at 20:29
  • I'd create a scheme with a fixed block size (say, 4 KiB) and $k$ value, say 16 minimum (to allow 16 worker threads to calculate the values). You can always create a new root (i.e., make the tree deeper) when you need more space. In other words, I would make the Merkle tree depth a function of the size (sketched just after these comments). But that's just my idea / opinion. – Maarten Bodewes Mar 26 '17 at 21:53
  • @MaartenBodewes it took me some time to understand what you mean, but I think your basic proposal is what I will need. Thanks. (You can post it as an answer if you like; I didn't post any code either, so I may accept your generic idea, but I'll give myself some time.) – unsafe_where_true Mar 29 '17 at 17:31
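
As a rough illustration of the scheme suggested in the comment above (fixed 4 KiB blocks, $k = 16$, depth growing with the input), here is a small Python sketch that only tracks the *shape* of such a tree; the class and method names are made up for illustration:

    class GrowingTreeShape:
        """Shape (leaf count, capacity, depth) of a K-ary Merkle tree over
        fixed-size blocks (say 4 KiB), deepened on demand."""
        K = 16                                    # branching factor from the comment

        def __init__(self):
            self.num_chunks, self.depth, self.capacity = 0, 0, 1

        def on_new_chunk(self):
            self.num_chunks += 1
            if self.num_chunks > self.capacity:
                # The tree is full: the old root becomes the leftmost child
                # of a brand-new root, i.e. the tree gets one level deeper.
                self.capacity *= self.K
                self.depth += 1

    shape = GrowingTreeShape()
    for _ in range(300):                          # e.g. 300 blocks streamed in
        shape.on_new_chunk()
    print(shape.depth)                            # 3, since 16**2 = 256 < 300 <= 16**3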

1 Answer

This is very simple to achieve. I'll just show you what you have to compute as each chunk $c_1, c_2, c_3, \ldots$ comes in.

You start empty.

You get $c_1$, hash it to obtain $h_1$, and that is your Merkle root. If no more chunks come in, you are done.

    h_1
     |
    c_1

Otherwise, you get $c_2$ and hash it to obtain $h_2$, which you put next to $h_1$. Now you have two leaves; hash them together to get their parent hash $r_{1,2}$:

     r_{1,2}
    /      \
 h_1       h_2
  |         |
 c_1       c_2
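
Concretely, under one common convention (SHA-256 over the concatenation of the children; many deployed designs additionally domain-separate leaf and interior hashes, which the answer does not specify), this step is just:

    import hashlib

    def leaf_hash(chunk: bytes) -> bytes:
        # leaf node: hash of the raw chunk
        return hashlib.sha256(chunk).digest()

    def parent_hash(*children: bytes) -> bytes:
        # interior node: hash of the concatenated child hashes
        return hashlib.sha256(b"".join(children)).digest()

    h1, h2 = leaf_hash(b"chunk 1"), leaf_hash(b"chunk 2")
    r_1_2 = parent_hash(h1, h2)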

Maybe you are done now? If not, you get $c_3$. Now you have a forest:

     r_{1,2}
    /      \
 h_1       h_2     h_3
  |         |       |
 c_1       c_2     c_3

If you are done, you can let your final digest be the hash of all the roots in this forest.

If not, you get $c_4$. Now you can start merging things in your forest by hashing:

            r_{1,4}
        /             \
     r_{1,2}        r_{3,4}
    /      \        /     \
 h_1       h_2     h_3   h_4
  |         |       |     |
 c_1       c_2     c_3   c_4

By now you get the picture:

  • You always append new chunks to the right of your forest.
  • Whenever you have two subtrees of the same size (i.e., same # of leaves), you merge them by Merkle-hashing their roots together.
  • Your file's final digest will be the hash of all roots in your forest (a code sketch follows this list).
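
To make the procedure concrete, here is a minimal Python sketch. Since the walkthrough above merges two equal-sized trees at a time, it uses $k = 2$; the hash function and the combining rule (SHA-256 over plain concatenation) and all names (`StreamingMerkleForest`, `add_chunk`, ...) are assumptions for illustration, not something the answer fixes.

    import hashlib

    def H(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    class Node:
        """A Merkle (sub)tree: its root hash, its children, and its leaf count."""
        def __init__(self, hash_, left=None, right=None, num_leaves=1):
            self.hash, self.left, self.right = hash_, left, right
            self.num_leaves = num_leaves

    class StreamingMerkleForest:
        def __init__(self):
            self.forest = []                       # roots of the trees, left to right

        def add_chunk(self, chunk: bytes):
            node = Node(H(chunk))                  # new single-leaf tree, appended on the right
            # Whenever the rightmost tree has the same number of leaves, merge.
            while self.forest and self.forest[-1].num_leaves == node.num_leaves:
                left = self.forest.pop()
                node = Node(H(left.hash + node.hash), left, node,
                            left.num_leaves + node.num_leaves)
            self.forest.append(node)

        def digest(self) -> bytes:
            # Final digest: hash of all the roots in the forest, left to right.
            return H(b"".join(t.hash for t in self.forest))

    # Usage: read chunks from a stream until EOF.
    forest = StreamingMerkleForest()
    for chunk in (b"c1", b"c2", b"c3", b"c4", b"c5"):
        forest.add_chunk(chunk)
    print(len(forest.forest), forest.digest().hex())   # 2 trees (4 leaves + 1 leaf)

The `Node` objects keep the children around only for illustration; as the answer notes further down, the roots alone suffice for computing the digest.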

$O(1)$ digest. Because you always hash the roots together into a single digest at the end, this approach gives you a constant-sized digest, just like a Merkle tree.

Computationally-optimized. This forest-based approach will minimize the # of hashes you compute.

Why forest? You could also maintain a tree instead of a forest over all chunks received so far, but that will unnecessarily recompute a lot of hashes as new chunks get added "under the tree."

Number of trees? For $n$ chunks, you will have at most $\lfloor \log_2{n} \rfloor + 1$ trees in your forest: one tree per set bit in the binary representation of $n$ (e.g., $n = 13 = 8 + 4 + 1$ gives trees with $8$, $4$ and $1$ leaves).

Space-complexity optimization: Furthermore, you could actually implement this without keeping track of the full trees; just their roots!
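
For example, under the same assumptions as the sketch above, the whole state can be a list of (leaf count, root hash) pairs; the leaf count is kept alongside each root so we know when two trees can be merged (names again illustrative):

    import hashlib

    def H(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def add_chunk(roots, chunk: bytes):
        """roots: list of (num_leaves, root_hash) pairs, largest tree first."""
        size, digest = 1, H(chunk)
        while roots and roots[-1][0] == size:      # equal-sized trees: merge their roots
            other_size, other_digest = roots.pop()
            size, digest = size + other_size, H(other_digest + digest)
        roots.append((size, digest))

    def finalize(roots) -> bytes:
        # Final digest: hash of all remaining roots, left to right.
        return H(b"".join(d for _, d in roots))

With at most $\lfloor \log_2{n} \rfloor + 1$ pairs for $n$ chunks, this needs only logarithmic space.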

Alin Tomescu