This is very simple to achieve. I'll just show you what you have to compute as each chunk $c_1, c_2, c_3, \ldots$ comes in.
You start empty.
You get $c_1$, hash it as $h_1$, and obtain your Merkle root.
If no more chunks come in, you are done.
h_1
 |
c_1
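For concreteness, here is what this first step might look like in Python. The description above does not fix a hash function, so SHA-256 and the little helper `H` below are assumptions made for these sketches, not part of the scheme itself (the chunk contents are made up, too).

```python
import hashlib

def H(*parts: bytes) -> bytes:
    # Assumed hash for the sketches in this answer: SHA-256 over the
    # concatenation of its inputs. Any collision-resistant hash works.
    return hashlib.sha256(b"".join(parts)).digest()

c1 = b"first chunk"   # hypothetical chunk contents
h1 = H(c1)            # with a single chunk, h_1 is already your Merkle root
```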
Otherwise, you get $c_2$ and hash it as $h_2$.
You put it next to $h_1$. Now you have two leaves; hash them together to get their parent hash $r_{1,2}$:
  r_{1,2}
  /     \
h_1     h_2
 |       |
c_1     c_2
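Continuing the sketch (same assumed `H` helper and made-up chunks as before), the two-chunk digest is just the parent hash of the two leaf hashes:

```python
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

h1, h2 = H(b"first chunk"), H(b"second chunk")
r12 = H(h1, h2)   # parent hash r_{1,2}: the Merkle root over c_1 and c_2
```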
Maybe you are done now. If not, you get $c_3$ and hash it as $h_3$.
Now you have a forest of two trees: the tree rooted at $r_{1,2}$ and the lone leaf $h_3$.
  r_{1,2}
  /     \
h_1     h_2     h_3
 |       |       |
c_1     c_2     c_3
If you are done, you can let your final digest be the hash of all the roots in this forest (here, the hash of $r_{1,2}$ and $h_3$ together).
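In the running sketch (same assumed `H`), that would be:

```python
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

h1, h2, h3 = H(b"chunk 1"), H(b"chunk 2"), H(b"chunk 3")
r12 = H(h1, h2)          # the only complete pair so far
digest = H(r12, h3)      # hash of all roots currently in the forest
```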
If not, you get $c_4$ and hash it as $h_4$.
Now you can start merging trees in your forest by hashing:
          r_{1,4}
         /       \
   r_{1,2}       r_{3,4}
   /     \       /     \
 h_1     h_2   h_3     h_4
  |       |     |       |
 c_1     c_2   c_3     c_4
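Concretely (same assumed `H` as before), the arrival of $c_4$ triggers two merges: first the two single leaves into $r_{3,4}$, then the two size-2 subtrees into $r_{1,4}$, leaving a single root that can serve directly as the digest, just as in the one-chunk case.

```python
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

h1, h2, h3, h4 = (H(c) for c in (b"chunk 1", b"chunk 2", b"chunk 3", b"chunk 4"))
r12 = H(h1, h2)      # was already merged when c_2 arrived
r34 = H(h3, h4)      # c_4 completes a pair of leaves ...
r14 = H(r12, r34)    # ... and then a pair of size-2 subtrees
digest = r14         # one root left, so it serves as the digest
```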
By now, you get the picture (a code sketch of these rules follows the list):
- You always append new chunks to the right of your forest.
- Whenever you have two subtrees of the same size (i.e., the same number of leaves), you merge them by Merkle-hashing their roots together.
- Your file's final digest will be the hash of all the roots in your forest.
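Here is a minimal streaming hasher that follows these three rules. It is only a sketch under the same assumptions as before (SHA-256 as the hash, no domain separation between leaves and inner nodes, a lone root taken directly as the digest as in the one-chunk case); the class name and chunk values are made up for illustration.

```python
import hashlib
from typing import List, Tuple

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

class MerkleForestHasher:
    """Streaming sketch of the forest approach described above."""

    def __init__(self) -> None:
        # One entry per tree: (root hash, number of leaves under it),
        # ordered left to right, so the rightmost (newest) tree is last.
        self.trees: List[Tuple[bytes, int]] = []

    def update(self, chunk: bytes) -> None:
        # Rule 1: append the new chunk's hash to the right of the forest.
        self.trees.append((H(chunk), 1))
        # Rule 2: while the two rightmost trees have the same leaf count,
        # merge them by hashing their roots together.
        while len(self.trees) >= 2 and self.trees[-1][1] == self.trees[-2][1]:
            (left, n), (right, _) = self.trees[-2], self.trees[-1]
            self.trees[-2:] = [(H(left, right), 2 * n)]

    def digest(self) -> bytes:
        # Rule 3: the final digest is the hash of all remaining roots
        # (or the lone root itself when the forest has a single tree).
        roots = [root for root, _ in self.trees]
        return roots[0] if len(roots) == 1 else H(*roots)

# Usage: reproduces the three-chunk digest computed by hand above.
hasher = MerkleForestHasher()
for chunk in (b"chunk 1", b"chunk 2", b"chunk 3"):
    hasher.update(chunk)
print(hasher.digest().hex())
```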
$O(1)$ digest. Because you always hash the roots together into a single digest at the end, this approach gives you a constant-sized digest, just like a Merkle tree.
Computationally optimized. This forest-based approach minimizes the number of hashes you compute.
Why a forest? You could also maintain a single tree over all chunks received so far instead of a forest, but that would unnecessarily recompute a lot of hashes as new chunks get added "under the tree."
Number of trees? For $n$ chunks, the number of trees in your forest equals the number of 1 bits in the binary representation of $n$, so you will never have more than $\lfloor \log_2{n} \rfloor + 1$ trees.
Space-complexity optimization: furthermore, you can implement this without keeping track of all the trees; you only need their roots!
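One way to realize that (again a sketch under the same assumptions as above, with a made-up class name): keep only the list of roots plus a counter of chunks absorbed so far. The counter's binary representation encodes the tree sizes, so when a new chunk arrives, the number of merges to perform equals the number of trailing 1 bits of the old counter.

```python
import hashlib
from typing import List

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

class RootsOnlyHasher:
    """Space-optimized sketch: stores only the roots and a chunk counter."""

    def __init__(self) -> None:
        self.roots: List[bytes] = []   # leftmost (largest) tree first
        self.count = 0                 # chunks absorbed so far

    def update(self, chunk: bytes) -> None:
        new_root = H(chunk)
        # Each trailing 1 bit of the old count is a tree that now has a
        # same-sized partner, so it gets merged into the new subtree.
        n = self.count
        while n & 1:
            new_root = H(self.roots.pop(), new_root)
            n >>= 1
        self.roots.append(new_root)
        self.count += 1

    def digest(self) -> bytes:
        return self.roots[0] if len(self.roots) == 1 else H(*self.roots)
```

For $n$ chunks this stores at most $\lfloor \log_2{n} \rfloor + 1$ roots plus one integer, matching the bound on the number of trees above.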